H2O4GPU crashing jupyter kernel with message thrust::system::detail::bad_alloc

h2oai / h2o4gpu

H2O4GPU crashing jupyter kernel with message thrust::system::detail::bad_alloc #762

Closed hemenkapadia closed 5 years ago

hemenkapadia commented 5 years ago

A customer is reporting that H2O4GPU models (GBM and ElasticNet) produced the error message "terminate called after throwing an instance of 'thrust::system::detail::bad_alloc'" and eventually crashed the Jupyter notebook kernel. Retrying results in similar behavior.

The instance is a GCP VM with 32 vCPUs and 4 Tesla P4 GPUs. Dataset used is http://kt.ijs.si/elena_ikonomovska/data.html, which has about 116 million records and is 5.76 GB.

Is this related to #311 ?

sh1ng commented 5 years ago

bad_alloc occurs when there is not enough memory. It is not related to #311.

I was able to run GBM on a g3.16xlarge (4 GPUs with 8 GB of RAM each, the same as the P4). Can I get a code example? I have also attached a notebook that works.

Elastic net does indeed fail, at https://github.com/h2oai/h2o4gpu/blob/master/src/gpu/matrix/matrix_dense.cu#L1963, because it attempts to allocate the whole matrix on a single GPU. We need to improve this part.
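To make the single-GPU allocation problem concrete, here is a rough back-of-the-envelope sketch, not code from h2o4gpu. The row count comes from the report above; the column count (13) and the 8-byte value size are assumptions chosen purely for illustration, and the helper names are hypothetical. Real runs also need working memory beyond the matrix itself, so this is only a lower bound.

```python
GIB = 1024 ** 3


def dense_matrix_bytes(n_rows, n_cols, bytes_per_value=8):
    """Size in bytes of a dense n_rows x n_cols matrix."""
    return n_rows * n_cols * bytes_per_value


def fits_on_one_gpu(n_rows, n_cols, gpu_mem_bytes, bytes_per_value=8):
    """True if the matrix alone is no larger than the memory of one GPU."""
    return dense_matrix_bytes(n_rows, n_cols, bytes_per_value) <= gpu_mem_bytes


def rows_per_gpu(n_rows, n_gpus):
    """Rows in the largest shard when rows are split evenly (ceiling division)."""
    return -(-n_rows // n_gpus)


n_rows = 116_000_000  # "about 116 million records" from the report
n_cols = 13           # ASSUMPTION: illustrative column count, not from the thread
gpu_mem = 8 * GIB     # 8 GB per GPU, as mentioned above
n_gpus = 4

whole = dense_matrix_bytes(n_rows, n_cols)
shard = dense_matrix_bytes(rows_per_gpu(n_rows, n_gpus), n_cols)

print(f"whole matrix: {whole / GIB:.1f} GiB, "
      f"fits on one GPU: {fits_on_one_gpu(n_rows, n_cols, gpu_mem)}")
print(f"per-GPU shard: {shard / GIB:.1f} GiB, "
      f"fits on one GPU: {shard <= gpu_mem}")
```

Under these assumed numbers, the whole matrix is larger than one 8 GB device, while a quarter of the rows fits comfortably, which is why splitting the data across GPUs matters here.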

sh1ng commented 5 years ago

Untitled.zip

sh1ng commented 5 years ago

If they use one-hot encoding for categorical columns, it is inefficient and should be avoided.
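As a rough illustration of why one-hot encoding is costly, the sketch below compares it with integer (label) encoding. Integer encoding is only one common alternative and is my assumption; the thread does not say which encoding to use instead. The category values and cardinalities are made up, and the function names are hypothetical.

```python
def label_encode(values):
    """Map each distinct category to an integer code, in first-seen order."""
    codes = {}
    encoded = [codes.setdefault(v, len(codes)) for v in values]
    return encoded, codes


def one_hot_width(cardinalities):
    """Columns produced by one-hot encoding: one per distinct category value."""
    return sum(cardinalities)


def integer_width(cardinalities):
    """Columns produced by integer encoding: one per categorical feature."""
    return len(cardinalities)


# ASSUMPTION: toy data, not taken from the customer's dataset.
carriers = ["AA", "UA", "AA", "DL"]
encoded, mapping = label_encode(carriers)
print(encoded)   # [0, 1, 0, 2]
print(mapping)   # {'AA': 0, 'UA': 1, 'DL': 2}

# ASSUMPTION: hypothetical cardinalities for three categorical features.
cardinalities = [20, 300, 300]
print(one_hot_width(cardinalities), "one-hot columns vs",
      integer_width(cardinalities), "integer-coded columns")
```

With over a hundred million rows, every extra dense column costs a lot of GPU memory, so the column count is what matters. Note that integer codes impose an arbitrary order on the categories; tree-based models such as GBM usually tolerate that, while linear models such as elastic net may not.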

hemenkapadia commented 5 years ago

Hi @sh1ng, I shared the Python notebook with you on Slack.

sh1ng commented 5 years ago

The xgboost part can be solved by setting 'n_gpus': -1. I don't really know why it's not set by default.

xgb_params = {'max_depth': 8,
              'objective': 'binary:logistic',
              'min_child_weight': 30,
              'eta': 0.1,  # learning rate
              'scale_pos_weight': 2,
              'gamma': 0.1,  # min_split_loss
              'reg_lambda': 0.5,  # L2 regularization term
              'tree_method': 'gpu_hist',
              'n_gpus': -1,  # use all available GPUs
              }
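For context, a minimal sketch of how such a parameter dict might be passed to xgboost's native training API. This is not the customer's notebook: the helper names, the `num_boost_round` value, and the inputs `X` and `y` are hypothetical. The 'n_gpus' key belongs to the XGBoost versions in use at the time of this issue, and newer releases may not accept it. The xgboost import is deferred so that building the dict does not require a GPU install.

```python
def make_xgb_params(n_gpus=-1):
    """Build the parameter dict; n_gpus=-1 asks for all available GPUs."""
    return {
        'max_depth': 8,
        'objective': 'binary:logistic',
        'min_child_weight': 30,
        'eta': 0.1,            # learning rate
        'scale_pos_weight': 2,
        'gamma': 0.1,          # min_split_loss
        'reg_lambda': 0.5,     # L2 regularization term
        'tree_method': 'gpu_hist',
        'n_gpus': n_gpus,      # ASSUMPTION: only valid on older XGBoost versions
    }


def train_on_gpus(X, y, num_boost_round=100):
    """Train a booster on features X and binary labels y (hypothetical inputs)."""
    import xgboost as xgb  # deferred: needs a GPU-enabled xgboost build

    dtrain = xgb.DMatrix(X, label=y)
    return xgb.train(make_xgb_params(), dtrain, num_boost_round=num_boost_round)
```

Calling train_on_gpus requires an xgboost build with GPU support and visible CUDA devices; make_xgb_params on its own has no such requirement.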

The elastic net issue is going to be fixed in #763.