h2oai / h2o4gpu

H2Oai GPU Edition
Apache License 2.0

h2o4gpu: Genetic algorithm with Random Forest regression produces error: terminate called after throwing an instance of 'thrust::system::system_error' what(): parallel_for failed: out of memory #789

Open Geerthy11 opened 4 years ago

Geerthy11 commented 4 years ago

I am working on feature selection using a Genetic Algorithm (GA) with a random forest regression model (h2o4gpu.RandomForestRegressor). The number of estimators is 100; the rest of the parameters are left at their defaults. The fitness function for the GA is the RF model's MAE. My dataset is 1.51 MB, with dimensions 4000 × 44. However, after a certain number of iterations (around 30-40), every run of the program fails with one of the following errors:

terminate called after throwing an instance of 'thrust::system::system_error'
  what(): parallel_for failed: out of memory
Aborted (core dumped)

terminate called after throwing an instance of 'dmlc::Error'
  what(): [08:58:38] /workspace/include/xgboost/./../../src/common/common.h:41: /workspace/src/tree/../common/device_helpers.cuh: 422: out of memory
Stack trace:
  [bt] (0) /conda/envs/rapids/xgboost/libxgboost.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x24) [0x7f3f0b07fcb4]
  [bt] (1) /conda/envs/rapids/xgboost/libxgboost.so(+0x3267e2) [0x7f3f0b2a57e2]
  [bt] (2) /conda/envs/rapids/xgboost/libxgboost.so(xgboost::tree::DeviceShard<xgboost::detail::GradientPairInternal >::EvaluateSplits(std::vector<int, std::allocator >, xgboost::RegTree const&, unsigned long)+0x1041) [0x7f3f0b2b48b1]
  [bt] (3) /conda/envs/rapids/xgboost/libxgboost.so(xgboost::tree::DeviceShard<xgboost::detail::GradientPairInternal >::UpdateTree(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal >, xgboost::DMatrix, xgboost::RegTree, dh::AllReducer)+0x131e) [0x7f3f0b2c7dfe]
  [bt] (4) /conda/envs/rapids/xgboost/libxgboost.so(+0x34a201) [0x7f3f0b2c9201]
  [bt] (5) /conda/envs/rapids/bin/../lib/libgomp.so.1(GOMP_parallel+0x42) [0x7f3f1c5bee92]
  [bt] (6) /conda/envs/rapids/xgboost/libxgboost.so(xgboost::tree::GPUHistMakerSpecialised<xgboost::detail::GradientPairInternal >::Update(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal >, xgboost::DMatrix, std::vector<xgboost::RegTree, std::allocator<xgboost::RegTree> > const&)+0x918) [0x7f3f0b2bae98]
  [bt] (7) /conda/envs/rapids/xgboost/libxgboost.so(xgboost::gbm::GBTree::BoostNewTrees(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal >, xgboost::DMatrix, int, std::vector<std::unique_ptr<xgboost::RegTree, std::default_delete >, std::allocator<std::unique_ptr<xgboost::RegTree, std::default_delete > > >)+0xa81) [0x7f3f0b105791]
  [bt] (8) /conda/envs/rapids/xgboost/libxgboost.so(xgboost::gbm::GBTree::DoBoost(xgboost::DMatrix, xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal >, xgboost::ObjFunction)+0xd65) [0x7f3f0b106c95]

Aborted (core dumped)

The following are the specifications:
OS: Ubuntu 16.04.6 LTS
Python: 3.6.8
CUDA: 10.2 / cuDNN: 7.4.1
GPU model: Quadro GV100
NVIDIA Docker version: 18.09.6
RAM: 125 GB
h2o4gpu was installed with pip from the wheel for CUDA 10.0 (https://s3.amazonaws.com/h2o-release/h2o4gpu/releases/stable/ai/h2o/h2o4gpu/0.3-cuda10/h2o4gpu-0.3.2-cp36-cp36m-linux_x86_64.whl)

Kindly suggest how to resolve this issue.

sh1ng commented 4 years ago

Could you provide a code snippet to reproduce it?
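For reference, a reproduction of the setup described in the report (a GA over feature masks, with the MAE of a freshly fitted regressor as fitness) might look roughly like the sketch below. This is not the reporter's code. The function names, GA settings (population size, crossover, mutation rate), and the `make_model` factory are all hypothetical, and the data is handled as plain Python lists so the loop runs without a GPU. To exercise h2o4gpu, one would presumably pass a factory that builds `h2o4gpu.RandomForestRegressor(n_estimators=100)` and feed it NumPy arrays; that wiring is an assumption and is untested here.

```python
import gc
import random


def select_columns(rows, mask):
    """Keep only the columns whose mask bit is 1 (rows is a list of lists)."""
    return [[v for v, keep in zip(row, mask) if keep] for row in rows]


def mae(y_true, y_pred):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)


def fitness(mask, make_model, X_train, y_train, X_val, y_val):
    """MAE of a freshly fitted model on the selected features (lower is better)."""
    if not any(mask):
        return float("inf")  # an empty feature set cannot be scored
    model = make_model()  # hypothetical factory, e.g. one building the RF regressor
    model.fit(select_columns(X_train, mask), y_train)
    score = mae(y_val, model.predict(select_columns(X_val, mask)))
    # Drop the model before the next evaluation so Python can reclaim it.
    # Whether this also frees GPU memory held by h2o4gpu is NOT confirmed here.
    del model
    gc.collect()
    return score


def run_ga(make_model, X_train, y_train, X_val, y_val, n_features,
           pop_size=10, generations=40, mutation_rate=0.05, seed=0):
    """Very small GA: truncation selection, one-point crossover, bit-flip mutation.

    Assumes n_features >= 2 and pop_size >= 2.
    """
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    best_mask, best_score = None, float("inf")
    for _ in range(generations):
        # One model is fitted per individual per generation.
        scored = sorted(
            ((fitness(m, make_model, X_train, y_train, X_val, y_val), m)
             for m in population),
            key=lambda pair: pair[0],
        )
        if scored[0][0] < best_score:
            best_score, best_mask = scored[0]
        parents = [m for _, m in scored[:max(2, pop_size // 2)]]
        children = []
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)
            child = a[:cut] + b[cut:]
            child = [bit ^ 1 if rng.random() < mutation_rate else bit
                     for bit in child]
            children.append(child)
        population = children
    return best_mask, best_score
```

With the default settings above, this fits 400 models in one process (10 individuals × 40 generations), so watching GPU memory with `nvidia-smi` between generations should show whether usage grows steadily until the out-of-memory abort.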