h2oai / h2o-3


XGBoost: "NCCL failure :cuda malloc failed" memory allocation crash on munged BNPParibas #11874

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

h2o-3 crashes with the following stack trace when XGBoost is run on the BNPParibas dataset as munged by autodl 0.9.1. This is with h2o-3 built from [~accountid:557058:389d9607-5bd8-4611-8c6a-755fe9295223]'s branch `mm/xgb_upgrade`, which updates to the latest XGBoost. Note that this is similar but not identical to PUBDEV-4997. The identical model build failed for me a couple of times the PUBDEV-4997 way and once the PUBDEV-4998 way (this issue). :-)

The dataset, log file, and a Python repro script are here: `mr-dl2:/home/rpeck/XGBoost-realloc-crash-2017.10.11`

{quote} [23:21:14] /home/michal/dev/xgboost/dmlc-core/include/dmlc/logging.h:300: [23:21:14] /home/michal/dev/xgboost/src/tree/updater_gpu_hist.cu:286: GPU plugin exception: NCCL failure :cuda malloc failed /home/michal/dev/xgboost/src/tree/updater_gpu_hist.cu(318)

Stack trace returned 7 entries:
[bt] (0) /tmp/libxgboost4j_gpu8406566776870335554.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7f5e65ef043c]
[bt] (1) /tmp/libxgboost4j_gpu8406566776870335554.so(_ZN7xgboost4tree12GPUHistMaker6UpdateERKSt6vectorINS_6detail18bst_gpair_internalIfEESaIS5_EEPNS_7DMatrixERKS2_IPNS_7RegTreeESaISD_EE+0x15e) [0x7f5e661496ee]
[bt] (2) /tmp/libxgboost4j_gpu8406566776870335554.so(_ZN7xgboost3gbm6GBTree13BoostNewTreesERKSt6vectorINS_6detail18bst_gpair_internalIfEESaIS5_EEPNS_7DMatrixEiPS2_ISt10unique_ptrINS_7RegTreeESt14default_deleteISD_EESaISG_EE+0x9ce) [0x7f5e65f1a21e]
[bt] (3) /tmp/libxgboost4j_gpu8406566776870335554.so(_ZN7xgboost3gbm6GBTree7DoBoostEPNS_7DMatrixEPSt6vectorINS_6detail18bst_gpair_internalIfEESaIS7_EEPNS_11ObjFunctionE+0xb50) [0x7f5e65f1b640]
[bt] (4) /tmp/libxgboost4j_gpu8406566776870335554.so(_ZN7xgboost11LearnerImpl13UpdateOneIterEiPNS_7DMatrixE+0x22b) [0x7f5e6609a20b]
[bt] (5) /tmp/libxgboost4j_gpu8406566776870335554.so(XGBoosterUpdateOneIter+0x27) [0x7f5e6606bde7]
[bt] (6) [0x7f5ece19210d]
{quote}
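For anyone triaging traces like the one above, here is a minimal sketch of splitting each `[bt]` frame into its parts so the mangled symbols can be compared across failed runs. The regular expression and the `parse_frame` helper are hypothetical, written against the frame layout in this log only; other XGBoost or dmlc-core builds may print frames differently.

```python
import re

# Hypothetical pattern, based only on the frame layout seen in this log:
#   [bt] (5) /tmp/lib.so(XGBoosterUpdateOneIter+0x27) [0x7f5e6606bde7]
# The "library(symbol+offset)" part is optional because the last frame
# in the trace has only an address: [bt] (6) [0x7f5ece19210d]
FRAME_RE = re.compile(
    r"^\[bt\] \((?P<index>\d+)\) "
    r"(?:(?P<library>\S+?)\((?P<symbol>[^+)]*)\+(?P<offset>0x[0-9a-f]+)\) )?"
    r"\[(?P<address>0x[0-9a-f]+)\]$"
)


def parse_frame(line):
    """Return a dict describing one backtrace frame, or None if the line is not a frame."""
    match = FRAME_RE.match(line.strip())
    if match is None:
        return None
    frame = match.groupdict()
    frame["index"] = int(frame["index"])  # frame number as an integer
    return frame


if __name__ == "__main__":
    sample = (
        "[bt] (5) /tmp/libxgboost4j_gpu8406566776870335554.so"
        "(XGBoosterUpdateOneIter+0x27) [0x7f5e6606bde7]"
    )
    print(parse_frame(sample))
```

The `symbol` values are still mangled C++ names; they can be passed to a demangler such as `c++filt` to make frames like `GPUHistMaker::Update` readable.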

hasithjp commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-4998
Assignee: Rory Mitchell
Reporter: Raymond Peck
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A