h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0
6.87k stars, 1.99k forks

XGBoost: realloc() memory allocation crash on munged BNPParibas #11873

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

h2o-3 crashes with the following stack trace when XGBoost is run on the BNPParibas dataset as munged by autodl 0.9.1. The build is h2o-3 from [~accountid:557058:389d9607-5bd8-4611-8c6a-755fe9295223]'s branch {{mm/xgb_upgrade}}, which updates to the latest XGBoost.

I have saved the dataset, logfile and a repro Python script here: {{mr-dl2:/home/rpeck/XGBoost-realloc-crash-2017.10.11}}

{quote} Error in `java': realloc(): invalid next size: 0x00007fd204280850
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7fd49f0207e5]
/lib/x86_64-linux-gnu/libc.so.6(+0x834aa)[0x7fd49f02c4aa]
/lib/x86_64-linux-gnu/libc.so.6(realloc+0x179)[0x7fd49f02d839]
/usr/lib/x86_64-linux-gnu/libpciaccess.so.0(+0x3c96)[0x7fd400ac8c96]
/usr/lib/x86_64-linux-gnu/libpciaccess.so.0(+0x3ef2)[0x7fd400ac8ef2]
/usr/lib/x86_64-linux-gnu/libpciaccess.so.0(pci_device_get_device_name+0x54)[0x7fd400ac9094]
/usr/lib/nvidia-375/libnvidia-ml.so(+0xf7e6c)[0x7fd3e8184e6c]
/usr/lib/nvidia-375/libnvidia-ml.so(+0xe9106)[0x7fd3e8176106]
/usr/lib/nvidia-375/libnvidia-ml.so(+0xc4cc)[0x7fd3e80994cc]
/usr/lib/nvidia-375/libnvidia-ml.so(nvmlDeviceGetCpuAffinity+0x338)[0x7fd3e80ccb88]
/usr/lib/nvidia-375/libnvidia-ml.so(nvmlDeviceSetCpuAffinity+0x21e)[0x7fd3e80ccfce]
/tmp/libxgboost4j_gpu4238844599563539840.so(_Z28wrapNvmlDeviceSetCpuAffinityP13nvmlDevice_st+0x14)[0x7fd42a154964]
/tmp/libxgboost4j_gpu4238844599563539840.so(ncclCommInitAll+0x49a)[0x7fd42a14d9da]
/tmp/libxgboost4j_gpu4238844599563539840.so(_ZN7xgboost4tree12GPUHistMaker8InitDataERKSt6vectorINS_6detail18bst_gpair_internalIfEESaIS5_EERNS_7DMatrixERKNS_7RegTreeE+0xa27)[0x7fd42a1417a7]
/tmp/libxgboost4j_gpu4238844599563539840.so(_ZN7xgboost4tree12GPUHistMaker10UpdateTreeERKSt6vectorINS_6detail18bst_gpair_internalIfEESaIS5_EEPNS_7DMatrixEPNS_7RegTreeE+0x4e)[0x7fd42a14503e]
/tmp/libxgboost4j_gpu4238844599563539840.so(_ZN7xgboost4tree12GPUHistMaker6UpdateERKSt6vectorINS_6detail18bst_gpair_internalIfEESaIS5_EEPNS_7DMatrixERKS2_IPNS_7RegTreeESaISD_EE+0x8a)[0x7fd42a14961a]
/tmp/libxgboost4j_gpu4238844599563539840.so(_ZN7xgboost3gbm6GBTree13BoostNewTreesERKSt6vectorINS_6detail18bst_gpair_internalIfEESaIS5_EEPNS_7DMatrixEiPS2_ISt10unique_ptrINS_7RegTreeESt14default_deleteISD_EESaISG_EE+0x9ce)[0x7fd429f1a21e]
/tmp/libxgboost4j_gpu4238844599563539840.so(_ZN7xgboost3gbm6GBTree7DoBoostEPNS_7DMatrixEPSt6vectorINS_6detail18bst_gpair_internalIfEESaIS7_EEPNS_11ObjFunctionE+0xb50)[0x7fd429f1b640]
/tmp/libxgboost4j_gpu4238844599563539840.so(_ZN7xgboost11LearnerImpl13UpdateOneIterEiPNS_7DMatrixE+0x22b)[0x7fd42a09a20b]
/tmp/libxgboost4j_gpu4238844599563539840.so(XGBoosterUpdateOneIter+0x27)[0x7fd42a06bde7]
[0x7fd489017b94] {quote}
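
For anyone triaging: glibc prints these frames innermost first, so the trace reads as realloc() aborting under libpciaccess, called from NVML's nvmlDeviceSetCpuAffinity, which is reached from ncclCommInitAll during xgboost::tree::GPUHistMaker::InitData. Below is a minimal sketch for condensing such a trace into (library, symbol) pairs. It assumes the glibc frame layout `path(symbol+offset)[address]`; `FRAME_RE` and `summarize_frames` are hypothetical helper names, not part of h2o-3 or XGBoost.

```python
import re

# Assumed glibc frame layout: /path/to/lib.so(symbol+0xOFFSET)[0xADDRESS]
FRAME_RE = re.compile(
    r"^(?P<lib>/[^\s(]+)\((?P<sym>[^+)]*)\+0x[0-9a-f]+\)\[0x[0-9a-f]+\]$"
)


def summarize_frames(backtrace):
    """Return (library file name, symbol or None) per parsed frame, innermost first."""
    frames = []
    for line in backtrace.splitlines():
        match = FRAME_RE.match(line.strip())
        if match is None:
            continue  # skip headers and frames that carry no library path
        lib = match.group("lib").rsplit("/", 1)[-1]
        frames.append((lib, match.group("sym") or None))
    return frames


# Four frames copied from the report above.
SAMPLE = """\
/lib/x86_64-linux-gnu/libc.so.6(realloc+0x179)[0x7fd49f02d839]
/usr/lib/x86_64-linux-gnu/libpciaccess.so.0(pci_device_get_device_name+0x54)[0x7fd400ac9094]
/usr/lib/nvidia-375/libnvidia-ml.so(nvmlDeviceSetCpuAffinity+0x21e)[0x7fd3e80ccfce]
/tmp/libxgboost4j_gpu4238844599563539840.so(ncclCommInitAll+0x49a)[0x7fd42a14d9da]
"""

for lib, sym in summarize_frames(SAMPLE):
    print(lib, sym)
```

Frames with no exported symbol, such as `libc.so.6(+0x777e5)`, come back with `None` as the symbol. Lines that do not match the layout, such as the header or the final bare address, are skipped.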

hasithjp commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-4997
Assignee: Rory Mitchell
Reporter: Raymond Peck
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A