intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0

AutoML XGBoost model shows abnormal memory usage as the dataset size grows from 100K to 10M rows #7716

Open lalalapotter opened 1 year ago

lalalapotter commented 1 year ago

Tested the AutoML XGBoost Classifier example on the Almaren Yarn cluster (cluster mode) with script-generated sparse datasets ranging from 100,000 rows (0.7GB) to 10 million rows (72GB). Memory usage scales up abnormally as the dataset size grows. The corresponding test results are as follows:
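As a baseline for what "abnormal" means here, the dataset itself grows roughly linearly with the row count. The quick check below uses only the two sizes quoted above; treating 1 GB as 1e9 bytes is an assumption, and `bytes_per_row` is a helper name made up for this illustration.

```python
# Dataset sizes quoted in this report: row count -> total size in bytes.
# Assumption: "GB" is read as decimal gigabytes (1e9 bytes).
GB = 1e9
datasets = {
    100_000: 0.7 * GB,
    10_000_000: 72 * GB,
}


def bytes_per_row(rows, total_bytes):
    """Average footprint of a single row for a dataset of the given size."""
    return total_bytes / rows


for rows, size in datasets.items():
    print(f"{rows:>10,} rows: {bytes_per_row(rows, size):,.0f} bytes/row")
```

Both datasets work out to about 7 KB per row, so any memory growth that is much steeper than 100x between the smallest and largest run is coming from the training pipeline rather than from the data size alone.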

image

Otherwise, the application reports the following error:

(ImplicitFunc pid=18420, ip=172.16.0.135) 2023-02-28 20:16:58,965  ERROR function_runner.py:268 -- Runner Thread raised error.
(ImplicitFunc pid=18420, ip=172.16.0.135) Traceback (most recent call last):
(ImplicitFunc pid=18420, ip=172.16.0.135)   File "/disk2/yarn/nm/usercache/kai/appcache/application_1668477395550_1326/container_1668477395550_1326_01_000004/environment/lib/python3.7/site-packages/ray/tune/function_runner.py", line 262, in run
(ImplicitFunc pid=18420, ip=172.16.0.135)     self._entrypoint()
(ImplicitFunc pid=18420, ip=172.16.0.135)   File "/disk2/yarn/nm/usercache/kai/appcache/application_1668477395550_1326/container_1668477395550_1326_01_000004/environment/lib/python3.7/site-packages/ray/tune/function_runner.py", line 331, in entrypoint
(ImplicitFunc pid=18420, ip=172.16.0.135)     self._status_reporter.get_checkpoint())
(ImplicitFunc pid=18420, ip=172.16.0.135)   File "/disk0/yarn/nm/usercache/kai/appcache/application_1668477395550_1326/container_1668477395550_1326_01_000001/environment/lib/python3.7/site-packages/ray/util/tracing/tracing_helper.py", line 451, in _resume_span
(ImplicitFunc pid=18420, ip=172.16.0.135)   File "/disk0/yarn/nm/usercache/kai/appcache/application_1668477395550_1326/container_1668477395550_1326_01_000001/environment/lib/python3.7/site-packages/ray/tune/function_runner.py", line 597, in _trainable_func
(ImplicitFunc pid=18420, ip=172.16.0.135)   File "/disk0/yarn/nm/usercache/kai/appcache/application_1668477395550_1326/container_1668477395550_1326_01_000001/environment/lib/python3.7/site-packages/bigdl/orca/automl/search/ray_tune/ray_tune_search_engine.py", line 352, in train_func
(ImplicitFunc pid=18420, ip=172.16.0.135)   File "/disk2/yarn/nm/usercache/kai/appcache/application_1668477395550_1326/container_1668477395550_1326_01_000004/environment/lib/python3.7/site-packages/bigdl/orca/automl/xgboost/XGBoost.py", line 158, in fit_eval
(ImplicitFunc pid=18420, ip=172.16.0.135)     self.model.fit(x, y, eval_set=eval_set, eval_metric=metric_name)
(ImplicitFunc pid=18420, ip=172.16.0.135)   File "/disk2/yarn/nm/usercache/kai/appcache/application_1668477395550_1326/container_1668477395550_1326_01_000004/environment/lib/python3.7/site-packages/xgboost/core.py", line 575, in inner_f
(ImplicitFunc pid=18420, ip=172.16.0.135)     return f(**kwargs)
(ImplicitFunc pid=18420, ip=172.16.0.135)   File "/disk2/yarn/nm/usercache/kai/appcache/application_1668477395550_1326/container_1668477395550_1326_01_000004/environment/lib/python3.7/site-packages/xgboost/sklearn.py", line 1397, in fit
(ImplicitFunc pid=18420, ip=172.16.0.135)     enable_categorical=self.enable_categorical,
(ImplicitFunc pid=18420, ip=172.16.0.135)   File "/disk2/yarn/nm/usercache/kai/appcache/application_1668477395550_1326/container_1668477395550_1326_01_000004/environment/lib/python3.7/site-packages/xgboost/sklearn.py", line 457, in _wrap_evaluation_matrices
(ImplicitFunc pid=18420, ip=172.16.0.135)     enable_categorical=enable_categorical,
(ImplicitFunc pid=18420, ip=172.16.0.135)   File "/disk2/yarn/nm/usercache/kai/appcache/application_1668477395550_1326/container_1668477395550_1326_01_000004/environment/lib/python3.7/site-packages/xgboost/sklearn.py", line 1396, in <lambda>
(ImplicitFunc pid=18420, ip=172.16.0.135)     create_dmatrix=lambda **kwargs: DMatrix(nthread=self.n_jobs, **kwargs),
(ImplicitFunc pid=18420, ip=172.16.0.135)   File "/disk2/yarn/nm/usercache/kai/appcache/application_1668477395550_1326/container_1668477395550_1326_01_000004/environment/lib/python3.7/site-packages/xgboost/core.py", line 575, in inner_f
(ImplicitFunc pid=18420, ip=172.16.0.135)     return f(**kwargs)
(ImplicitFunc pid=18420, ip=172.16.0.135)   File "/disk2/yarn/nm/usercache/kai/appcache/application_1668477395550_1326/container_1668477395550_1326_01_000004/environment/lib/python3.7/site-packages/xgboost/core.py", line 692, in __init__
(ImplicitFunc pid=18420, ip=172.16.0.135)     enable_categorical=enable_categorical,
(ImplicitFunc pid=18420, ip=172.16.0.135)   File "/disk2/yarn/nm/usercache/kai/appcache/application_1668477395550_1326/container_1668477395550_1326_01_000004/environment/lib/python3.7/site-packages/xgboost/data.py", line 881, in dispatch_data_backend
(ImplicitFunc pid=18420, ip=172.16.0.135)     feature_types)
(ImplicitFunc pid=18420, ip=172.16.0.135)   File "/disk2/yarn/nm/usercache/kai/appcache/application_1668477395550_1326/container_1668477395550_1326_01_000004/environment/lib/python3.7/site-packages/xgboost/data.py", line 187, in _from_numpy_array
(ImplicitFunc pid=18420, ip=172.16.0.135)     ctypes.byref(handle),
(ImplicitFunc pid=18420, ip=172.16.0.135)   File "/disk2/yarn/nm/usercache/kai/appcache/application_1668477395550_1326/container_1668477395550_1326_01_000004/environment/lib/python3.7/site-packages/xgboost/core.py", line 246, in _check_call
(ImplicitFunc pid=18420, ip=172.16.0.135)     raise XGBoostError(py_str(_LIB.XGBGetLastError()))
(ImplicitFunc pid=18420, ip=172.16.0.135) xgboost.core.XGBoostError: std::bad_alloc
(ImplicitFunc pid=18420, ip=172.16.0.135) Exception in thread Thread-2:
(ImplicitFunc pid=18420, ip=172.16.0.135) Traceback (most recent call last):
(ImplicitFunc pid=18420, ip=172.16.0.135)   File "/disk2/yarn/nm/usercache/kai/appcache/application_1668477395550_1326/container_1668477395550_1326_01_000004/environment/lib/python3.7/threading.py", line 926, in _bootstrap_inner
(ImplicitFunc pid=18420, ip=172.16.0.135)     self.run()
(ImplicitFunc pid=18420, ip=172.16.0.135)   File "/disk2/yarn/nm/usercache/kai/appcache/application_1668477395550_1326/container_1668477395550_1326_01_000004/environment/lib/python3.7/site-packages/ray/tune/function_runner.py", line 281, in run
(ImplicitFunc pid=18420, ip=172.16.0.135)     raise e
(ImplicitFunc pid=18420, ip=172.16.0.135)   File "/disk2/yarn/nm/usercache/kai/appcache/application_1668477395550_1326/container_1668477395550_1326_01_000004/environment/lib/python3.7/site-packages/ray/tune/function_runner.py", line 262, in run
(ImplicitFunc pid=18420, ip=172.16.0.135)     self._entrypoint()
(ImplicitFunc pid=18420, ip=172.16.0.135)   File "/disk2/yarn/nm/usercache/kai/appcache/application_1668477395550_1326/container_1668477395550_1326_01_000004/environment/lib/python3.7/site-packages/ray/tune/function_runner.py", line 331, in entrypoint
(ImplicitFunc pid=18420, ip=172.16.0.135)     self._status_reporter.get_checkpoint())
(ImplicitFunc pid=18420, ip=172.16.0.135)   File "/disk0/yarn/nm/usercache/kai/appcache/application_1668477395550_1326/container_1668477395550_1326_01_000001/environment/lib/python3.7/site-packages/ray/util/tracing/tracing_helper.py", line 451, in _resume_span
(ImplicitFunc pid=18420, ip=172.16.0.135)   File "/disk0/yarn/nm/usercache/kai/appcache/application_1668477395550_1326/container_1668477395550_1326_01_000001/environment/lib/python3.7/site-packages/ray/tune/function_runner.py", line 597, in _trainable_func
(ImplicitFunc pid=18420, ip=172.16.0.135)   File "/disk0/yarn/nm/usercache/kai/appcache/application_1668477395550_1326/container_1668477395550_1326_01_000001/environment/lib/python3.7/site-packages/bigdl/orca/automl/search/ray_tune/ray_tune_search_engine.py", line 352, in train_func
(ImplicitFunc pid=18420, ip=172.16.0.135)   File "/disk2/yarn/nm/usercache/kai/appcache/application_1668477395550_1326/container_1668477395550_1326_01_000004/environment/lib/python3.7/site-packages/bigdl/orca/automl/xgboost/XGBoost.py", line 158, in fit_eval
(ImplicitFunc pid=18420, ip=172.16.0.135)     self.model.fit(x, y, eval_set=eval_set, eval_metric=metric_name)
(ImplicitFunc pid=18420, ip=172.16.0.135)   File "/disk2/yarn/nm/usercache/kai/appcache/application_1668477395550_1326/container_1668477395550_1326_01_000004/environment/lib/python3.7/site-packages/xgboost/core.py", line 575, in inner_f
(ImplicitFunc pid=18420, ip=172.16.0.135)     return f(**kwargs)
(ImplicitFunc pid=18420, ip=172.16.0.135)   File "/disk2/yarn/nm/usercache/kai/appcache/application_1668477395550_1326/container_1668477395550_1326_01_000004/environment/lib/python3.7/site-packages/xgboost/sklearn.py", line 1397, in fit
(ImplicitFunc pid=18420, ip=172.16.0.135)     enable_categorical=self.enable_categorical,
(ImplicitFunc pid=18420, ip=172.16.0.135)   File "/disk2/yarn/nm/usercache/kai/appcache/application_1668477395550_1326/container_1668477395550_1326_01_000004/environment/lib/python3.7/site-packages/xgboost/sklearn.py", line 457, in _wrap_evaluation_matrices
(ImplicitFunc pid=18420, ip=172.16.0.135)     enable_categorical=enable_categorical,
(ImplicitFunc pid=18420, ip=172.16.0.135)   File "/disk2/yarn/nm/usercache/kai/appcache/application_1668477395550_1326/container_1668477395550_1326_01_000004/environment/lib/python3.7/site-packages/xgboost/sklearn.py", line 1396, in <lambda>
(ImplicitFunc pid=18420, ip=172.16.0.135)     create_dmatrix=lambda **kwargs: DMatrix(nthread=self.n_jobs, **kwargs),
(ImplicitFunc pid=18420, ip=172.16.0.135)   File "/disk2/yarn/nm/usercache/kai/appcache/application_1668477395550_1326/container_1668477395550_1326_01_000004/environment/lib/python3.7/site-packages/xgboost/core.py", line 575, in inner_f
(ImplicitFunc pid=18420, ip=172.16.0.135)     return f(**kwargs)
(ImplicitFunc pid=18420, ip=172.16.0.135)   File "/disk2/yarn/nm/usercache/kai/appcache/application_1668477395550_1326/container_1668477395550_1326_01_000004/environment/lib/python3.7/site-packages/xgboost/core.py", line 692, in __init__
(ImplicitFunc pid=18420, ip=172.16.0.135)     enable_categorical=enable_categorical,
(ImplicitFunc pid=18420, ip=172.16.0.135)   File "/disk2/yarn/nm/usercache/kai/appcache/application_1668477395550_1326/container_1668477395550_1326_01_000004/environment/lib/python3.7/site-packages/xgboost/data.py", line 881, in dispatch_data_backend
(ImplicitFunc pid=18420, ip=172.16.0.135)     feature_types)
(ImplicitFunc pid=18420, ip=172.16.0.135)   File "/disk2/yarn/nm/usercache/kai/appcache/application_1668477395550_1326/container_1668477395550_1326_01_000004/environment/lib/python3.7/site-packages/xgboost/data.py", line 187, in _from_numpy_array
(ImplicitFunc pid=18420, ip=172.16.0.135)     ctypes.byref(handle),
(ImplicitFunc pid=18420, ip=172.16.0.135)   File "/disk2/yarn/nm/usercache/kai/appcache/application_1668477395550_1326/container_1668477395550_1326_01_000004/environment/lib/python3.7/site-packages/xgboost/core.py", line 246, in _check_call
(ImplicitFunc pid=18420, ip=172.16.0.135)     raise XGBoostError(py_str(_LIB.XGBGetLastError()))
(ImplicitFunc pid=18420, ip=172.16.0.135) xgboost.core.XGBoostError: std::bad_alloc
(ImplicitFunc pid=18420, ip=172.16.0.135)
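
The traceback ends in `_from_numpy_array` in `xgboost/data.py`, which suggests the data reaches `DMatrix` as a dense NumPy array even though the dataset is sparse. The sketch below is only a back-of-the-envelope comparison of a dense layout against a CSR layout. The column count and the number of non-zeros per row are hypothetical values chosen for illustration, not figures from this test, and `dense_bytes` / `csr_bytes` are helper names made up for this example.

```python
def dense_bytes(rows, cols, itemsize=4):
    """Bytes needed for a dense rows x cols array (float32 by default)."""
    return rows * cols * itemsize


def csr_bytes(rows, nnz_per_row, itemsize=4, index_size=4):
    """Approximate bytes for a CSR matrix.

    Counts the stored values, one column index per value,
    and the row-pointer array of length rows + 1.
    """
    nnz = rows * nnz_per_row
    return nnz * (itemsize + index_size) + (rows + 1) * index_size


# Hypothetical shape and sparsity, NOT taken from the test above.
rows, cols, nnz_per_row = 10_000_000, 1_000, 10

print(f"dense: {dense_bytes(rows, cols) / 1e9:.1f} GB")
print(f"CSR:   {csr_bytes(rows, nnz_per_row) / 1e9:.1f} GB")
```

Under these assumed numbers the dense copy needs about 40 GB while the CSR form needs under 1 GB. XGBoost's `DMatrix` accepts SciPy sparse matrices directly, so keeping the data sparse up to that point may avoid the large allocation; whether the Orca AutoML wrapper passes sparse input through unchanged has not been checked here.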
hkvision commented 1 year ago

@TheaperDeng Take a look?