ray-project / xgboost_ray

Distributed XGBoost on Ray
Apache License 2.0

Does XGBoost_Ray make use of DataIter for the XGBoost External Memory feature? #304

Open daviddwlee84 opened 9 months ago

daviddwlee84 commented 9 months ago

I want to train on data that is larger than the combined memory of the entire cluster.

Specifically, I want to take advantage of the external memory feature described in "Using XGBoost External Memory Version" and "Experimental support for external memory" (both in the xgboost 2.1.0-dev documentation).

I found RayDataIter, but it seems to be used only when xgboost_ray detects a legacy XGBoost version (< 1.5.0, I think, i.e. one without DataIter).

https://github.com/ray-project/xgboost_ray/blob/9081780c5826194b780fdad4dbe6872470527cab/xgboost_ray/matrix.py#L43-L49

https://github.com/ray-project/xgboost_ray/blob/9081780c5826194b780fdad4dbe6872470527cab/xgboost_ray/main.py#L365-L431

It might be better to construct the XGBoost DMatrix from a custom DataIter instead of concatenating all the data at once.

https://github.com/ray-project/xgboost_ray/blob/9081780c5826194b780fdad4dbe6872470527cab/xgboost_ray/main.py#L423

https://github.com/ray-project/xgboost_ray/blob/9081780c5826194b780fdad4dbe6872470527cab/xgboost_ray/main.py#L351-L362