ray-project / xgboost_ray

Distributed XGBoost on Ray
Apache License 2.0
143 stars 34 forks source link

Add support for dask dataframes #99

Closed krfricke closed 3 years ago

krfricke commented 3 years ago

Closes #92

Note that the locality-aware scheduling can currently not be tested reliably. This is due to 2 reasons:

1) We cannot guarantee that dask partitions end up living on a specific node (to set an initial state for redistribution) 2) We cannot guarantee that tasks that determine the dask partition node are co-scheduled with the respective partition.

Once the ray.state.objects() API or something similar comes back, the second option should be easy to enable, and the first option will be easy to confirm.

krfricke commented 3 years ago

Currently it seems that the problem lies with 1) - ray memory shows that each partition lives on the head node after calling persist(). I'll try to find a solution to that.