ray-project / xgboost_ray

Distributed XGBoost on Ray
Apache License 2.0

Sort qid or group column when ingesting data partitions into RayDMatrix for ranking #238

Closed heyitsmui closed 1 year ago

heyitsmui commented 2 years ago

Description: XGBoost expects the qid or group column to be sorted (at the partition level) before the data is passed into a DMatrix. As a small enhancement, we could consider using Ray Data to sort the data as it is ingested into each worker.

Use case: Convenience for users doing ranking with xgboost_ray.

heyitsmui commented 2 years ago

cc @atomic @Yard1

heyitsmui commented 2 years ago

@Yard1 this might be an interesting onboarding task for us. Do you think @atomic could take this on with some guidance from you?

Yard1 commented 2 years ago

Yes, that would be great! Happy to help @atomic, let me assign you.

We probably can just sort it with pandas on each worker right before training begins.
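For illustration, a minimal sketch of what a per-worker pandas sort could look like. This is not the code from the linked PR; the helper name `sort_partition_by_qid` and the column names are assumptions made up for this example, and it only covers the case where the query ID is a column of the partition's DataFrame.

```python
import pandas as pd


def sort_partition_by_qid(df: pd.DataFrame, qid_col: str = "qid") -> pd.DataFrame:
    """Hypothetical helper: sort one data partition by its query ID column.

    A stable sort (mergesort) keeps rows that share a qid in their
    original relative order.
    """
    return df.sort_values(qid_col, kind="mergesort").reset_index(drop=True)


# Toy partition as it might arrive on a single worker (unsorted qid).
partition = pd.DataFrame(
    {
        "qid": [2, 1, 2, 1, 3],
        "feature": [0.5, 0.1, 0.7, 0.3, 0.9],
        "label": [1, 0, 0, 1, 1],
    }
)

sorted_partition = sort_partition_by_qid(partition)

# Once qid is sorted, group sizes can be read off in order of appearance.
group_sizes = sorted_partition.groupby("qid", sort=False).size().tolist()
```

Sorting per partition rather than globally should be sufficient here, since the requirement described above is only that rows of the same query are contiguous within what each worker passes to its DMatrix.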

heyitsmui commented 1 year ago

I believe we can close this, since the PR (https://github.com/ray-project/xgboost_ray/pull/239) is merged and xgboost-ray 1.12 includes this change. Thanks @Yard1 @atomic!