ray-project / xgboost_ray

Distributed XGBoost on Ray
Apache License 2.0

group parameter not being used in RayXGBRanker #186

Closed ramab1988 closed 2 years ago

ramab1988 commented 2 years ago

Hi!

I have been trying to use RayXGBRanker, but it seems like the group parameter is not being considered for building the model.

Here is code to reproduce the issue:

Data preparation

from xgboost_ray import RayXGBRanker, RayParams
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
import numpy as np

seed = 42

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
   X, y, train_size=0.1, random_state=42
)
# Replace the labels with random relevance scores for the ranker
y_train = np.random.rand(len(y_train))

1st inference

group=np.array([[56]])
clf = RayXGBRanker(
   n_jobs=2,
   random_state=seed
)
clf.fit(X_train, y_train, group=group, ray_dmatrix_params={})
pred_ray = clf.predict(X_test[:10])
print(pred_ray)

1st inference result: [-1.4302582 0.80665004 -1.656052 2.8729537 0.23208997 2.1761725 -1.1084297 -2.1263309 5.631699 -4.417554 ]

2nd inference

group=np.array([[20, 10, 26]])
clf = RayXGBRanker(
   n_jobs=2,
   random_state=seed
)
clf.fit(X_train, y_train, group=group, ray_dmatrix_params={})
pred_ray = clf.predict(X_test[:10])
print(pred_ray)

2nd inference result: [-1.4302582 0.80665004 -1.656052 2.8729537 0.23208997 2.1761725 -1.1084297 -2.1263309 5.631699 -4.417554 ]

As we can see, the 1st and 2nd inference runs give identical predictions even though the group parameter changed between them. Could you please let me know if there is a solution to this problem?

Versions
xgboost==1.5.2
xgboost-ray==0.1.6
ray==1.9.2

Thanks! Rama

Yard1 commented 2 years ago

I can reproduce this error. Working on a fix. Thanks for the report!

ramab1988 commented 2 years ago

Thanks!

On a side note, if we don't pass ray_dmatrix_params={} in the clf.fit line, we get the following error. Any idea why?

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-6-7afbfa443f04> in <module>
      5 )
      6 # clf.fit(X_train, y_train, group=group, ray_dmatrix_params={})
----> 7 clf.fit(X_train, y_train, group=group)
      8 pred_ray = clf.predict(X_test[:10])
      9 print(pred_ray)

/opt/conda/lib/python3.6/site-packages/xgboost/core.py in inner_f(*args, **kwargs)
    504         for k, arg in zip(sig.parameters, args):
    505             kwargs[k] = arg
--> 506         return f(**kwargs)
    507 
    508     return inner_f

/opt/conda/lib/python3.6/site-packages/xgboost_ray/sklearn.py in fit(self, X, y, group, qid, sample_weight, base_margin, eval_set, eval_group, eval_qid, eval_metric, early_stopping_rounds, verbose, xgb_model, sample_weight_eval_set, base_margin_eval_set, feature_weights, callbacks, ray_params, _remote, ray_dmatrix_params)
    985                     **ray_dmatrix_params
    986                 }),
--> 987                 **self._ray_get_wrap_evaluation_matrices_compat_kwargs())
    988 
    989         evals_result = {}

/opt/conda/lib/python3.6/site-packages/xgboost/sklearn.py in _wrap_evaluation_matrices(missing, X, y, group, qid, sample_weight, base_margin, feature_weights, eval_set, sample_weight_eval_set, base_margin_eval_set, eval_group, eval_qid, create_dmatrix, enable_categorical, label_transform)
    293         feature_weights=feature_weights,
    294         missing=missing,
--> 295         enable_categorical=enable_categorical,
    296     )
    297 

/opt/conda/lib/python3.6/site-packages/xgboost_ray/sklearn.py in <lambda>(**kwargs)
    983                 create_dmatrix=lambda **kwargs: RayDMatrix(**{
    984                     **kwargs,
--> 985                     **ray_dmatrix_params
    986                 }),
    987                 **self._ray_get_wrap_evaluation_matrices_compat_kwargs())

TypeError: 'NoneType' object is not a mapping

Yard1 commented 2 years ago

Ah, that's an oversight for the ranker. Will be fixed as well. Thank you!
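
For readers hitting the same traceback: the error message matches what Python raises when `None` is unpacked with `**` inside a dict literal, which is what the `**ray_dmatrix_params` line in the traceback does when the argument is left unset. The sketch below reproduces that behavior in isolation and shows one common guard. The helper names `merge_unguarded` and `merge_guarded` are hypothetical, and the guard is only an assumption about how such a default could be handled, not the fix that was actually applied in xgboost_ray.

```python
def merge_unguarded(base_kwargs, extra_params):
    # Mirrors the pattern in the traceback: {**kwargs, **ray_dmatrix_params}.
    # Raises TypeError if extra_params is None.
    return {**base_kwargs, **extra_params}


def merge_guarded(base_kwargs, extra_params=None):
    # Hypothetical guard: treat a missing (None) mapping as an empty dict.
    return {**base_kwargs, **(extra_params or {})}


base = {"data": "X_train", "label": "y_train"}

try:
    merge_unguarded(base, None)
except TypeError as exc:
    print(f"unguarded merge failed: {exc}")

# With the guard, a None argument leaves the base kwargs unchanged.
print(merge_guarded(base, None))
```

Until a fixed release is available, passing `ray_dmatrix_params={}` explicitly, as in the reproduction code above, avoids the unpacking error.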

Yard1 commented 2 years ago

@ramab1988 This will be fixed in the next release. However, you won't be able to use the group parameter. You will have to use the qid parameter instead, which is a vector mapping each row to the group it belongs to. This is also the case for XGBoost-Dask.
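
The relationship between the two parameters can be sketched as follows: `group` lists the size of each query group, while `qid` carries one group ID per row. The helper `group_sizes_to_qid` below is hypothetical (it is not part of xgboost_ray) and assumes the rows are already ordered group by group, as a list of group sizes implies.

```python
def group_sizes_to_qid(group_sizes):
    """Expand per-group sizes into one group ID per row.

    Hypothetical helper: assumes rows are stored contiguously by group,
    in the same order as the sizes are listed.
    """
    qid = []
    for group_id, size in enumerate(group_sizes):
        qid.extend([group_id] * size)
    return qid


# Group sizes from the 2nd inference in the report (flattened to 1-D).
qid = group_sizes_to_qid([20, 10, 26])
print(len(qid))          # one entry per training row
print(qid[:3], qid[-3:])  # IDs of the first and last rows
```

The resulting list has one entry per training row and could then be wrapped with `np.asarray(qid)` and passed as `clf.fit(X_train, y_train, qid=...)` in place of `group=group`, assuming a release that includes the fix described above.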

ramab1988 commented 2 years ago

Great! Thanks a lot for the quick fix. When can we expect the next release?

Yard1 commented 2 years ago

@krfricke I think we can make a release before the end of the week, what do you think?