ray-project / xgboost_ray

Distributed XGBoost on Ray
Apache License 2.0

add multi-label support #298

Open louis-huang opened 9 months ago

louis-huang commented 9 months ago

Hi, I added support for passing the label as a list, so we can read data with multiple labels. This should resolve https://github.com/ray-project/xgboost_ray/issues/286. The new unit tests pass, and all tests in test_matrix.py pass in my local setup. I also verified the change locally by training an XGBoost model on Parquet data, and it works well. So far, it should work well for the Parquet data format. Thank you!
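To illustrate what "label as a list" means for tabular data, here is a minimal pandas sketch. It is not the code in this PR: `split_features_and_labels` is a hypothetical helper, and the column names are made up for the example. It only shows the general idea that a single label name yields a 1-D target while a list of label names yields a 2-D target with one column per label.

```python
import pandas as pd


def split_features_and_labels(df, label):
    """Split a DataFrame into a feature frame and a label array.

    Hypothetical helper for illustration only (not part of xgboost_ray).
    `label` may be a single column name or a list of column names.
    """
    label_cols = [label] if isinstance(label, str) else list(label)
    y = df[label_cols].to_numpy()
    if len(label_cols) == 1:
        # Single label: flatten to the usual 1-D target vector.
        y = y.ravel()
    X = df.drop(columns=label_cols)
    return X, y


# Tiny made-up frame with two features and two binary labels.
df = pd.DataFrame(
    {
        "f0": [0.1, 0.2, 0.3],
        "f1": [1.0, 2.0, 3.0],
        "label_0": [0, 1, 0],
        "label_1": [1, 1, 0],
    }
)

X, y = split_features_and_labels(df, ["label_0", "label_1"])
print(X.shape, y.shape)
```

With a list of labels, the target keeps one column per label, which is the shape a multi-label objective expects.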

louis-huang commented 9 months ago

I verified the change works with the code example below:

import numpy as np
import pandas as pd
from sklearn.datasets import make_multilabel_classification

# Build a small synthetic multi-label dataset (5 binary labels per sample).
n_classes = 5
random_state = 0
X, y = make_multilabel_classification(n_samples=32, n_classes=n_classes, n_labels=3, random_state=random_state)
features = [f"f{i}" for i in range(X.shape[1])]
labels = [f"label_{i}" for i in range(n_classes)]

# Combine features and labels into one frame and write it as Parquet.
X_df = pd.DataFrame(X, columns=features)
y_df = pd.DataFrame(y, columns=labels)
data = pd.concat([X_df, y_df], axis=1)

data.to_parquet("~/Desktop/sample_data/data.parquet")

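from xgboost_ray import RayDMatrix, RayParams, train, RayFileType

# Column names must match the Parquet file written above
# (make_multilabel_classification produces 20 features by default).
n_classes = 5
features = [f"f{i}" for i in range(20)]
labels = [f"label_{i}" for i in range(n_classes)]

# Pass the list of label columns as the label argument.
training_data = "~/Desktop/sample_data"
train_set = RayDMatrix(training_data, labels, columns=features + labels, filetype=RayFileType.PARQUET)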

evals_result = {}
bst = train(
    {
        "objective": "binary:logistic",
        "eval_metric": ["logloss", "error"],
        "random_state": random_state,
    },
    train_set,
    num_boost_round = 1,
    evals_result=evals_result,
    evals=[(train_set, "train")],
    verbose_eval=False,
    ray_params=RayParams(
        num_actors=1,  # Number of remote actors
        cpus_per_actor=1))

#bst.save_model("model.xgb")
#print("Final training error: {:.4f}".format(
#    evals_result["train"]["error"][-1]))

from xgboost_ray import predict
pred_ray = predict(bst, train_set, ray_params=RayParams(num_actors=1))
print(pred_ray)

import xgboost as xgb

# Baseline: the same one-round model trained with plain XGBoost for comparison.
clf = xgb.XGBClassifier(tree_method="hist", n_estimators=1, random_state=0)
clf.fit(X, y)
expected = clf.predict_proba(X)

np.testing.assert_allclose(expected, pred_ray)

heyitsmui commented 9 months ago

@Yard1 can you help take a look when you get a chance? Thanks!

louis-huang commented 7 months ago

Hi @Yard1, may I ask how to fix the lint test? It seems to still be blocking the merge. Thank you!

Yard1 commented 7 months ago

Can you run the ./format.sh script in the root of the repo?

yc2984 commented 3 months ago

@louis-huang, could you please run the above test?