ray-project / xgboost_ray

Distributed XGBoost on Ray
Apache License 2.0

Would you provide a technical architecture diagram? #80

Open Matrix-World opened 3 years ago

Matrix-World commented 3 years ago

In the README.md I only see usage examples and case introductions, but I don't understand how xgboost and Ray are actually combined. I tried to read the source code and draw an architecture diagram myself, but the result was not particularly clear. Could you provide one that illustrates how an application runs end to end? Looking forward to your reply.

krfricke commented 3 years ago

Hi @Matrix-World, we're preparing a short paper/poster for XGBoost-Ray anyway, so we might end up doing something like this eventually. Is there something that you're particularly interested in and find hard to understand?

Matrix-World commented 3 years ago

> Hi @Matrix-World, we're preparing a short paper/poster for XGBoost-Ray anyway, so we might end up doing something like this eventually. Is there something that you're particularly interested in and find hard to understand?

Yes. When will you publish your paper? Also, there is a problem I find very strange: I successfully ran the example script (https://github.com/ray-project/xgboost_ray/blob/master/examples/train_on_test_data.py) and saw CPU and Plasma usage in the Ray dashboard, but when I set gpus_per_actor=1 and num_actors=4, I did not see any GPU in use. No error was reported and the code ran successfully. Can you explain this? Below is my code:

import argparse
import os
import shutil
import time

from xgboost_ray import train, RayDMatrix, RayParams
from xgboost_ray.tests.utils import create_parquet_in_tempdir


def main(fname, num_actors, cpus_per_actor):
    dtrain = RayDMatrix(
        os.path.abspath(fname), label="labels", ignore=["partition"])

    config = {
        "tree_method": "hist",
        "eval_metric": ["logloss", "error"],
    }

    evals_result = {}

    start = time.time()
    bst = train(
        config,
        dtrain,
        evals_result=evals_result,
        ray_params=RayParams(max_actor_restarts=1,
                             num_actors=num_actors,
                             gpus_per_actor=1,
                             cpus_per_actor=cpus_per_actor),
        num_boost_round=50,
        evals=[(dtrain, "train")])

    taken = time.time() - start
    print(f"Train Time Taken: {taken:.2f} seconds")

    bst.save_model("test_data_for_xgboost.xgb")
    print("Final training error: {:.4f}".format(
        evals_result["train"]["error"][-1]))

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--smoke-test",
        action="store_true",
        default=True,  # always use generated smoke-test data in this run
        help="Finish quickly for testing")
    parser.add_argument(
        "--num-actors",
        type=int,
        default=4,
        help="Sets number of xgboost workers to use.")
    parser.add_argument(
        "--cpus-per-actor",
        type=int,
        default=1,
        help="Sets number of CPUs per xgboost training worker.")
    args = parser.parse_args()

    temp_dir, path = None, None
    if args.smoke_test:
        temp_dir, path = create_parquet_in_tempdir(
            "smoketest.parquet",
            num_rows=1_000_000,
            num_features=36,
            num_classes=2,
            num_partitions=20)
    else:
        path = os.path.join(os.path.dirname(__file__), "parted.parquet")

    import ray
    ray.init(
        num_cpus=args.num_actors,
        dashboard_host="10.3.68.117",
        dashboard_port=8888)

    start = time.time()
    main(path, args.num_actors, args.cpus_per_actor)
    taken = time.time() - start
    print(f"Total Time Taken: {taken:.2f} seconds")

    if args.smoke_test:
        shutil.rmtree(temp_dir)
krfricke commented 3 years ago

gpus_per_actor assigns GPUs to the remote workers, but you also have to tell xgboost to actually use them. You can do that by passing "tree_method": "gpu_hist" in the xgboost config.
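
For instance, your script above could be changed like this (a minimal sketch reusing the values from your posted code; "data.parquet" is a placeholder path):

from xgboost_ray import train, RayDMatrix, RayParams

# Same RayDMatrix pattern as in the script above.
dtrain = RayDMatrix("data.parquet", label="labels", ignore=["partition"])

config = {
    # "gpu_hist" builds trees on the GPU; plain "hist" keeps training
    # on the CPU even when GPUs are assigned to the actors.
    "tree_method": "gpu_hist",
    "eval_metric": ["logloss", "error"],
}

bst = train(
    config,
    dtrain,
    ray_params=RayParams(
        max_actor_restarts=1,
        num_actors=4,
        gpus_per_actor=1,  # reserves one GPU per Ray actor
        cpus_per_actor=1),
    num_boost_round=50,
    evals=[(dtrain, "train")])

With that change you should see GPU utilization in the Ray dashboard during training.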

Actually, we should probably throw a warning when GPUs are allocated but gpu_hist is not used. I'll file a short PR for that.
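
For what it's worth, such a check could look roughly like this (a hypothetical sketch, not the actual PR; the helper name is made up):

import warnings

def _maybe_warn_unused_gpus(config, ray_params):
    # Hypothetical helper: warn if GPUs are reserved for the actors
    # but the xgboost config will not actually use them.
    tree_method = config.get("tree_method", "auto")
    if ray_params.gpus_per_actor > 0 and not tree_method.startswith("gpu"):
        warnings.warn(
            f"gpus_per_actor={ray_params.gpus_per_actor} reserves GPUs, "
            f"but tree_method={tree_method!r} trains on the CPU. "
            "Pass tree_method='gpu_hist' to make use of the GPUs.")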

krfricke commented 3 years ago

As for the paper, it will probably take a couple of weeks.

penolove commented 2 years ago

Are there any updates about this?