ray-project / xgboost_ray

Distributed XGBoost on Ray

single-node xgboost-ray is 3.5x slower than single-node vanilla xgboost #230

Closed: faaany closed this issue 2 years ago

faaany commented 2 years ago

Hi guys, I am trying to speed up my xgboost training with xgboost-ray because of its great features, such as multi-node training and distributed data loading. However, my experiment shows that training with vanilla xgboost is 3.5x faster than training with xgboost-ray on a single node.

My configuration is as follows:

XGBoost-Ray version: 0.1.10, XGBoost version: 1.6.1, Ray version: 1.13.0

Training data size: 18 GB, with 51,995,904 rows and 174 columns. Cluster machine: 160 cores and 500 GB of RAM.

Ray config: 1 actor and 154 cpus_per_actor. Ray is initialized with ray.init(); the Ray cluster is set up manually and the nodes are connected via Docker Swarm (which might not be relevant for this issue).
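
For context, this is roughly what the setup described above looks like in code. It is only a sketch: the data path, label column, and xgboost parameters are placeholders, not the actual training script.

import ray
from xgboost_ray import RayDMatrix, RayParams, train

ray.init()  # the Ray cluster itself is set up manually, as described above

# Placeholder data source; the real dataset is ~52M rows x 174 columns (18 GB).
dtrain = RayDMatrix("train.parquet", label="target")

bst = train(
    {"objective": "binary:logistic", "tree_method": "hist"},  # placeholder params
    dtrain,
    num_boost_round=100,
    ray_params=RayParams(num_actors=1, cpus_per_actor=154),
)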

Please let me know if there is more info needed from your side. Thanks a lot!

krfricke commented 2 years ago

Just to be sure, a few questions:

  1. Are you comparing vanilla xgboost on 1 node with 160 cores vs. xgboost-ray on 1 node with 160 cores, or on 2 nodes with 160 cores each? (I assume it's the latter.)
  2. What is the absolute time training takes?
  3. Where does the training data live? Is it one file or multiple files?

It would be great if you could post (parts of) your training script so we can look into how to optimize the data loading part of training.

Just as a test for your next run, can you set the n_params parameter of xgboost (in both xgboost-ray and vanilla xgboost) to the number of CPUs (e.g. 154 if using 154 cpus_per_actor)?
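
For illustration, a minimal, hypothetical sketch of what explicitly pinning the thread count could look like, assuming the parameter meant here is xgboost's nthread (see the clarifying question further down); the stand-in data and the other parameters are placeholders:

import numpy as np
import ray
import xgboost as xgb
from xgboost_ray import RayDMatrix, RayParams, train as ray_train

params = {"objective": "binary:logistic", "tree_method": "hist", "nthread": 154}

# Stand-in data; the real workload is far larger.
X = np.random.rand(100_000, 174).astype(np.float32)
y = np.random.randint(0, 2, size=100_000)

# Vanilla xgboost: nthread passed via the params dict.
bst = xgb.train(params, xgb.DMatrix(X, label=y), num_boost_round=100)

# xgboost-ray: same params dict, with cpus_per_actor matching nthread.
ray.init()
ray_bst = ray_train(
    params,
    RayDMatrix(X, y),
    num_boost_round=100,
    ray_params=RayParams(num_actors=1, cpus_per_actor=154),
)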

krfricke commented 2 years ago

Quick update: I've run a benchmark on a single-node, 16-CPU instance with 10 GB of data and can't find any difference in training time with a single worker.

Vanilla xgboost

Running xgboost training benchmark...
run_xgboost_training takes 47.01420674300016 seconds.
Results: {'training_time': 47.01420674300016, 'prediction_time': 0.0}

XGBoost-Ray

UserWarning: `num_actors` in `ray_params` is smaller than 2 (1). XGBoost will NOT be distributed!
run_xgboost_training takes 43.593930259000444 seconds.
Results: {'training_time': 43.593930259000444, 'prediction_time': 0.0}

Note that this is with local data. I'll benchmark with two nodes next (possibly only next week, though).

Yard1 commented 2 years ago

I've walked through it with @Faaany, and it seems that for their use case, running 5 actors with 30 CPUs each on a single node gives a major speedup over 1 actor with 154 CPUs, though it is still not as fast as vanilla XGBoost.

For 2 nodes, using 9 actors with 32 CPUs per actor is around 15% faster than vanilla XGBoost.

It seems that there is an upper ceiling of CPUs per actor, and adding more CPUs beyond that gives diminishing returns compared to adding more actors. It's still not clear why vanilla XGBoost is so fast, though.
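
For reference, the two single-node configurations being compared correspond roughly to the following RayParams; only num_actors and cpus_per_actor are taken from the numbers above:

from xgboost_ray import RayParams

# Original configuration: one large actor using almost the whole node.
one_big_actor = RayParams(num_actors=1, cpus_per_actor=154)

# Faster split on the same 160-core node: several smaller actors.
several_small_actors = RayParams(num_actors=5, cpus_per_actor=30)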

@Faaany, please confirm when you have a moment :)

xwjiang2010 commented 2 years ago

@Yard1 is there a script that resembles @Faaany's use case?

faaany commented 2 years ago

@Yard1 is correct. Regarding @krfricke 's questions at the beginning, here are the answers:

  1. I am comparing vanilla xgboost on 1 node with 160 cores vs. xgboost-ray on 1 node with 160 cores (I mistyped the number of actors in the original post; I used 1 actor for single-node xgboost-ray).
  2. The absolute training time is about 8 min for vanilla xgboost vs. about 35 min for xgboost-ray.
  3. The training data lives on a local hard disk and is mounted into the Docker container for training.

Do you mean the nthread param?

faaany commented 2 years ago

@xwjiang2010 my code is actually very simple, not that different from the xgboost-ray example code. I agree with @Yard1's statement that there might be an upper ceiling of CPUs per actor. For a machine with 160 CPUs, it is not a good idea to set the number of actors to 1 and the number of CPUs per actor close to 160.

On the other hand, the hardware I am using might not be that typical for other xgboost-ray users. So you may close this issue if you want.

faaany commented 2 years ago

As stated above, I need to tune the number of actors and the number of CPUs per actor to get runtime performance comparable to vanilla xgboost on my machine with 160 cores.
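
A hypothetical sketch of what such tuning could look like as a simple sweep over (num_actors, cpus_per_actor); the candidate splits and the stand-in data are illustrative only:

import time
import numpy as np
import ray
from xgboost_ray import RayDMatrix, RayParams, train

ray.init()

# Stand-in data; the real workload is ~52M rows x 174 columns.
X = np.random.rand(100_000, 174).astype(np.float32)
y = np.random.randint(0, 2, size=100_000)

for num_actors, cpus_per_actor in [(1, 154), (5, 30), (10, 15)]:  # illustrative splits
    dtrain = RayDMatrix(X, y)
    start = time.perf_counter()
    train(
        {"objective": "binary:logistic", "tree_method": "hist"},
        dtrain,
        num_boost_round=100,
        ray_params=RayParams(num_actors=num_actors, cpus_per_actor=cpus_per_actor),
    )
    print(f"{num_actors} actors x {cpus_per_actor} CPUs/actor: "
          f"{time.perf_counter() - start:.1f} s")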