marius-team / marius

Large scale graph learning on a single machine.
https://marius-project.org
Apache License 2.0

[Question] MariusGNN-EuroSys23: inconsistent CPU usage between config and running time behaviour #133

Closed by initzhang 1 year ago

initzhang commented 1 year ago

Hi, thanks for this excellent work!

I am trying to reproduce some results of MariusGNN and have run into a problem with CPU usage and the configuration. Specifically, no matter what value I assign to OMP_NUM_THREADS, training consumes up to 50% of all CPU cores. For example, using the default configuration for memory-based training on Ogbn-Papers100M, on my machine (with 64 CPU cores), the training CPU usage is as high as 3200% even though the number of threads is set to 8. I have tried setting batch_loader_threads, batch_transfer_threads, compute_threads, gradient_transfer_threads, and gradient_update_threads, but the CPU usage is still as high as 3200%. Have you experienced this problem?

I would be very grateful if you could help with this. Thank you!

rogerwaleffe commented 1 year ago

Thanks for your question! It is a bit unclear to me whether you are aiming for higher than 3200% CPU usage or lower than 3200%. Either way, both should be achievable.

You are on the right track by trying to tune OMP_NUM_THREADS and the pipeline configuration options. To minimize CPU usage, you can set OMP_NUM_THREADS to 1 and all the thread variables in the pipeline config to 1 as well (or use sync training). To maximize CPU usage, a bit more tuning is generally required, but I would try increasing OMP_NUM_THREADS, batch_loader_threads, and batch_transfer_threads. The number of compute_threads should always be one, and the number of gradient threads doesn't matter for node classification on Papers100M. It may also be necessary to raise the staleness_bound to increase the CPU usage.
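As a rough sketch of the "minimize CPU usage" setup described above: the snippet below only collects the settings in one place. The key names are the ones mentioned in this thread; the flat dict layout and the helper name `minimal_cpu_settings` are assumptions for illustration, not the actual Marius config schema, so the values would still need to be copied into the real config file by hand.

```python
import os

# Thread-related option names as they appear in this thread. Where exactly
# they sit in the Marius config file is not shown here (assumption: a flat
# dict is enough to illustrate the idea).
PIPELINE_THREAD_KEYS = [
    "batch_loader_threads",
    "batch_transfer_threads",
    "compute_threads",
    "gradient_transfer_threads",
    "gradient_update_threads",
]


def minimal_cpu_settings():
    """Return (env, pipeline) with every thread count set to 1."""
    # Copy the current environment and force a single OpenMP thread.
    env = dict(os.environ)
    env["OMP_NUM_THREADS"] = "1"
    # One thread for every pipeline stage named in the thread.
    pipeline = {key: 1 for key in PIPELINE_THREAD_KEYS}
    return env, pipeline


if __name__ == "__main__":
    env, pipeline = minimal_cpu_settings()
    print("OMP_NUM_THREADS =", env["OMP_NUM_THREADS"])
    for key, value in pipeline.items():
        print(f"{key}: {value}")
```

The returned `env` could be passed to whatever launches training (for example via `subprocess.run(..., env=env)`), so that OMP_NUM_THREADS is set before the training process starts.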

A couple of questions which might help me give a bit more detailed information:

  1. Are you running this training on a machine with a GPU?
  2. What branch were you using for training? We have been in the process of upgrading the performance of the main branch (see this PR) but haven't finalized some build issues with this PR yet. I'd expect the CPU usage will be higher with these improvements.

initzhang commented 1 year ago

Hi @rogerwaleffe, thanks for your reply!

I actually want to control the CPU usage fairly precisely, for a fair comparison between MariusGNN and other frameworks. For example, I expect to limit CPU usage to at most 800% for all systems by setting OMP_NUM_THREADS to 8, but I didn't manage to do so for MariusGNN.

  1. yes, I am running MariusGNN under the single-GPU setting
  2. I am using the eurosys_2023_artifact branch on the c2c424 commit

I will try the later updates on the branch :) but could you explain how the staleness_bound influences CPU usage? In addition, what is the relationship between the global number of OMP threads and the other thread counts specified by batch_loader_threads etc.?

rogerwaleffe commented 1 year ago

Ahh okay, I understand the goal now.

So the issue here is that MariusGNN applies OMP_NUM_THREADS to each batch loader thread. E.g. if you set batch_loader_threads=4 and OMP_NUM_THREADS=8, then you should expect up to 8*4=32 threads running in parallel (as each loader will call OpenMP for loops, each of which can spin up 8 threads). The batch_transfer_threads and compute_threads may also use some CPU resources, but much less than the batch_loader_threads.

So if you would like to limit CPU usage to 800%, I would try either 1) batch_loader_threads * OMP_NUM_THREADS < 8 or 2) batch_loader_threads * OMP_NUM_THREADS + batch_transfer_threads + compute_threads < 8.
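The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the rule of thumb from this comment, not code from MariusGNN; the function names `peak_cpu_threads` and `loader_settings_within` are made up for the example, and the strict `<` follows the inequality as written above.

```python
def peak_cpu_threads(batch_loader_threads, omp_num_threads,
                     batch_transfer_threads=0, compute_threads=0):
    """Worst-case number of busy CPU threads under the rule of thumb:
    every batch loader can spin up OMP_NUM_THREADS OpenMP threads, and the
    transfer/compute threads add (at most) one thread each."""
    return (batch_loader_threads * omp_num_threads
            + batch_transfer_threads + compute_threads)


def loader_settings_within(budget, batch_transfer_threads=1, compute_threads=1):
    """List (batch_loader_threads, OMP_NUM_THREADS) pairs whose worst case
    stays strictly below `budget` threads (option 2 above)."""
    fits = []
    for loaders in range(1, budget + 1):
        for omp in range(1, budget + 1):
            total = peak_cpu_threads(loaders, omp,
                                     batch_transfer_threads, compute_threads)
            if total < budget:
                fits.append((loaders, omp))
    return fits


if __name__ == "__main__":
    # The example from this thread: 4 loaders with OMP_NUM_THREADS=8.
    print(peak_cpu_threads(4, 8))
    # Candidate settings for an 800% (8-thread) cap.
    print(loader_settings_within(8))
```

For an 8-thread cap with one transfer thread and one compute thread, this leaves only combinations where batch_loader_threads * OMP_NUM_THREADS is at most 5, such as (1, 5) or (2, 2).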

With regard to the staleness_bound: this parameter controls how many batches can be in the pipeline at once. E.g. if you have a staleness bound of 1, then even if you have 2 batch loader threads which are meant to be preparing batches in parallel, only one of them will be able to run, as only one batch is allowed in the pipeline. This can limit CPU usage if you were expecting two batch loader threads to run in parallel. In general, to maximize utilization the staleness bound should be "high enough", but this shouldn't be an issue for you since you are trying to cap CPU usage. The default of 64 should be okay. You can limit CPU usage as described above.
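To make the staleness bound example concrete, here is a toy model in Python. It is not how MariusGNN is implemented; it simply treats the bound as a semaphore that a loader must hold while its batch is "in the pipeline", and records the peak number of batches in flight. The function name `peak_in_flight` and the sleep-based fake work are assumptions for illustration.

```python
import threading
import time


def peak_in_flight(loader_threads, staleness_bound, batches=8, work_s=0.01):
    """Run `loader_threads` toy loaders over `batches` batches and return
    the largest number of batches that were in flight at the same time."""
    slots = threading.Semaphore(staleness_bound)  # one slot per allowed batch
    lock = threading.Lock()
    state = {"in_flight": 0, "peak": 0, "remaining": batches}

    def loader():
        while True:
            # Claim the next batch, or stop when none are left.
            with lock:
                if state["remaining"] == 0:
                    return
                state["remaining"] -= 1
            # A batch may only enter the pipeline if a slot is free.
            with slots:
                with lock:
                    state["in_flight"] += 1
                    state["peak"] = max(state["peak"], state["in_flight"])
                time.sleep(work_s)  # stand-in for preparing the batch
                with lock:
                    state["in_flight"] -= 1

    threads = [threading.Thread(target=loader) for _ in range(loader_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["peak"]


if __name__ == "__main__":
    # Two loaders but a bound of 1: only one batch is ever in flight.
    print(peak_in_flight(loader_threads=2, staleness_bound=1))
    # With a generous bound, the loader count becomes the limit instead.
    print(peak_in_flight(loader_threads=2, staleness_bound=64))
```

In this model the number of loaders actually working at once can never exceed min(loader_threads, staleness_bound), which is the effect described above.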

Let me know if the above helps. I would also try your experiments on the latest eurosys_2023_artifact commit. The example config for Papers100M on the latest commit is available here and is only slightly different than the one for the c2c424 commit I believe.

initzhang commented 1 year ago

Hi @rogerwaleffe, thanks for the reply!

I think I understand the usage problem now, many thanks for your explanation!