interTwin-eu / itwinai

Advanced AI workflows for digital twin applications in science.
https://itwinai.readthedocs.io
MIT License

Updated scalability report (more comprehensive and easier to use) #221

Open jarlsondre opened 3 weeks ago

jarlsondre commented 3 weeks ago

Summary

The scalability report that you get from itwinai scalability_report shows the relative speedup between running a job on a single node and running it on multiple nodes. However, this report is not very comprehensive, and the user experience has some rough edges.
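To make the starting point concrete, here is a minimal sketch of how such a relative-speedup figure can be computed from average epoch times (the timings below are made-up placeholders, not real measurements):

```python
# Relative speedup and scaling efficiency from wall-clock epoch times.
# The numbers are fabricated placeholders for illustration only.
avg_epoch_time = {1: 100.0, 2: 55.0, 4: 30.0}  # nodes -> avg epoch time (s)

baseline = avg_epoch_time[1]
for nodes, t in sorted(avg_epoch_time.items()):
    speedup = baseline / t        # relative to the single-node run
    efficiency = speedup / nodes  # 1.0 would be ideal linear scaling
    print(f"{nodes} node(s): speedup={speedup:.2f}, efficiency={efficiency:.2f}")
```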

Metrics that should be included

Other improvements

jarlsondre commented 2 weeks ago

Suggested solution

Based on the literature around Horovod, DDP and DeepSpeed (as well as a couple of others, like OneFlow), it seems that most papers focus on throughput, measured as samples/sec or FLOP/s, and that a select few (e.g. Horovod) place some emphasis on GPU utilization and time spent on communication vs. computation. Since wall-clock time should be inversely proportional to throughput and is much easier to measure, I suggest we measure wall-clock time and derive throughput from it.
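As a minimal sketch of what deriving throughput from wall-clock time could look like (the function and variable names here are placeholders of mine, not itwinai code):

```python
import time

def epoch_throughput(train_one_epoch, n_samples):
    """Run one epoch and return its throughput in samples/sec."""
    start = time.perf_counter()
    train_one_epoch()  # placeholder for the actual training loop
    elapsed = time.perf_counter() - start
    return n_samples / elapsed
```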

Note: Any data displayed in the following plots will be completely fabricated by me, so don't read into the numbers.

Throughput

We measure the scalability of throughput in two ways:

Communication vs. Computation

We measure communication vs. computation as a score from 0 to 1, where 0 means all the time was spent on communication and 1 means all the time was spent on computation. An example can be seen here:

[draft plot: communication vs. computation score]

Note: The numbers 4, 8 and 16 refer to the number of GPUs in this plot. This is just a draft :)
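As a rough illustration of how such a score could be computed, here is a sketch using torch.profiler; identifying communication by matching NCCL kernel names is a heuristic of mine, not the actual itwinai implementation:

```python
from torch.profiler import profile, ProfilerActivity

# Estimate the score (0 = all communication, 1 = all computation)
# from a short profiled run. train_a_few_steps is a placeholder.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    train_a_few_steps()

comm_time = total_time = 0.0
for evt in prof.key_averages():
    t = evt.cuda_time_total  # microseconds of GPU time for this op
    total_time += t
    # Treat NCCL collectives (all-reduce, all-gather, ...) as communication.
    if "nccl" in evt.key.lower():
        comm_time += t

score = 1.0 - comm_time / total_time if total_time else float("nan")
print(f"computation score: {score:.2f}")
```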

GPU Utilization

Two key metrics:

matbun commented 2 weeks ago

I usually think of GPU utilization as the % of GPU in use returned by nvidia-smi (or similar), which has no unit of measurement, but could be converted into FLOP/s knowing the GPU's peak FLOP/s.
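For reference, a minimal sketch of reading that utilization figure programmatically with pynvml, including the crude conversion to FLOP/s mentioned above (the peak value is a hypothetical number you would look up for your specific GPU):

```python
import pynvml

# Hypothetical peak: e.g. FP16 tensor-core peak from the vendor datasheet.
PEAK_FLOPS = 312e12

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # same % nvidia-smi shows
print(f"GPU utilization: {util.gpu}%")
print(f"rough estimate: {util.gpu / 100 * PEAK_FLOPS:.3e} FLOP/s")
pynvml.nvmlShutdown()
```

Note that this conversion is only a rough estimate: the utilization % reports how busy the GPU was over the sampling window, not how close the active kernels were to peak FLOP/s.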

On the other hand, what you measure with the profiler gives you a breakdown of the compute time (communication, ops, I/O), which is useful for studying scalability and finding bottlenecks (for when we'll want to take a more "active" attitude towards scalability).

Wall-clock time gives a nice overview (e.g., avg epoch time).

Regarding the report, I would suggest having a look at the TensorBoard profiler integration (https://pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html). Maybe it is not useful for us, but it's worth trying.

Also, other interesting references: