interTwin-eu / itwinai

Advanced AI workflows for digital twin applications in science.
https://itwinai.readthedocs.io
MIT License

Updated scalability report (more comprehensive and easier to use) #221

Open jarlsondre opened 3 weeks ago

jarlsondre commented 3 weeks ago

Summary

The scalability report that you get from itwinai scalability_report shows the relative speedup between running a job on a single node and running it on multiple nodes. However, this report is not very comprehensive, and the user experience has some rough edges.
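To make the starting point concrete, here is a minimal sketch of how such a relative-speedup figure can be computed from average epoch times (the timings below are made-up placeholders, not real measurements):

```python
# Relative speedup and scaling efficiency from wall-clock epoch times.
# The numbers are fabricated placeholders for illustration only.
avg_epoch_time = {1: 100.0, 2: 55.0, 4: 30.0}  # nodes -> avg epoch time (s)

baseline = avg_epoch_time[1]
for nodes, t in sorted(avg_epoch_time.items()):
    speedup = baseline / t        # relative to the single-node run
    efficiency = speedup / nodes  # 1.0 would be ideal linear scaling
    print(f"{nodes} node(s): speedup={speedup:.2f}, efficiency={efficiency:.2f}")
```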

Metrics that should be included

Other improvements

jarlsondre commented 2 weeks ago

Suggested solution

Based on the literature around Horovod, DDP and DeepSpeed (as well as a couple of others, like OneFlow), it seems that most papers focus on throughput, measured as samples/sec or FLOP/s, and that a select few (e.g. Horovod) place some emphasis on GPU utilization and time spent on communication vs. computation. Since wall-clock time should be inversely proportional to throughput and is much easier to measure, I suggest we measure wall-clock time and derive throughput from it.
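As a minimal sketch of what deriving throughput from wall-clock time could look like (the function and variable names here are placeholders of mine, not itwinai code):

```python
import time

def epoch_throughput(train_one_epoch, n_samples):
    """Run one epoch and return its throughput in samples/sec."""
    start = time.perf_counter()
    train_one_epoch()  # placeholder for the actual training loop
    elapsed = time.perf_counter() - start
    return n_samples / elapsed
```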

Note: Any data displayed in the following plots will be completely fabricated by me, so don't read into the numbers.

Throughput

We measure the scalability of throughput in two ways:

Communication vs. Computation

We measure communication vs. computation as a score from 0 to 1, where 0 means all the time was spent on communication and 1 means all the time was spent on computation. An example can be seen here:

[draft plot: communication vs. computation score]

Note: The numbers 4, 8 and 16 refer to the number of GPUs in this plot. This is just a draft :)
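As a rough illustration of how such a score could be computed, here is a sketch using torch.profiler; identifying communication by matching NCCL kernel names is a heuristic of mine, not the actual itwinai implementation:

```python
from torch.profiler import profile, ProfilerActivity

# Estimate the score (0 = all communication, 1 = all computation)
# from a short profiled run. train_a_few_steps is a placeholder.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    train_a_few_steps()

comm_time = total_time = 0.0
for evt in prof.key_averages():
    t = evt.cuda_time_total  # microseconds of GPU time for this op
    total_time += t
    # Treat NCCL collectives (all-reduce, all-gather, ...) as communication.
    if "nccl" in evt.key.lower():
        comm_time += t

score = 1.0 - comm_time / total_time if total_time else float("nan")
print(f"computation score: {score:.2f}")
```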

GPU Utilization

Two key metrics:

matbun commented 2 weeks ago

I usually think of GPU utilization as the % of GPU in use returned by nvidia-smi (or similar), which has no unit of measurement, but could be converted into FLOP/s knowing the GPU's peak FLOP/s.
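For reference, a minimal sketch of reading that utilization figure programmatically with pynvml, including the crude conversion to FLOP/s mentioned above (the peak value is a hypothetical number you would look up for your specific GPU):

```python
import pynvml

# Hypothetical peak: e.g. FP16 tensor-core peak from the vendor datasheet.
PEAK_FLOPS = 312e12

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # same % nvidia-smi shows
print(f"GPU utilization: {util.gpu}%")
print(f"rough estimate: {util.gpu / 100 * PEAK_FLOPS:.3e} FLOP/s")
pynvml.nvmlShutdown()
```

Note that this conversion is only a rough estimate: the utilization % reports how busy the GPU was over the sampling window, not how close the active kernels were to peak FLOP/s.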

On the other hand, what you measure with the profiler gives you a breakdown of the compute time (communication, ops, I/O), which is useful for studying scalability and finding bottlenecks (for when we'll want to take a more "active" attitude towards scalability).

Wall-clock time gives a nice overview (e.g., avg epoch time).

Regarding the report, I would suggest having a look at the TensorBoard profiler integration (https://pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html). Maybe it is not useful for us, but it's worth trying.

Also, other interesting references: