huggingface / nanotron

Minimalistic large language model 3D-parallelism training
Apache License 2.0

Use CUDA Events for measuring elapsed time #143

Open staghado opened 7 months ago

staghado commented 7 months ago

Use CUDA events to measure elapsed time in the distributed trainer, in the all-pair-to-pair GPU throughput test, and in decode_text.
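For reference, a minimal sketch of the CUDA-event timing pattern this PR applies (the commented-out training step is a placeholder, not code from the PR):

```python
import torch

# Create events with timing enabled; they are recorded onto the CUDA
# stream, so they measure device-side time rather than host wall time.
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
# ... run the GPU work to be timed, e.g. one training iteration ...
end.record()

# elapsed_time() requires both events to have completed, so synchronize
# once at the end instead of blocking around every kernel launch.
torch.cuda.synchronize()
elapsed_ms = start.elapsed_time(end)  # milliseconds
```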

I would like to add some tests before review, but I am not sure what to test yet...

Issue: #88

staghado commented 6 months ago

Running simple tests with the tiny llama example (examples/train_tiny_llama.sh) gives the following results:

```
iteration: 14 / 15 | elapsed_time_per_iteration_ms: 14.8 | tokens_per_sec: 69.3K
iteration: 15 / 15 | elapsed_time_per_iteration_ms: 15.1 | tokens_per_sec: 67.7K
iteration: 14 / 15 | elapsed_time_per_iteration_ms: 14.3 | tokens_per_sec: 71.5K
iteration: 15 / 15 | elapsed_time_per_iteration_ms: 13.8 | tokens_per_sec: 74.1K
iteration: 14 / 15 | elapsed_time_per_iteration_ms: 14.3 | tokens_per_sec: 71.6K
iteration: 15 / 15 | elapsed_time_per_iteration_ms: 13.8 | tokens_per_sec: 74.3K
```

These values fluctuate from run to run, but time.time() seems to slightly overestimate the elapsed times, and dist.barrier() seems to have no effect when using CUDA events.
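A hypothetical micro-benchmark illustrating the overestimate: time.time() spans host-side Python and kernel-launch overhead, while event timestamps bracket only the device work. The matmul and its size here are arbitrary stand-ins for a training step, not code from this PR:

```python
import time
import torch

x = torch.randn(4096, 4096, device="cuda")  # arbitrary workload

start_ev = torch.cuda.Event(enable_timing=True)
end_ev = torch.cuda.Event(enable_timing=True)

torch.cuda.synchronize()  # start from an idle device
t0 = time.time()
start_ev.record()
y = x @ x  # the GPU work being timed
end_ev.record()
torch.cuda.synchronize()  # wait for the work (and end_ev) to complete
t1 = time.time()

print(f"time.time():  {(t1 - t0) * 1e3:.2f} ms")          # includes host overhead
print(f"CUDA events:  {start_ev.elapsed_time(end_ev):.2f} ms")  # device-side only
```

This would also explain why dist.barrier() makes no visible difference with events: the measured interval lives entirely on the CUDA stream, so host-side synchronization before or after it does not change what the events record.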

staghado commented 6 months ago

@NouamaneTazi