huggingface / nanotron

Minimalistic large language model 3D-parallelism training
Apache License 2.0

Use CUDA Events for measuring elapsed time #143

Open staghado opened 7 months ago

staghado commented 7 months ago

Use CUDA events to measure elapsed time in the distributed trainer, in the all-pair-to-pair GPU throughput test, and in decode_text.
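For reference, a minimal sketch of the CUDA-event timing pattern this PR applies (the commented-out training step is a placeholder, not code from the PR):

```python
import torch

# Create events with timing enabled; they are recorded onto the CUDA
# stream, so they measure device-side time rather than host wall time.
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
# ... run the GPU work to be timed, e.g. one training iteration ...
end.record()

# elapsed_time() requires both events to have completed, so synchronize
# once at the end instead of blocking around every kernel launch.
torch.cuda.synchronize()
elapsed_ms = start.elapsed_time(end)  # milliseconds
```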

I would like to add some tests before review, but I am not sure what to test yet...

Issue: #88

staghado commented 6 months ago

Running simple tests with the tiny llama example (examples/train_tiny_llama.sh) gives the following results:

```
iteration: 14 / 15 | elapsed_time_per_iteration_ms: 14.8 | tokens_per_sec: 69.3K
iteration: 15 / 15 | elapsed_time_per_iteration_ms: 15.1 | tokens_per_sec: 67.7K
iteration: 14 / 15 | elapsed_time_per_iteration_ms: 14.3 | tokens_per_sec: 71.5K
iteration: 15 / 15 | elapsed_time_per_iteration_ms: 13.8 | tokens_per_sec: 74.1K
iteration: 14 / 15 | elapsed_time_per_iteration_ms: 14.3 | tokens_per_sec: 71.6K
iteration: 15 / 15 | elapsed_time_per_iteration_ms: 13.8 | tokens_per_sec: 74.3K
```

These values fluctuate from run to run, but time.time() seems to slightly overestimate the elapsed times, and dist.barrier() seems to have no effect when using CUDA events.
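A hypothetical micro-benchmark illustrating the overestimate: time.time() spans host-side Python and kernel-launch overhead, while event timestamps bracket only the device work. The matmul and its size here are arbitrary stand-ins for a training step, not code from this PR:

```python
import time
import torch

x = torch.randn(4096, 4096, device="cuda")  # arbitrary workload

start_ev = torch.cuda.Event(enable_timing=True)
end_ev = torch.cuda.Event(enable_timing=True)

torch.cuda.synchronize()  # start from an idle device
t0 = time.time()
start_ev.record()
y = x @ x  # the GPU work being timed
end_ev.record()
torch.cuda.synchronize()  # wait for the work (and end_ev) to complete
t1 = time.time()

print(f"time.time():  {(t1 - t0) * 1e3:.2f} ms")          # includes host overhead
print(f"CUDA events:  {start_ev.elapsed_time(end_ev):.2f} ms")  # device-side only
```

This would also explain why dist.barrier() makes no visible difference with events: the measured interval lives entirely on the CUDA stream, so host-side synchronization before or after it does not change what the events record.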

staghado commented 6 months ago

@NouamaneTazi