staghado opened 7 months ago
Running some simple tests with the tiny Llama example (`examples/train_tiny_llama.sh`) gives the following results:
`time.time()`:

```
iteration: 14 / 15 | elapsed_time_per_iteration_ms: 14.8 | tokens_per_sec: 69.3K
iteration: 15 / 15 | elapsed_time_per_iteration_ms: 15.1 | tokens_per_sec: 67.7K
```

`dist.barrier()`:

```
iteration: 14 / 15 | elapsed_time_per_iteration_ms: 14.3 | tokens_per_sec: 71.5K
iteration: 15 / 15 | elapsed_time_per_iteration_ms: 13.8 | tokens_per_sec: 74.1K
```

`dist.barrier()`:

```
iteration: 14 / 15 | elapsed_time_per_iteration_ms: 14.3 | tokens_per_sec: 71.6K
iteration: 15 / 15 | elapsed_time_per_iteration_ms: 13.8 | tokens_per_sec: 74.3K
```
These values fluctuate from run to run, but `time.time()` seems to slightly overestimate the elapsed times, and `dist.barrier()` seems to have no effect when CUDA events are used.
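For context, here is a minimal sketch of the host-clock timing pattern being compared above; `train_step()` is a hypothetical placeholder for one training iteration, not a function from the codebase:

```python
import time

import torch
import torch.distributed as dist

# Host-clock timing: the barrier aligns all ranks before the measurement,
# and synchronize() flushes queued GPU work so time.time() measures the
# kernels themselves rather than just their asynchronous launch.
dist.barrier()
torch.cuda.synchronize()
t0 = time.time()
train_step()  # placeholder for one training iteration
torch.cuda.synchronize()
elapsed_ms = (time.time() - t0) * 1e3
```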
@NouamaneTazi
Use CUDA events to measure elapsed time in the distributed trainer, in the all pair-to-pair GPU throughput test, and in `decode_text`. I would like to add some tests before review, but I am not sure what to test yet...
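For illustration, a minimal sketch of timing with CUDA events, assuming the same hypothetical `train_step()` placeholder (not the exact code in this PR):

```python
import torch

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()     # enqueued on the current CUDA stream
train_step()       # placeholder for one training iteration
end.record()
end.synchronize()  # block the host until `end` has actually fired on the GPU
elapsed_ms = start.elapsed_time(end)  # GPU-side elapsed time in milliseconds
```

Because both events are recorded on the GPU stream, the measured interval only covers the work enqueued between the two `record()` calls, which would explain why an extra host-side `dist.barrier()` has no visible effect on these numbers.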
Issue: #88