❓ Question
Hello, I am training an mpt-3b model on AWS SageMaker using an ml.p4d.24xlarge instance and trying to replicate the results shown in this table: link.
Specifically, I am trying to reproduce the result for mpt-3b with the following configuration: max_seq_len=2048, global_train_batch_size=320, device_train_microbatch_size=5, on 8 a100_40gb GPUs. According to the table, the model should process 39 sequences per second. Since each global batch contains 320 sequences, a batch should finish in about 8.2 seconds (320 / 39). However, when I run it, each batch takes around 10 seconds (screenshot attached).
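For reference, here is the arithmetic behind my expectation (a quick sketch using only the numbers above):

```python
# Numbers from the table and my config (see above).
global_train_batch_size = 320  # sequences per global batch
table_throughput = 39          # seq/s claimed by the table

expected_batch_time = global_train_batch_size / table_throughput
print(f"expected: {expected_batch_time:.1f} s/batch")  # ~8.2 s

# Throughput implied by the ~10 s per batch I actually observe:
observed_batch_time = 10.0
print(f"observed: {global_train_batch_size / observed_batch_time:.0f} seq/s")  # 32 seq/s
```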
I am also looking for an explanation of the logs Composer emits before the start of every batch. I checked the documentation but couldn't find anything specific. I am particularly interested in the meaning of the following logs:
Lastly, I would like to know if there is an easy way to calculate TFLOP/s using the above logs.
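My current guess is the standard 6·N FLOPs-per-token approximation for a decoder-only transformer; a minimal sketch of that calculation with my numbers, assuming ~3e9 parameters (the helper name and the optional attention term are my own assumptions, not from Composer's docs):

```python
def estimate_tflops(n_params: float, seq_len: int, batch_size: int,
                    batch_time_s: float, n_layers: int = 0, d_model: int = 0) -> float:
    """Rough achieved TFLOP/s from per-batch wall-clock time.

    Uses ~6 * n_params FLOPs per token for a forward+backward pass,
    plus an optional 12 * n_layers * d_model * seq_len attention term
    (ignored while n_layers / d_model are left at 0).
    """
    tokens_per_batch = seq_len * batch_size
    flops_per_token = 6 * n_params + 12 * n_layers * d_model * seq_len
    return flops_per_token * tokens_per_batch / batch_time_s / 1e12

# With my numbers (~3B params, max_seq_len=2048, 320 seq/batch, ~10 s/batch):
total = estimate_tflops(3e9, 2048, 320, 10.0)
print(f"{total:.0f} TFLOP/s total, {total / 8:.0f} per GPU")  # spread over 8 GPUs
```

Is this roughly what Composer does internally, or does it use a different FLOP accounting for its throughput metrics?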
Here is the bash command that I am running: