❓ Question
Hello, I am training an mpt-3b model on AWS SageMaker using an ml.p4d.24xlarge instance and trying to replicate the results shown in this table: link.
Specifically, I am trying to reproduce the result for mpt-3b with the following configuration: max_seq_len=2048, global_train_batch_size=320, device_train_microbatch_size=5, on 8 a100_40gb GPUs. According to the table, the model should process 39 sequences per second. Since each global batch contains 320 sequences, a batch should finish in about 8.2 seconds (320 / 39). However, when I run it, each batch takes around 10 seconds (screenshot attached).
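For reference, here is the arithmetic behind my expectation (a quick sketch using only the numbers above):

```python
# Numbers from the table and my config (see above).
global_train_batch_size = 320  # sequences per global batch
table_throughput = 39          # seq/s claimed by the table

expected_batch_time = global_train_batch_size / table_throughput
print(f"expected: {expected_batch_time:.1f} s/batch")  # ~8.2 s

# Throughput implied by the ~10 s per batch I actually observe:
observed_batch_time = 10.0
print(f"observed: {global_train_batch_size / observed_batch_time:.0f} seq/s")  # 32 seq/s
```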
I am also looking for an explanation of the logs Composer emits before the start of every batch. I checked the documentation but couldn't find anything specific. I am particularly interested in the meaning of the following logs:
Lastly, I would like to know if there is an easy way to calculate TFLOP/s using the above logs.
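My current guess is the standard 6·N FLOPs-per-token approximation for a decoder-only transformer; a minimal sketch of that calculation with my numbers, assuming ~3e9 parameters (the helper name and the optional attention term are my own assumptions, not from Composer's docs):

```python
def estimate_tflops(n_params: float, seq_len: int, batch_size: int,
                    batch_time_s: float, n_layers: int = 0, d_model: int = 0) -> float:
    """Rough achieved TFLOP/s from per-batch wall-clock time.

    Uses ~6 * n_params FLOPs per token for a forward+backward pass,
    plus an optional 12 * n_layers * d_model * seq_len attention term
    (ignored while n_layers / d_model are left at 0).
    """
    tokens_per_batch = seq_len * batch_size
    flops_per_token = 6 * n_params + 12 * n_layers * d_model * seq_len
    return flops_per_token * tokens_per_batch / batch_time_s / 1e12

# With my numbers (~3B params, max_seq_len=2048, 320 seq/batch, ~10 s/batch):
total = estimate_tflops(3e9, 2048, 320, 10.0)
print(f"{total:.0f} TFLOP/s total, {total / 8:.0f} per GPU")  # spread over 8 GPUs
```

Is this roughly what Composer does internally, or does it use a different FLOP accounting for its throughput metrics?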
Here is the bash command that I am running: