mosaicml / examples

Fast and flexible reference benchmarks
Apache License 2.0

Explain composer logs emitted during training + Replicate Benchmark Results #414

Closed. geodra closed this issue 1 year ago

geodra commented 1 year ago

❓ Question

Hello, I am training an mpt-3b model on AWS SageMaker using an ml.p4d.24xlarge instance and trying to replicate the results displayed in this table: link.

Specifically, I am focusing on replicating the result for the mpt-3b model with the following configuration: max_seq_len: 2048, global_train_batch_size=320, device_train_microbatch_size=5, and 8 a100_40gb GPUs. According to the table, this configuration should process about 39 sequences per second. Since a batch contains 320 sequences, it should ideally finish in roughly 8.2 seconds. However, when I run it, each batch takes around 10 seconds (screenshot attached).
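
For reference, here is the arithmetic behind those numbers (using only the figures above):

expected: 320 sequences / 39 sequences per second ≈ 8.2 s per batch
observed: 320 sequences / 10 s per batch          ≈ 32 sequences per second (≈ 82% of the published throughput)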

I am also looking for an explanation of the logs emitted by Composer before the start of every batch. I have checked the documentation but couldn't find anything specific. I am particularly interested in understanding the meaning of the logs shown in the screenshot below.

Lastly, I would like to know if there is an easy way to calculate TFLOP/s using the above logs.
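
For what it's worth, here is the back-of-the-envelope estimate I am currently using, assuming the common "6 × parameter count FLOPs per token" approximation for a forward + backward pass and a nominal ~3B parameter count. I am not sure this matches the formula llm-foundry's SpeedMonitor actually uses, so please correct me if it differs:

# Rough sketch: estimate model TFLOP/s per GPU from an observed sequences-per-second figure.
# Assumptions (mine, not from llm-foundry): FLOPs per token ≈ 6 * n_params for fwd + bwd,
# attention FLOPs ignored.

def tflops_per_gpu(seqs_per_sec: float, seq_len: int, n_params: float, n_gpus: int) -> float:
    tokens_per_sec = seqs_per_sec * seq_len
    flops_per_sec = 6 * n_params * tokens_per_sec  # forward + backward approximation
    return flops_per_sec / n_gpus / 1e12

# Example with the numbers from this issue (n_params ~3e9 is a nominal guess):
print(tflops_per_gpu(seqs_per_sec=32, seq_len=2048, n_params=3e9, n_gpus=8))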

Here is the bash command that I am running:

composer train/train.py \
  train/yamls/pretrain/mpt-3b.yaml \
  data_local=my-copy-c4 \
  train_loader.dataset.split=train_small \
  eval_loader.dataset.split=val_small \
  max_duration=10ba \
  eval_interval=30ba \
  save_folder=mpt-3b \
  max_seq_len=2048 \
  global_train_batch_size=320 \
  device_train_microbatch_size=5

[screenshot of the per-batch training log output]

dakinggg commented 1 year ago

Closing as a duplicate of https://github.com/mosaicml/llm-foundry/issues/444