mosaicml / examples

Fast and flexible reference benchmarks
Apache License 2.0

MosaicBERT: pretraining configuration for models > 128 seq. length #442

Open stefan-it opened 6 months ago

stefan-it commented 6 months ago

Hi MosaicML team,

Many thanks for releasing the code and models for your MosaicBERT! I highly appreciate the effort that you put into modernizing the BERT architecture.

I am interested in pretraining MosaicBERT, so I have some questions :)

Many thanks in advance!

Stefan

Taytay commented 6 months ago

@stefan-it - I tried the commit on main and ran into a number of errors, and was pointed to #440, so I am planning to base my work on that unless I hear otherwise.

jacobfulano commented 5 months ago

Hi @stefan-it, we did not experiment with training at sequence length 128 and then switching to 512 (as in the original BERT paper by Devlin et al., 2018). In our experiments, training MosaicBERT-Base at sequence length 512 with batch size 4096 for 70,000 steps took roughly 30 hours on 8 A100 80 GB GPUs (see below).

It might take us a few more days to merge the FA2 PR #440, but do let us know if you run into any issues!

[image: MosaicBERT-Base pretraining time at sequence length 512 on 8 A100 80 GB GPUs]
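
For a rough sense of the scale of that run, here is a back-of-the-envelope sketch using only the numbers quoted above (sequence length 512, batch size 4096, 70,000 steps, ~30 hours on 8 A100 80 GB GPUs); the token count and throughput are derived estimates, not measured values:

```python
# Back-of-the-envelope estimate for the MosaicBERT-Base run quoted above.
# Inputs are the reported settings; outputs are derived, not measured.
seq_len = 512                 # training sequence length
global_batch_size = 4096      # sequences per optimization step
steps = 70_000                # total optimization steps
wall_clock_hours = 30         # reported time on 8x A100 80GB
num_gpus = 8

tokens_seen = seq_len * global_batch_size * steps
tokens_per_sec = tokens_seen / (wall_clock_hours * 3600)

print(f"tokens seen:        {tokens_seen / 1e9:.1f}B")         # ~146.8B tokens
print(f"approx. throughput: {tokens_per_sec / 1e6:.2f}M tokens/s "
      f"across {num_gpus} GPUs")                               # ~1.36M tokens/s
```
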
mmarius commented 5 months ago

Hi @jacobfulano, do you also have an estimate for how long it will take to pre-train MosaicBERT-Large on a sequence length of 512 with batch size 4096 for 70,000 steps?

jacobfulano commented 5 months ago

Hi @mmarius, we did not specifically train MosaicBERT-Large at sequence length 512 with batch size 4096 for 70,000 steps. However, my estimate would be roughly 4x the time it takes to train MosaicBERT-Large at sequence length 128 with batch size 4096 for 70,000 steps (~27.2 hours), so roughly 108 hours on 8 A100 80 GB GPUs.
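
To make the arithmetic behind that estimate explicit, here is a minimal sketch; the 4x factor is the rough scaling assumption stated above, not a measurement:

```python
# Sketch of the scaling estimate above: MosaicBERT-Large at sequence length 512
# is assumed to cost roughly 4x the measured sequence-length-128 run, since the
# batch size and step count stay the same.
hours_large_seq128 = 27.2     # measured: MosaicBERT-Large, seq len 128, 70k steps
scale_factor = 4              # assumed cost ratio when going from 128 to 512

hours_large_seq512 = hours_large_seq128 * scale_factor
gpu_hours = hours_large_seq512 * 8          # 8x A100 80GB

print(f"estimated wall clock: ~{hours_large_seq512:.1f} hours")   # ~108.8 hours
print(f"estimated GPU time:   ~{gpu_hours:.0f} A100-hours")       # ~870 A100-hours
```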

jacobfulano commented 5 months ago

If you are going any larger than that, I would recommend looking at mosaicml/llm-foundry, which should have support for training encoders/embedding models soon.