mosaicml / examples

Fast and flexible reference benchmarks
Apache License 2.0

Inquiry about Mosaic-BERT and BERT-Base Sequence Lengths #407

Closed mscherrmann closed 11 months ago

mscherrmann commented 1 year ago

I have been exploring the Mosaic-BERT model and noticed that it is trained with a sequence length of 128. My understanding is that this length can easily be extrapolated to longer sequences at inference time thanks to Attention with Linear Biases (ALiBi). However, in one of your blog posts you compare the Mosaic-BERT model with the Hugging Face BERT-Base model, and I am unclear about the sequence length used for training the BERT-Base model.

Specifically, I would like to know whether the BERT-Base model used as a baseline for Mosaic-BERT (for example, in the appended figure) was trained with a sequence length of 128 or 512. If it was trained with a sequence length of 128, what steps would be necessary to obtain a Mosaic-BERT model that matches the performance of a BERT-Base model trained with a sequence length of 512?

Thank you for your attention to this matter. I look forward to your response and clarification.

[figure: BertComparisonMNLI]

dakinggg commented 11 months ago

Apologies if I haven't totally understood your question.

From the blogpost: "For all BERT-Base models, we chose the training duration to be 286,720,000 samples of sequence length 128; this covers 78.6% of C4."

To fully pretrain a model with 512 sequence length, you'll just need to follow our guide, but change the max_seq_len param to 512.

Because of alibi, you can also start with a model trained with sequence length 128, and change max_seq_len to 512 to adapt it.
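For concreteness, here is a minimal sketch of the yaml overrides this implies, assuming the key names from yamls/main/hf-bert-base-uncased.yaml (everything not shown stays as in that file; the checkpoint path is hypothetical):

```yaml
# Pretrain (or adapt) with the longer sequence length.
max_seq_len: 512

# Optional: instead of pretraining from scratch, adapt an ALiBi model that was
# already pretrained at sequence length 128 by resuming from its checkpoint.
# load_path: /path/to/seq128-checkpoint.pt  # hypothetical path
```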

mscherrmann commented 11 months ago

Thank you!

mscherrmann commented 11 months ago

Hi,

I have one follow-up question:

What do I have to consider regarding global_train_batch_size and device_train_microbatch_size if I want to train with a sequence length of 512 instead of 128 tokens? If I leave everything as in the yamls/main/hf-bert-base-uncased.yaml file, I will probably run into memory problems. Do you have any tips in this regard? Or, even better, do you have a yaml for this case? I am training on 8x Nvidia A100 80 GB GPUs.

Unfortunately, trial and error works poorly for me because I always have to wait quite a long time before I get onto the GPU, hence the question. Thanks a lot!

dakinggg commented 11 months ago

global_train_batch_size is an optimization-related setting, and you may or may not want to change it; if you increase the sequence length, you see more tokens per batch. device_train_microbatch_size does not affect the math and is only related to memory. I'm not sure what setting will work on the exact setup you describe, but you can try device_train_microbatch_size=auto, which will determine it for you.
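As a hedged illustration, the two keys in yaml form (the batch size value below is purely illustrative, and 'auto' asks Composer to pick the largest microbatch that fits in GPU memory):

```yaml
# Optimization batch size in samples; with max_seq_len: 512, each sample carries
# roughly 4x more tokens than at 128, so keep or retune this as you see fit.
global_train_batch_size: 4096  # illustrative value, not a recommendation

# Per-device microbatch size only affects memory, not the training math.
device_train_microbatch_size: auto
```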

mscherrmann commented 11 months ago

Perfect, thank you for your quick response!

mscherrmann commented 11 months ago

I ran into another issue, sorry...

As mosaic-bert is not finetunable, I use the hf-bert. I follow the approach of the original BERT paper: train 90% of the steps with a sequence length of 128 and the remaining 10% of the steps with a sequence length of 512.

To accomplish this with your code, I run the "main" script for pretraining twice. The first run completes without any issue. However, in the second run, when I load the previous checkpoint with "load_path" and change the sequence length to 512, I get the following error:

ValueError: Reused local directory: ['/mnt/data/train'] vs ['/mnt/data/train']. Provide a different one.

The data is stored locally. Do you have any idea why this error occurs?

Thank you very much!

karan6181 commented 11 months ago

Hi @FinTexIFB, what do the remote and local parameters that you pass to StreamingDataset look like? Since your dataset resides locally, you can simply provide your local directory to the local parameter and set remote=None. For example, local='/mnt/data/train' and remote=None.
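In the repo's yamls this corresponds roughly to the following sketch, assuming the data_local / data_remote variables used in yamls/main/hf-bert-base-uncased.yaml and that /mnt/data contains the train split:

```yaml
# The dataset already lives on local disk: point data_local at the directory
# holding the splits and leave data_remote blank so it resolves to None and
# nothing is downloaded.
data_local: /mnt/data
data_remote:  # blank -> remote=None
```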

mscherrmann commented 11 months ago

Hi @karan6181,

Thank you for your response. Yes, setting local='/mnt/data/train' and remote=None is exactly what I've done.

However, I found a workaround: I simply created a new container from the same Mosaic Docker image and reinstalled all dependencies. Now it works, but only once; when I try to continue pre-training from an existing checkpoint afterwards, I get the error again. Maybe that is a bug.

jacobfulano commented 11 months ago

@FinTexIFB, mosaic-bert is finetunable, as can be seen in this yaml. Does this work for your use case?