mosaicml / composer


Training stops after first pass of Evaluation when pretraining MosaicBert #3421

Open amishparekh opened 5 days ago

amishparekh commented 5 days ago

Environment

- composer: 0.23.3 (also tried 0.17.2)
- GPU stack: 8 x A100 80GB
- CUDA: 12.1

To reproduce

Steps to reproduce the behavior:

  1. Run MosaicBERT using https://github.com/Skylion007/mosaicml-examples/tree/skylion007/add-fa2-to-bert which has Flash Attention 2 implementation.
  2. Change the remote data location to an AWS S3 path (a sketch of this change is included after this list).
  3. Wait until the end of evaluation and observe GPU utilisation.
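
In case it helps to make step 2 concrete, here is a minimal sketch of pointing a streaming dataset at an S3 remote using the generic mosaicml-streaming API; the bucket/prefix, local cache path, and batch size are placeholders rather than values from the actual run:

```python
# Minimal sketch: read converted C4 shards from a remote S3 prefix.
# The bucket/prefix and local cache directory below are hypothetical.
from torch.utils.data import DataLoader
from streaming import StreamingDataset

dataset = StreamingDataset(
    remote="s3://my-bucket/c4/train",  # converted streaming shards on S3
    local="/tmp/streaming_cache",      # local cache for downloaded shards
    shuffle=True,
    batch_size=128,
)
loader = DataLoader(dataset, batch_size=128, num_workers=8)
```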

Expected behavior

Ideally, when running the script, training starts, evaluation runs every 2000 steps, and training resumes after each evaluation pass.

Additional context

I ran the benchmark MosaicBERT pretraining script. Since our data lives at a remote S3 location, I first transferred the converted C4 streaming dataset to an S3 bucket as a test before running on our own data. Training started as expected and evaluation began at step 2000, but at the end of evaluation GPU utilisation drops to zero and training never resumes. I don't see any error, yet GPU memory stays occupied; once I manually quit, the memory is freed and everything returns to normal. I cannot figure out why training doesn't resume, and there is no error to go on. Since training started and evaluation ran to completion, my guess is that the script itself is not the issue, so I filed this as a Composer bug in case I am missing something. Any help would be appreciated. Thanks.
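
For reference, here is a minimal sketch of the Trainer wiring that corresponds to the behaviour above, assuming the standard Composer API; `model`, `train_loader`, and `eval_loader` are placeholders standing in for the example repo's setup, and `eval_interval="2000ba"` is inferred from the "2000 steps" mentioned above rather than copied from the actual config:

```python
# Sketch of the evaluation schedule that exhibits the hang: evaluate
# every 2000 training batches, then (in theory) resume training.
from composer import Trainer

trainer = Trainer(
    model=model,                    # ComposerModel wrapping MosaicBERT (placeholder)
    train_dataloader=train_loader,  # StreamingDataset-backed DataLoader (placeholder)
    eval_dataloader=eval_loader,    # eval split, also streamed from S3 (placeholder)
    eval_interval="2000ba",         # run evaluation every 2000 training batches
    max_duration="70000ba",         # placeholder training duration
)
trainer.fit()  # after the first eval pass, training reportedly never resumes
```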

jacobfulano commented 5 days ago

Hi @amishparekh! Do I understand correctly that you run into this issue with both composer = 0.23.3 and composer = 0.17.2? What version of streaming are you using?

amishparekh commented 5 days ago

Hi @jacobfulano! Yes, I first hit this with 0.17.2 and then upgraded to 0.23.3. Here is my requirements file:

einops==0.5.0
torch==2.2.1
composer[nlp,wandb]==0.23.3
mosaicml-streaming==0.7
mosaicml==0.23.3
omegaconf==2.3.0
transformers==4.35.2
flash_attn==2.5.8

I also used the requirements file from the FA2-to-BERT GitHub repo; since that wasn't working, I changed it and upgraded to the latest versions. Thank you!
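
For completeness, one quick way to confirm the versions actually imported at runtime (rather than just the pins in the requirements file) is a generic check like the following:

```python
# Print the versions that are actually importable in the running
# environment; pins in requirements.txt can drift from what is installed.
import composer
import streaming
import torch
import transformers

print("composer:    ", composer.__version__)
print("streaming:   ", streaming.__version__)
print("torch:       ", torch.__version__)
print("transformers:", transformers.__version__)
```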

amishparekh commented 5 days ago

[Screenshot, 2024-06-24 at 1:39 PM]

Wandb logging screenshot for reference.

[Screenshot, 2024-06-24 at 1:42 PM]

I tried train_microbatch_size set to auto and 128, and eval_batch_size set to 128, 64, and 1.

amishparekh commented 4 days ago

Hi @jacobfulano! Did you get a chance to look into this issue? Any help would be massively appreciated. I see some older issues where num_workers > 0 caused errors on previous PyTorch versions, but I don't see how that would apply here. Thanks
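
In the meantime, one generic way to see where a silently hung run is stuck, using only the standard library (a debugging sketch, not something suggested in the thread), is to have every rank periodically dump its thread stacks:

```python
# Hang-debugging sketch: periodically dump every thread's stack to a
# per-rank file so a silent post-eval hang shows where each process blocks.
import faulthandler
import os

rank = int(os.environ.get("RANK", "0"))  # set by the distributed launcher
trace_file = open(f"/tmp/stacks_rank{rank}.log", "w")

# Re-dump all thread stacks every 300 seconds until cancelled.
faulthandler.dump_traceback_later(300, repeat=True, file=trace_file)

# ... run training as usual; inspect /tmp/stacks_rank*.log if it hangs ...
# faulthandler.cancel_dump_traceback_later()  # stop the periodic dumps
```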