Open amishparekh opened 5 days ago
Hi @amishparekh! Do I understand correctly that you run into this issue with both composer = 0.23.3 and composer = 0.17.2? What version of streaming are you using?
Hi @jacobfulano! Yes, I first hit it with 0.17.2, so I upgraded to 0.23.3. Here is my requirements file:
einops==0.5.0
torch==2.2.1
composer[nlp,wandb]==0.23.3
mosaicml-streaming==0.7
mosaicml==0.23.3
omegaconf==2.3.0
transformers==4.35.2
flash_attn==2.5.8
I also tried the requirements file mentioned in the FA2 to bert GitHub repo. Since that wasn't working, I upgraded to the latest versions. Thank you!
Wandb logging screenshot for reference.
Tried train_microbatch_size: auto and 128. Tried eval_batch_size: 128, 64, and 1.
Hi @jacobfulano! Did you get a chance to look into this issue? Any help would be massively appreciated. I see some issues reported for earlier PyTorch versions where num_workers > 0 caused errors, but I don't see how that would apply here. Thanks
Environment
composer: 0.23.3 (also reproduced with 0.17.2)
GPU stack: 8 x A100 80GB
CUDA: 12.1
To reproduce
Steps to reproduce the behavior:
Expected behavior
When the script runs, training should start, evaluation should run every 2000 steps, and training should resume after each evaluation.
Additional context
I ran the benchmark MosaicBERT pretraining script. Because our own data lives in a remote S3 location, I first uploaded the converted C4 streaming dataset to an S3 bucket to test the setup before running it on our data. Training started as expected, and evaluation began at step 2000. At the end of evaluation, GPU utilisation drops to zero and training never resumes. No error is reported, but GPU memory stays occupied. Once I manually quit, the GPU memory is released and everything is back to normal. I cannot figure out why training does not resume or why no error appears. Since training started and evaluation ran to the end, my guess is that the script is not the issue, so I filed this as a Composer bug in case I am missing something. Any help would be appreciated. Thanks.
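Since the hang produces no error, one way to see where each process is stuck is to have Python dump its thread stacks periodically. The following is a minimal sketch using only the standard-library `faulthandler` module; it is not part of Composer or Streaming, and the helper names `arm_hang_watchdog` and `disarm_hang_watchdog`, the timeout, and the log path are hypothetical choices for illustration.

```python
import faulthandler


def arm_hang_watchdog(timeout_s: float, log_path: str):
    """Write every thread's Python stack to log_path every timeout_s seconds.

    Hypothetical helper: the dumps repeat until disarm_hang_watchdog() is called.
    The returned file object must stay open while the watchdog is armed.
    """
    log_file = open(log_path, "w")
    # repeat=True keeps dumping every timeout_s seconds, so a process that
    # stalls after evaluation keeps showing the frame it is blocked in.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=log_file)
    return log_file


def disarm_hang_watchdog(log_file) -> None:
    """Cancel the pending dumps and close the log file."""
    faulthandler.cancel_dump_traceback_later()
    log_file.close()
```

A possible usage, assuming one log file per process: call `arm_hang_watchdog(600, "hang_rank0.log")` in the training script before `trainer.fit()` and `disarm_hang_watchdog(...)` after it returns. If the job stalls after evaluation, the repeated dumps should show whether each process is waiting in a dataloader worker, a collective call, or somewhere else.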