Open · jkl375 opened this issue 7 months ago

Hi, author. When I set seq-length=700k, an OOM occurred. My torch version is 2.4.0.dev20240324. Do I need to set gradient-accumulate-every to 1?
Emm, interesting. Honestly, I haven't run 700K for this long; I only ran the first 5 steps and called it a day due to my limited compute resources. Yeah, I think setting the gradient accumulation to 1 would help.
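For reference, a minimal sketch of that launch (the script name `train.py` and the overall invocation are assumptions; only the `--seq-length` and `--gradient-accumulate-every` flags appear in this thread):

```bash
# Hypothetical launch command -- train.py is a placeholder script name;
# --seq-length and --gradient-accumulate-every are the flags from this thread.
accelerate launch train.py \
  --seq-length 700000 \
  --gradient-accumulate-every 1
```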
After setting accumulation to 1, an OOM still occurs at step 7.
I wonder if it's because of the limits that zhuzilin, the author of ring-flash-attention, mentioned: https://github.com/zhuzilin/ring-flash-attention?tab=readme-ov-file#limits
Just adding PYTORCH_CUDA_ALLOC_CONF='max_split_size_mb:1024' before the accelerate command did the trick; for now, training runs to step 37 with no problem.
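Concretely, the workaround looks something like this (same assumptions about the script and flags as in the sketch above; the environment variable setting itself is exactly what this comment reports):

```bash
# max_split_size_mb:1024 stops the CUDA caching allocator from splitting
# blocks larger than 1024 MB, which reduces memory fragmentation on
# long-sequence runs. It must be set in the environment of the launch command.
PYTORCH_CUDA_ALLOC_CONF='max_split_size_mb:1024' \
accelerate launch train.py \
  --seq-length 700000 \
  --gradient-accumulate-every 1
```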