Closed · Neo9061 closed this 1 month ago
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
I just had the same error; it happened when saving weights at the end of an epoch. Any suggestions?
Hey! Could you give a bit more detail about the transformers version / accelerate version you are using, etc.?
`transformers-cli env`
should output that.
@ArthurZucker thanks for taking a look! I was trying to fine-tune Idefics2. I also tried all the transformers versions that enable Idefics2 and none of them worked; the training loss looks fine throughout. I found related discussions, but both suggested fixes failed for me: (1) starting fresh training runs, and (2) setting the timeout value to a much larger number (for (2), I let it run for ~12 hours and it still did not reach epoch 2).
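For readers hitting the same thing: one common way to raise the distributed timeout when training with the Hugging Face Trainer is the `ddp_timeout` argument of `TrainingArguments`, which is forwarded to `torch.distributed.init_process_group`. The snippet below is only a minimal sketch of that approach; the output path, batch size, and 3-hour value are illustrative assumptions, not values taken from this thread.

```python
from transformers import TrainingArguments

# Minimal sketch: raise the collective-communication timeout so that a long
# end-of-epoch save does not trip the default 30-minute (1800 s) limit.
# All values here are placeholders for illustration only.
training_args = TrainingArguments(
    output_dir="out",               # placeholder output directory
    num_train_epochs=2,
    per_device_train_batch_size=1,
    save_strategy="epoch",
    ddp_timeout=3 * 60 * 60,        # seconds; passed to torch.distributed.init_process_group
)

# trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
# trainer.train()
```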
- `transformers` version: 4.44.0.dev0
- Platform: Linux-5.10.0-31-cloud-amd64-x86_64-with-glibc2.29
- Python version: 3.8.10
- Huggingface_hub version: 0.24.2
- Safetensors version: 0.4.3
- Accelerate version: 0.33.0
- Accelerate config: not found
- PyTorch version (GPU?): 2.4.0+cu121 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: <fill in>
- Using GPU in script?: <fill in>
- GPU type: NVIDIA A100-SXM4-80GB
I think this was fixed in the 4.44.1 patch!
System Info
Following Philipp Schmid's blog post on running FSDP + QLoRA in SageMaker.
The training script is the default one.
The model I used is mistral-community/Mixtral-8x22B-v0.1.
Training runs on 2 × p4de.24xlarge instances (each instance has 640 GB of GPU memory and 1024 GB of CPU memory).
Training completes, but it fails at the very last step, model saving (this line); loading the base model and merging the base model with the adapter have already finished.
The error is as follows.
I checked the memory usage and it does not appear to be an OOM. Could it be that the model merging and saving step takes so long that the process times out? (A sketch of that merge-and-save step is below.)
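For context, here is a minimal sketch of what a typical merge-and-save step looks like with PEFT; this is an assumption based on the standard `PeftModel` / `merge_and_unload()` / `save_pretrained()` flow, not the exact code from the referenced training script, and the paths are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Placeholder paths -- not the actual ones from this run.
base_model_id = "mistral-community/Mixtral-8x22B-v0.1"
adapter_dir = "/opt/ml/model/adapter"      # hypothetical adapter checkpoint location
output_dir = "/opt/ml/model/merged"

# Reload the base model on CPU with reduced memory pressure before merging.
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id, torch_dtype=torch.bfloat16, low_cpu_mem_usage=True
)

# Attach the trained QLoRA adapter and fold its weights into the base model.
model = PeftModel.from_pretrained(base_model, adapter_dir)
model = model.merge_and_unload()

# Saving a ~140B-parameter model writes many GB of safetensors shards; on a
# multi-node job this can exceed a 30-minute collective timeout if the other
# ranks are blocked on a barrier while rank 0 writes.
model.save_pretrained(output_dir, safe_serialization=True, max_shard_size="5GB")
AutoTokenizer.from_pretrained(base_model_id).save_pretrained(output_dir)
```

If the other ranks are indeed waiting on a collective while rank 0 does this, raising the distributed timeout (as sketched earlier in the thread) or running the merge/save outside the distributed process group are common workarounds.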
Who can help?
@ArthurZucker @philschmid @muellerzr @SunMarc
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
See the description above.
Expected behavior
Training completes and the merged model is saved without error.