huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

save_state() and load_state() do not work correctly with multi-gpu with shuffle=True in dataloader #3158

Closed. isayoften closed this issue 3 days ago.

isayoften commented 1 month ago

System Info

- `Accelerate` version: 1.0.0
- Platform: Linux-5.15.154+-x86_64-with-glibc2.35
- `accelerate` bash location: /opt/conda/bin/accelerate
- Python version: 3.10.14
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.4.0 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- PyTorch MUSA available: False
- System RAM: 31.36 GB
- GPU type: Tesla T4
- `Accelerate` default config:
    Not found

Reproduction

https://www.kaggle.com/code/amanattheedge/demonstration

Expected behavior

Maybe I'm doing something wrong, but save_state() and load_state() should save and restore the RNG states, so that after resuming from a checkpoint the DataLoader reproduces the same shuffling order for the remainder of the epoch.
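For reference, the expected behavior can be illustrated with plain PyTorch, no GPUs or Accelerate required: an explicit torch.Generator drives the DataLoader's shuffling, and saving/restoring its state reproduces the same batch order. This is the behavior one would expect accelerator.save_state()/load_state() to provide for the RNG behind shuffle=True (a hypothetical sketch, not Accelerate's actual implementation):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(10))

# An explicit generator makes the shuffling RNG state visible.
gen = torch.Generator().manual_seed(0)
loader = DataLoader(dataset, batch_size=2, shuffle=True, generator=gen)

rng_state = gen.get_state()               # analogous to save_state()
first_order = [batch[0].tolist() for batch in loader]

gen.set_state(rng_state)                  # analogous to load_state()
second_order = [batch[0].tolist() for batch in loader]

# Restoring the generator state reproduces the exact shuffle order.
assert first_order == second_order
```

Iterating the DataLoader draws a fresh permutation from the generator each epoch (advancing its state), so capturing and restoring that state is what makes the shuffle order reproducible across a checkpoint/resume boundary.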

github-actions[bot] commented 1 week ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.