training loop freezes after first step on TPU

huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support

https://huggingface.co/docs/accelerate

Apache License 2.0


training loop freezes after first step on TPU #2899

Status: Closed (drimeF0 closed this issue 1 month ago)

drimeF0 commented 3 months ago

https://colab.research.google.com/drive/1dxNpliW4JZIt6aut340dbUG7I9xLVpWI?usp=sharing

accelerate version: 0.31.0
diffusers version: 0.29.2
torch version: 2.3.0
torch_xla[tpu] version: 2.3.0
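For anyone trying to reproduce this, the installed versions can be collected with a short script. This is only a minimal sketch using the standard library's `importlib.metadata`; the helper name `report_versions` is hypothetical, and the package names are taken from the list above.

```python
from importlib import metadata

# Package names come from the report above; report_versions is a
# hypothetical helper, not part of accelerate or torch_xla.
PACKAGES = ["accelerate", "diffusers", "torch", "torch_xla"]


def report_versions(packages=PACKAGES):
    """Return a dict mapping each package to its version or 'not installed'."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is missing from the current environment.
            versions[name] = "not installed"
    return versions


if __name__ == "__main__":
    for name, version in report_versions().items():
        print(f"{name} version: {version}")
```

Running it in the same Colab runtime as the training code should make it easier to compare environments.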

The code did not continue past the first step even after another 30 minutes, at which point the Google Colab session expired.

drimeF0 commented 3 months ago

without accelerate

drimeF0 commented 3 months ago

The first step takes ~48 seconds.
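To tell a slow step apart from a real freeze, it may help to log the wall-clock time of each step. The following is a rough sketch, not code from the notebook: `time_steps` and `dummy_step` are hypothetical names, and `step_fn` stands in for whatever performs one training step.

```python
import time


def time_steps(step_fn, num_steps):
    """Call step_fn(step) num_steps times and return per-step durations in seconds."""
    durations = []
    for step in range(num_steps):
        start = time.perf_counter()
        step_fn(step)
        elapsed = time.perf_counter() - start
        durations.append(elapsed)
        # flush=True so the last completed step is visible even if the
        # next one never returns.
        print(f"step {step}: {elapsed:.2f}s", flush=True)
    return durations


def dummy_step(step):
    # Placeholder workload (assumption); replace with one real training step.
    # On lazily executed devices the step should force execution, for example
    # by reading the loss value, or the measured time may not reflect the
    # actual device work.
    time.sleep(0.01)


if __name__ == "__main__":
    time_steps(dummy_step, 3)
```

If the log shows step 0 finishing and step 1 never printing, that would at least confirm the hang happens inside the second step rather than in data loading or logging around it.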

github-actions[bot] commented 2 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.