mgrankin / ru_transformers

Apache License 2.0

Saving the model freezes the TPU when using the latest torch_xla package and provided tpu_lm_finetuning script #28

Closed vbogach closed 4 years ago

vbogach commented 4 years ago

The `xm.save(...)` call at https://github.com/mgrankin/ru_transformers/blob/766343f3ed121ca7c4583c1d0cfaa4a390e88db8/tpu_lm_finetuning.py#L307 freezes the TPU because `xm.save` performs a rendezvous internally (https://github.com/pytorch/xla/blob/46bff8a6e12035f1857c52e74e263c7077cd3ed2/torch_xla/core/xla_model.py#L635). In tpu_lm_finetuning.py, however, the function is called only when `xm.is_master_ordinal()` is true, so the other cores never reach the rendezvous point and the master core blocks forever.
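To make the deadlock mechanism concrete, here is a minimal, self-contained simulation (plain Python threads, no `torch_xla` required). The `threading.Barrier` stands in for the rendezvous inside `xm.save`: a barrier only trips when every party arrives, so a master-only call hangs. The `NUM_CORES` constant, the `run` helper, and the 1-second timeout (standing in for "hangs forever") are all illustrative choices, not anything from the repository.

```python
import threading

NUM_CORES = 4  # pretend TPU cores


def run(pattern):
    """Run NUM_CORES worker threads; 'save' is a rendezvous (Barrier),
    mirroring the barrier inside xm.save. Returns True if every thread
    finished, False if the rendezvous deadlocked (detected via timeout)."""
    barrier = threading.Barrier(NUM_CORES)
    finished = []

    def save():
        # xm.save rendezvous-es internally: every core must arrive.
        barrier.wait(timeout=1.0)  # timeout stands in for "freezes forever"

    def worker(ordinal):
        try:
            if pattern == "master_only":
                if ordinal == 0:   # broken pattern: only the master calls save,
                    save()         # so the barrier never trips and ordinal 0 hangs
            else:
                save()             # fixed pattern: every core calls save
            finished.append(ordinal)
        except threading.BrokenBarrierError:
            pass  # rendezvous never completed -> simulated TPU freeze

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_CORES)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(finished) == NUM_CORES


print(run("master_only"))  # False: the master blocks at the rendezvous
print(run("all_cores"))    # True: all cores reach the barrier together
```

The usual remedy matching the "fixed" pattern above is to call `xm.save` unconditionally on every core: it already restricts the actual file write to the master ordinal, and the shared call lets the internal rendezvous complete.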

vbogach commented 4 years ago

#21 might actually be the same issue.

mgrankin commented 4 years ago

Thank you for reporting the issue. I believe the error stems from the age of the code: PyTorch/XLA changes quite quickly, and my code has not been updated for some time. The rendezvous is a relatively new function. I plan to update the project later this year. Meanwhile, a PR is welcome!

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.