Closed mo-soliman closed 1 year ago
Similar problem here: https://discuss.huggingface.co/t/kaggle-tpuvm-doesnt-allow-setting-nprocs-1/35999/2 @muellerzr
Again on TPUs. This can be reproduced really easily in Kaggle kernels just with `accelerate test --config_file ...`, so it is not model dependent. When using my own model & script for training I get the same error, so the problem is definitely with XLA or accelerate.
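For anyone trying to reproduce: the config file passed to `accelerate test` would look roughly like the sketch below. The field names follow what `accelerate config` writes out; the exact values here are assumptions, not copied from the issue.

```yaml
# Sketch of a TPU config used with `accelerate test --config_file <path>`.
# Field names from `accelerate config`; values are assumptions.
compute_environment: LOCAL_MACHINE
distributed_type: TPU
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 8  # one process per core on a v2-8; this is the setting that fails
use_cpu: false
```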
Any updates? I can't really train my scripts in the meantime :(
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info
Information
Tasks
A `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)

Reproduction
I created a TPU VM v2-8 (from Google). When running this example script: https://github.com/huggingface/accelerate/blob/main/examples/nlp_example.py it fails with this error:
It's worth noting:

1) When running the same script with `num_processes=1` in the TPU configuration, it works normally.
2) When running another example from the Google documentation (training a ResNet with `num_processes=8`), it works normally (you can find it here: https://cloud.google.com/tpu/docs/pytorch-xla-ug-tpu-vm#changing_pytorch_version):
```shell
git clone --recursive https://github.com/pytorch/xla.git
python3 xla/test/test_train_mp_imagenet.py --fake_data --model=resnet50 --num_epochs=1
```
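For completeness, the working single-process run in point 1 above differs from the failing one only in the `num_processes` field of the accelerate config. A minimal sketch (field names from `accelerate config`; values are assumptions):

```yaml
# Workaround config: same TPU setup, but a single process.
distributed_type: TPU
num_machines: 1
num_processes: 1  # works; num_processes: 8 triggers the failure
```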
Setting these environment variables didn't help.

What is the cause, and how can I fix this error? Thanks!
Expected behavior
Training with no errors