arkadiusz-czerwinski closed this issue 1 year ago
Thanks for the report! Would you like to open a PR with your suggested change?
This actually should probably be changed to utilize install_xla
instead. (seen here: https://github.com/huggingface/accelerate/blob/main/src/accelerate/utils/torch_xla.py)
@arkadiusz-czerwinski any chance you could verify that works as a fix, and as Sylvain suggested perhaps put a PR in? :)
To use:
from accelerate.utils import install_xla
install_xla()
Thank you for your suggestion. Unfortunately, neither using install_xla() nor updating the wheel inside the script helps. I will do my best to look into this issue, however I am not sure what my time availability will look like.
The current bug looks like it tries to initialize the TPU once again on each core, since the error message above is repeated 8 times.
So XLA as a whole is essentially broken on Colab, noted. I'll look into this ASAP.
Weirdly enough, this copy of the pytorch/xla example seems to be working after a minor tweak of installing and uninstalling torchvision, so maybe some changes in the implementation of common functions (pytorch/xla) were the reason. Unfortunately it seems like your previous message has no steps to reproduce; I assume some kind of pasting error.
Yeah apologies that wasn't a message that was supposed to be sent :)
Thanks @arkadiusz-czerwinski, this is indeed a critical issue that seems to stem from a sneaky change on XLA's side, and I've reported it to their team. Hopefully we can have a swift resolution with their help. For now you can still run your code through the CLI in Colab; I've confirmed that this still works.
Update: it's an issue on our side, somewhere between the latest release and now. In the interim @arkadiusz-czerwinski you can install accelerate with just pip install accelerate; there's nothing TPU-specific on main at this time, so it's okay to do this!
Great, thanks for the fix. Do you have a working example in a notebook after the fix? The Simple NLP Example at this minute, using the pip install command "cloud-tpu-client==0.10 torch==1.13.0 https://storage.googleapis.com/tpu-pytorch/wheels/colab/torch_xla-1.13-cp38-cp38-linux_x86_64.whl", seems to be generating the same error, as seen in this colab. It is very likely I am doing something wrong, or that the package wasn't updated yet, so I am deeply sorry if that is the case.
You need to install accelerate through either pip install accelerate or pip install git+https://github.com/huggingface/accelerate. The other ones you mention there don't matter.
(Before it installed from git)
As of now, on Python 3.8, which seems to be the only Python installed on Colab, it fails. Should I revert to Python 3.7? Once again I am sorry to ask that question, but for whatever reason notebook_launcher still fails for me, and it is most likely due to some kind of setup mishap.
Can you share the whole setup/notebook you are running with me?
And can you try installing with:
! pip install datasets transformers
! pip install cloud-tpu-client==0.10 torch==1.9.0 https://storage.googleapis.com/tpu-pytorch/wheels/torch_xla-1.9-cp37-cp37m-linux_x86_64.whl
! pip install git+https://github.com/huggingface/accelerate
As this worked for me, I will try with torch 1.13.0 here shortly
I see that it pulled the right commit, let me try with torch 1.13
Sure, the link is here: https://colab.research.google.com/drive/1UGera0AH0hkQwCuTvopN5Phmb0grznhS?usp=sharing As for the code above, Colab, at least in my case, runs with Python 3.8, so a wheel for 3.7 wouldn't work. As for the setup above, it returns: ERROR: torch_xla-1.9-cp37-cp37m-linux_x86_64.whl is not a supported wheel on this platform.
@arkadiusz-czerwinski I was able to get it working on the following install setup, and I will put a PR in to change the notebooks to show this as well, unless you would like to :) (They're over on huggingface/notebooks/examples/accelerate):
! pip install datasets transformers evaluate
! pip install cloud-tpu-client==0.10 torch==1.13.0
! pip install https://storage.googleapis.com/tpu-pytorch/wheels/colab/torch_xla-1.13-cp38-cp38-linux_x86_64.whl
! pip install git+https://github.com/huggingface/accelerate
(The same install you had at the start of this issue)
Hi, with these environment settings: And this notebook: https://colab.research.google.com/drive/1RooP8tXNTO-07S9BIyAAJ398lBT2ZgJ9?usp=sharing it still doesn't appear to work. Would you mind running said notebook and seeing if the output changes? You can edit it as you like. After it is resolved, I certainly wouldn't mind updating and testing the remaining examples and install_xla.
Interesting, it works on non-high-RAM (which is what I was testing on). Let me look at the high-RAM version now.
@arkadiusz-czerwinski it ran just fine for me on a high-ram instance. See this notebook: https://gist.github.com/muellerzr/55c78e876efa9df9c10c704607e42b0d
(Note I cut it off simply to show it's running without error, it's currently training at < 5 min total rn :) )
So apparently my runtime environment wasn't properly deleted, but now it is working. Sorry for causing trouble. I am not sure if you want me to do so, but I certainly can test the changes across different notebooks and then create a PR changing them.
Working in colab can be frustrating with that, believe me :) Definitely please feel free to debug and open a PR! :) Happy to help if you still face issues there too
I also found what was causing some of the errors. When executing this debug snippet:
for batch in train_dataloader:
print({k: v.shape for k, v in batch.items()})
outputs = model(**batch)
break
Despite not creating an Accelerator instance beforehand, pytorch/xla seems to do some hidden initialization, so the moment the model is called before training_function it is assigned a device, and reassigning it in the loop then creates an error in the config. I will also add this as a comment in the notebook if no one minds. Maybe it is obvious, but I think an additional remark may help some people.
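To make that note concrete, here is a minimal sketch of the pattern that avoids the problem: all model construction is deferred into the function that notebook_launcher spawns on each TPU core, so nothing gets assigned a device in a top-level cell first. (create_model and create_dataloader are hypothetical placeholders, not accelerate APIs; the training loop is elided.)

```
from accelerate import Accelerator, notebook_launcher

def training_function():
    # Each spawned TPU process builds its own Accelerator and model,
    # so torch_xla never pre-assigns a device in the main process.
    accelerator = Accelerator()
    model = create_model()                  # hypothetical helper
    train_dataloader = create_dataloader()  # hypothetical helper
    model, train_dataloader = accelerator.prepare(model, train_dataloader)
    # ... training loop goes here ...

# Spawns one process per TPU core (8 on a Colab TPU)
notebook_launcher(training_function, num_processes=8)
```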
Yes, a note on that would be great!
System Info
Information
Tasks
A no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
Reproduction
Go to https://github.com/huggingface/accelerate#launching-your-training-from-a-notebook and run a notebook. The XLA version won't match. Upon updating the first cell to:
The following error is generated:
It seems like an error regarding distribution of data and models across different TPU cores. Running the examples from the terminal doesn't solve the issue, at least for me on the current Colab version, although this statement should be verified, as I might have done something wrong.
Unfortunately, for Python 3.8, which is the default Python for Colab, only this wheel is available for pytorch_xla, as per the link.
Expected behavior
The PyTorch XLA version should be updated, e.g. using the following command; the command below is only an example, but it shouldn't throw the error above.
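For reference, the setup cell that ended up working earlier in this thread (same wheel URL and versions quoted in the messages above; only an example, the exact wheel may change with Colab's Python version):

```
! pip install cloud-tpu-client==0.10 torch==1.13.0
! pip install https://storage.googleapis.com/tpu-pytorch/wheels/colab/torch_xla-1.13-cp38-cp38-linux_x86_64.whl
! pip install git+https://github.com/huggingface/accelerate
```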