huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

TPU cannot be run in Colab anymore #1119

Closed arkadiusz-czerwinski closed 1 year ago

arkadiusz-czerwinski commented 1 year ago

System Info

The current Accelerate PyTorch example in the README doesn't work.

Information

Tasks

Reproduction

Go to https://github.com/huggingface/accelerate#launching-your-training-from-a-notebook and run the notebook; the XLA version won't match. Upon updating the first cell to:

  !pip install datasets transformers
  !pip install cloud-tpu-client==0.10 torch==1.13.0 https://storage.googleapis.com/tpu-pytorch/wheels/colab/torch_xla-1.13-cp38-cp38-linux_x86_64.whl
  !pip install git+https://github.com/huggingface/accelerate

The following error is generated:

Exception in device=TPU:2: Cannot replicate if number of devices (1) is different from 8
  File "/usr/local/lib/python3.8/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 331, in _mp_start_fn

It seems to be an error in the distribution of data and models across the TPU cores. Running the examples from the terminal doesn't solve the issue either, at least for me on the current Colab version, although this should be verified, as I might have done something wrong.

Unfortunately, for Python 3.8, which is the default Python on Colab, only this wheel is available for torch_xla, as per the link.

Expected behavior

The torch_xla version should be updated, e.g. using the following command (which is only an example); after the update, it should no longer throw the error above.

!pip install cloud-tpu-client==0.10 torch==1.13.0 https://storage.googleapis.com/tpu-pytorch/wheels/colab/torch_xla-1.13-cp38-cp38-linux_x86_64.whl
sgugger commented 1 year ago

Thanks for the report! Would you like to open a PR with your suggested change?

muellerzr commented 1 year ago

This should probably be changed to use install_xla instead (seen here: https://github.com/huggingface/accelerate/blob/main/src/accelerate/utils/torch_xla.py).

@arkadiusz-czerwinski any chance you could verify that works as a fix, and as Sylvain suggested perhaps put a PR in? :)

To use:

from accelerate.utils import install_xla
install_xla()
arkadiusz-czerwinski commented 1 year ago

Thank you for your suggestion. Unfortunately, neither using install_xla() nor updating the wheel inside the script helps. I will do my best to look into this issue; however, I am not sure what my time availability will look like.

The current bug looks like it tries to initialize the TPU again on each core, since the error message above is repeated 8 times.

muellerzr commented 1 year ago

So XLA as a whole is essentially broken on Colab, noted. I'll look into this ASAP.

arkadiusz-czerwinski commented 1 year ago

Weirdly enough, this copy of the pytorch/xla example seems to work after a minor tweak of installing and uninstalling torchvision, so maybe some change in the implementation of common functions (pytorch/xla) was the reason. Unfortunately, your previous message seems to have no steps to reproduce; I assume some kind of copy-pasting error.

muellerzr commented 1 year ago

Yeah, apologies, that message wasn't supposed to be sent :)

muellerzr commented 1 year ago

Thanks @arkadiusz-czerwinski, this is indeed a critical issue that seems to stem from a sneaky change on XLA's side, and I've reported it to their team. Hopefully we can reach a swift resolution with their help. For now you can still run your code through the CLI in Colab; I've confirmed that this still works.

muellerzr commented 1 year ago

Update: it's an issue on our side, somewhere between the latest release and now. In the interim, @arkadiusz-czerwinski, you can install Accelerate with just pip install accelerate; there's nothing TPU-specific on main at this time, so it's okay to do this!

arkadiusz-czerwinski commented 1 year ago

Great, thanks for the fix. Do you have a working notebook example after the fix? At this minute, the Simple NLP Example using the pip install command "cloud-tpu-client==0.10 torch==1.13.0 https://storage.googleapis.com/tpu-pytorch/wheels/colab/torch_xla-1.13-cp38-cp38-linux_x86_64.whl" seems to generate the same error, as seen in this colab. It is very likely I am doing something wrong, or that the package hasn't been updated yet, so I am deeply sorry if that is the case.

muellerzr commented 1 year ago

You need to install Accelerate through either pip install accelerate or pip install git+https://github.com/huggingface/accelerate. The other packages you mention there don't matter.

(Before, it installed from git.)

arkadiusz-czerwinski commented 1 year ago

As of now, on Python 3.8, which seems to be the only Python installed on Colab, it fails. Should I revert to Python 3.7? Once again, I am sorry to ask that question, but for whatever reason notebook_launcher still fails for me, and it is most likely due to some kind of setup mishap.

muellerzr commented 1 year ago

Can you share the whole setup/notebook you are running with me?

And can you try installing with:

! pip install datasets transformers
! pip install cloud-tpu-client==0.10 torch==1.9.0 https://storage.googleapis.com/tpu-pytorch/wheels/torch_xla-1.9-cp37-cp37m-linux_x86_64.whl
! pip install git+https://github.com/huggingface/accelerate

This worked for me; I will try with torch 1.13.0 here shortly.

muellerzr commented 1 year ago

I see that it pulled the right commit, let me try with torch 1.13

arkadiusz-czerwinski commented 1 year ago

Sure, the link is here: https://colab.research.google.com/drive/1UGera0AH0hkQwCuTvopN5Phmb0grznhS?usp=sharing. As for the code above: Colab, at least in my case, runs with Python 3.8, so a wheel for 3.7 won't work. The setup above returns: ERROR: torch_xla-1.9-cp37-cp37m-linux_x86_64.whl is not a supported wheel on this platform.
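
The "not a supported wheel" error above comes from the `cpXY` tag in the wheel filename, which must match the running interpreter. A minimal sketch (stdlib only) of how to check which tag your interpreter needs before picking a torch_xla wheel:

```python
import sys

# Build the CPython version tag (e.g. "cp38") for the running interpreter.
# A wheel whose filename carries a different cpXY tag will be rejected by pip,
# which is why a cp37 wheel fails on Colab's Python 3.8.
tag = f"cp{sys.version_info.major}{sys.version_info.minor}"
print(tag)
```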

muellerzr commented 1 year ago

@arkadiusz-czerwinski I was able to get it working with the following install setup, and I will put a PR in to change the notebooks to show this as well, unless you would like to :) (They're over in huggingface/notebooks/examples/accelerate):

! pip install datasets transformers evaluate
! pip install cloud-tpu-client==0.10 torch==1.13.0
! pip install https://storage.googleapis.com/tpu-pytorch/wheels/colab/torch_xla-1.13-cp38-cp38-linux_x86_64.whl
! pip install git+https://github.com/huggingface/accelerate

(The same install you had at the start of this issue)

arkadiusz-czerwinski commented 1 year ago

Hi, with these environment settings (image) and this notebook: https://colab.research.google.com/drive/1RooP8tXNTO-07S9BIyAAJ398lBT2ZgJ9?usp=sharing, it still doesn't appear to work. Would you mind running said notebook and seeing if the output changes? You can edit it as you like. After it is resolved, I certainly wouldn't mind updating and testing the remaining examples and install_xla.

muellerzr commented 1 year ago

Interesting, it works on non-high-RAM (what I was testing on). Let me look at the high-RAM version now.

muellerzr commented 1 year ago

@arkadiusz-czerwinski it ran just fine for me on a high-ram instance. See this notebook: https://gist.github.com/muellerzr/55c78e876efa9df9c10c704607e42b0d

(Note: I cut it off simply to show it's running without error; it's currently training at < 5 min total right now :) )

arkadiusz-czerwinski commented 1 year ago

So apparently my runtime environment wasn't properly deleted, but now it is working. Sorry for causing trouble. I am not sure whether you want me to do so, but I can certainly test the changes across the different notebooks and then create a PR with the updates.

muellerzr commented 1 year ago

Working in Colab can be frustrating that way, believe me :) Definitely feel free to debug and open a PR! :) Happy to help if you still face issues there, too.

arkadiusz-czerwinski commented 1 year ago

I also found the bug that was causing some of the errors, when executing this debug snippet:

for batch in train_dataloader:
    print({k: v.shape for k, v in batch.items()})
    outputs = model(**batch)
    break

Despite no Accelerator instance being created beforehand, pytorch/xla seems to do some hidden initialization: the moment the model is called before training_function, it is assigned a device, and reassigning it in the loop then creates an error in the configuration. I will also add this as a comment to the notebook if no one minds. Maybe it is obvious, but I think an additional remark may help some people.
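
The safe pattern this implies is to build the model *inside* the function handed to the launcher, so no process-global device binding happens before the workers spawn. A minimal sketch of that principle using plain stdlib multiprocessing as a stand-in (the `launch`/`training_function` names below are hypothetical helpers for illustration, not accelerate's API):

```python
import multiprocessing as mp

def training_function(rank):
    # Build the model (here just a stand-in dict) INSIDE the launched
    # function, so each spawned process creates its own fresh copy rather
    # than inheriting an object the notebook already bound to a device.
    model = {"rank": rank, "weights": [0.0] * 4}
    return len(model["weights"])

def launch(fn, num_processes=2):
    # Minimal stand-in for a notebook launcher: run fn(rank) in separate
    # worker processes and collect the results.
    with mp.Pool(num_processes) as pool:
        return pool.map(fn, range(num_processes))

if __name__ == "__main__":
    print(launch(training_function))
```

With accelerate's real notebook_launcher, the same rule applies: any cell that touches the model (even a debug forward pass like the one above) before the launcher runs can leave it pinned to one XLA device.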

muellerzr commented 1 year ago

Yes, a note on that would be great!