pytorch / xla

Enabling PyTorch on XLA Devices (e.g. Google TPU)
https://pytorch.org/xla

Cloud installation error of .whl file #4663

Open mfatih7 opened 1 year ago

mfatih7 commented 1 year ago

Hello

I am trying to follow the instructions here.

cd /usr/share/
sudo git clone -b release/1.13 --recursive https://github.com/pytorch/pytorch 
cd pytorch/
sudo git clone -b r1.13 --recursive https://github.com/pytorch/xla.git
cd xla/
yes | sudo pip3 uninstall torch_xla
yes | sudo pip3 uninstall torch
yes | sudo pip3 uninstall torch_vision
sudo pip3 install torch==1.13.0
sudo pip3 install torchvision==0.14.0
sudo pip3 install https://storage.googleapis.com/tpu-pytorch/wheels/tpuvm/torch_xla-1.13-cp38-cp38-linux_x86_64.whl
sudo rm -rf /usr/local/lib/python3.8/dist-packages/libtpu*
sudo pip3 install torch_xla[tpuvm]

In line

sudo pip3 install https://storage.googleapis.com/tpu-pytorch/wheels/tpuvm/torch_xla-1.13-cp38-cp38-linux_x86_64.whl

I get the error

ERROR: torch_xla-nightly-cp38-cp38-linux_x86_64.whl is not a supported wheel on this platform.

The current Python version of Google Cloud observed from Shell is

Python 3.9.2

What can I do?

JackCaoG commented 1 year ago

We currently do not have a Python 3.9 wheel, either for nightly or for releases. However, you are trying to install a TPUVM wheel, which suggests you want to run this on a TPUVM. All TPUVMs should have Python 3.8 for now. Which image are you using to start your TPUVM?
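
The mismatch described above can be spotted before running pip. The following is a minimal sketch, not pip's actual compatibility logic: it only compares the CPython tag in a wheel filename (cp38 in this thread) with the interpreter that is running the script. The helper names wheel_python_tag, interpreter_tag, and wheel_matches_interpreter are hypothetical and not part of pip or torch_xla.

```python
import sys


def wheel_python_tag(wheel_name):
    """Return the Python tag (e.g. 'cp38') from a wheel filename or URL."""
    filename = wheel_name.rsplit("/", 1)[-1]
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename!r}")
    # Wheel filenames end with: <python tag>-<abi tag>-<platform tag>.whl
    parts = filename[: -len(".whl")].split("-")
    if len(parts) < 5:
        raise ValueError(f"unexpected wheel filename layout: {filename!r}")
    return parts[-3]


def interpreter_tag(version_info=None):
    """Return the CPython tag for a (major, minor, ...) version tuple."""
    # Assumption: the interpreter is CPython, as on the machines in this thread.
    major, minor = tuple(version_info or sys.version_info)[:2]
    return f"cp{major}{minor}"


def wheel_matches_interpreter(wheel_name, version_info=None):
    """Simplified check: exact match between wheel tag and interpreter tag."""
    return wheel_python_tag(wheel_name) == interpreter_tag(version_info)


if __name__ == "__main__":
    WHEEL_URL = (
        "https://storage.googleapis.com/tpu-pytorch/wheels/tpuvm/"
        "torch_xla-1.13-cp38-cp38-linux_x86_64.whl"
    )
    print("interpreter tag:", interpreter_tag())
    print("wheel tag:", wheel_python_tag(WHEEL_URL))
    print("match:", wheel_matches_interpreter(WHEEL_URL))
```

Running this with the same python3 that pip3 uses shows whether the shell is the expected Python 3.8 environment. Real wheels can also carry generic tags such as py3 or abi3, which this exact-match sketch does not handle.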

mfatih7 commented 1 year ago

I am accessing Google Cloud TPUs under the TPU Research Cloud program.

I generated TPU-VMs for PyTorch with

gcloud compute tpus tpu-vm create tpu-v2-8-000 --zone=us-central1-f --accelerator-type=v2-8 --version=tpu-vm-pt-1.13

JackCaoG commented 1 year ago

Hmm, weird, I was not expecting that. Can you run

gcloud compute tpus tpu-vm ssh ${TPU_NAME} \
 --zone ${ZONE} \
 --project ${PROJECT_ID}

to SSH into the VM, then run

python3

and check which Python version it is. If it is Python 3.9, then something is going wrong.

mfatih7 commented 1 year ago

Hello

I realized that I was running the commands in the ide.cloud.google.com shell, which has Python 3.9.2.

After I connect to the VM via SSH from the ide.cloud.google.com shell, I get a console with Python 3.8.10. I also connected to the VM via SSH from the Google Cloud SDK Shell and got a console with Python 3.8.10.

However, I am still getting errors when I try to follow the procedure in the link.

import torch_xla.core.xla_model as xm

This line gives the error below:

>>> import torch_xla.core.xla_model as xm
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/share/pytorch/xla/torch_xla/__init__.py", line 9, in <module>
    from .version import __version__
ModuleNotFoundError: No module named 'torch_xla.version'

mfatih7 commented 1 year ago

OK

The problem is after running the installation lines

cd /usr/share/
sudo git clone -b release/1.13 --recursive https://github.com/pytorch/pytorch 
cd pytorch/
sudo git clone -b r1.13 --recursive https://github.com/pytorch/xla.git
cd xla/
yes | sudo pip3 uninstall torch_xla
yes | sudo pip3 uninstall torch
yes | sudo pip3 uninstall torch_vision
sudo pip3 install torch==1.13.0
sudo pip3 install torchvision==0.14.0
sudo pip3 install https://storage.googleapis.com/tpu-pytorch/wheels/tpuvm/torch_xla-1.13-cp38-cp38-linux_x86_64.whl
sudo rm -rf /usr/local/lib/python3.8/dist-packages/libtpu*
sudo pip3 install torch_xla[tpuvm]

we must change the working directory back to /home/user_name, because I think the installed torch_xla is not accessible when the working directory is /usr/share/pytorch/xla.

Adding a change directory command before the imports might be helpful.

cd /home/user_name

When I use ide.cloud.google.com, I can browse the file system of the machine on which the IDE shell is running. Can I browse the file system of a VM on ide.cloud.google.com after I connect to it from the ide.cloud.google.com shell? Or do I need to add Cloud Code to Visual Studio Code on my local machine to browse the VM's file system?

JackCaoG commented 1 year ago

ModuleNotFoundError: No module named 'torch_xla.version' occurs because you ran the python command inside the xla directory. When you import torch_xla there, Python imports the local directory instead of the system-wide package you installed. If you move to your home directory, that problem should go away.
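
To confirm this kind of shadowing, one can ask Python where it would load a module from without importing it. The sketch below uses only the standard library (importlib.util.find_spec). The helper names import_origin and is_shadowed_by_cwd are hypothetical, and the check assumes that any module resolved from under the current working directory counts as a local copy.

```python
import importlib.util
import os


def import_origin(module_name):
    """Return the path Python would load module_name from, or None."""
    spec = importlib.util.find_spec(module_name)
    if spec is None:
        return None
    if spec.origin:
        return spec.origin
    # Namespace packages have no origin; fall back to their first search path.
    locations = spec.submodule_search_locations
    return next(iter(locations), None) if locations else None


def is_shadowed_by_cwd(module_name):
    """True if module_name resolves to a path under the current directory."""
    origin = import_origin(module_name)
    # Built-in and frozen modules have no file on disk to compare against.
    if origin is None or origin in ("built-in", "frozen"):
        return False
    cwd = os.path.realpath(os.getcwd())
    target = os.path.realpath(os.path.abspath(origin))
    return os.path.commonpath([cwd, target]) == cwd


if __name__ == "__main__":
    name = "torch_xla"
    print("resolved from:", import_origin(name))
    print("shadowed by current directory:", is_shadowed_by_cwd(name))
```

When started from /usr/share/pytorch/xla, a check like this would be expected to report the source checkout rather than the installed package; after changing to the home directory, it should point at the installed location instead.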

As for VS Code, I believe it can handle remote SSH. @alanwaketan should have some more info regarding that.

mileseverett commented 1 year ago

Google Colab is now on Python 3.9, btw.

JackCaoG commented 1 year ago

@mileseverett Thanks! We are working on the python 3.9 wheel build for colab. @yeounoh FYI

debloper commented 1 year ago

In Google Colab, this creates a weird situation where Python doesn't allow installing from that wheel because it is built for version 3.8, and a build for 3.9 is not yet available (ctrl+f in https://storage.googleapis.com/tpu-pytorch/).

I had to use !update-alternatives --set python3 /usr/bin/python3.8 (thankfully, 3.8 is already installed) to make the module installations go through. Putting it here in case someone finds it useful (until the 3.9 package is released, that is).

However, the suggestion to change directory didn't work in the Colab environment:

[screenshot attached in the original thread]

alanwaketan commented 1 year ago

Thanks, @debloper. BTW, Jack is OOO until April 17th.

mfatih7 commented 1 year ago

@debloper

Can you check here?