googlecolab / colabtools

Python libraries for Google Colaboratory
Apache License 2.0

Upgrading PyTorch to v2.2.0, torchvision to v0.17.0, and torchaudio to v2.2.0 #4344

Open atalman opened 7 months ago

atalman commented 7 months ago

Hello,

We released:

  • pytorch v2.2.0
  • torchvision v0.17.0
  • torchaudio v2.2.0

The wheel installation instructions are as follows.

pytorch

Install command for CUDA 12.1 environment:

pip install torch==2.2.0

Project link: https://pypi.org/project/torch/2.2.0/#files

torchvision

Install command for CUDA 12.1 environment:

pip install torchvision==0.17.0

Project link: https://pypi.org/project/torchvision/0.17.0/#files

torchaudio

Install command for CUDA 12.1 environment:

pip install torchaudio==2.2.0

Project link: https://pypi.org/project/torchaudio/2.2.0/#files

Other notes: if you require wheels for Python 3.8, 3.9, 3.10, 3.11, or 3.12, we support CPU, CUDA 11.8, and CUDA 12.1 compute platforms. You can find the links here: download.pytorch.org/whl/torch_stable.html
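
(Editorial aside for readers following along, hedged: selecting one of the non-default compute platforms is normally done by pointing pip at the matching wheel index. A minimal sketch, assuming the standard download.pytorch.org index layout:)

# Sketch: install the CUDA 11.8 builds instead of the default cu121 ones
pip install torch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 \
    --index-url https://download.pytorch.org/whl/cu118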

We're looking to have it updated in Colab.

Thanks very much.

cc'ing @colaboratory-team @mayankmalik-colab @malfet @seemethere

Similar to https://github.com/googlecolab/colabtools/issues/4039

mayankmalik-colab commented 7 months ago

We will do that soon. Tracked internally at b/323302699.

atalman commented 7 months ago

Hello @mayankmalik-colab, we just released version 2.2.1: pypi.org/project/torch/2.2.1/#files. Please use this instead of the 2.2.0 version.

mayankmalik-colab commented 7 months ago

Hello @atalman, I was trying to use the torch-2.2.1 wheel, but it installs CUDA dependencies as well, which is not the case with the current torch-2.1.0 wheel we use. It is important that we don't install CUDA dependencies, as those interfere with other frameworks like JAX. Can you point me to torch-2.2.1 wheels that don't install CUDA dependencies?


huydhn commented 7 months ago

From what I see, we switched from the big wheel model in 2.1.0, where all the CUDA dependencies are bundled inside PyTorch, to the small wheel model in 2.2.x, where the CUDA dependencies come from PyPI (if you are using pip). You can see the size of the 2.1.0 cu121 wheel is 2GB+, while 2.2.1 cu121 is around 700MB.
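
(Editorial aside, hedged: a quick way to tell which model an environment ended up with is to look for the NVIDIA runtime packages that the small wheel pulls in from PyPI.)

# Small-wheel installs show nvidia-cublas-cu12, nvidia-cudnn-cu12, etc.;
# big-wheel installs bundle these libraries inside torch/lib instead.
pip list | grep -i nvidia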

I have two thoughts:

atalman commented 6 months ago

Hi @mayankmalik-colab. Yes, in torch 2.1.0 we used to package the CUDA dependencies in the torch lib folder, and we also provided a version that installed torch + CUDA dependencies via pip. Since 2.2.0 we have switched to small wheels that install the CUDA dependencies via pip.

If you preinstall all the CUDA dependencies, then running pip install should install only torch and the other missing dependencies, like this:

pip3 install --pre torch  --index-url https://download.pytorch.org/whl/cu121 --upgrade
Defaulting to user installation because normal site-packages is not writeable
Looking in indexes: https://download.pytorch.org/whl/cu121
Requirement already satisfied: torch in /home/atalman/.local/lib/python3.8/site-packages (2.3.0.dev20240227+cu121)
Requirement already satisfied: nvidia-cuda-cupti-cu12==12.1.105 in /home/atalman/.local/lib/python3.8/site-packages (from torch) (12.1.105)
Requirement already satisfied: typing-extensions>=4.8.0 in /home/atalman/.local/lib/python3.8/site-packages (from torch) (4.9.0)
Requirement already satisfied: fsspec in /usr/local/lib/python3.8/dist-packages (from torch) (2023.1.0)
Requirement already satisfied: nvidia-curand-cu12==10.3.2.106 in /home/atalman/.local/lib/python3.8/site-packages (from torch) (10.3.2.106)
Requirement already satisfied: nvidia-cudnn-cu12==8.9.2.26 in /home/atalman/.local/lib/python3.8/site-packages (from torch) (8.9.2.26)
Requirement already satisfied: pytorch-triton==3.0.0+901819d2b6 in /home/atalman/.local/lib/python3.8/site-packages (from torch) (3.0.0+901819d2b6)
Requirement already satisfied: jinja2 in /usr/local/lib/python3.8/dist-packages (from torch) (3.1.2)
Requirement already satisfied: nvidia-nccl-cu12==2.19.3 in /home/atalman/.local/lib/python3.8/site-packages (from torch) (2.19.3)
Requirement already satisfied: nvidia-cublas-cu12==12.1.3.1 in /home/atalman/.local/lib/python3.8/site-packages (from torch) (12.1.3.1)
Requirement already satisfied: nvidia-cufft-cu12==11.0.2.54 in /home/atalman/.local/lib/python3.8/site-packages (from torch) (11.0.2.54)
Requirement already satisfied: nvidia-cuda-nvrtc-cu12==12.1.105 in /home/atalman/.local/lib/python3.8/site-packages (from torch) (12.1.105)
Requirement already satisfied: nvidia-cuda-runtime-cu12==12.1.105 in /home/atalman/.local/lib/python3.8/site-packages (from torch) (12.1.105)
Requirement already satisfied: filelock in /usr/local/lib/python3.8/dist-packages (from torch) (3.9.0)
Requirement already satisfied: nvidia-cusolver-cu12==11.4.5.107 in /home/atalman/.local/lib/python3.8/site-packages (from torch) (11.4.5.107)
Requirement already satisfied: nvidia-nvtx-cu12==12.1.105 in /home/atalman/.local/lib/python3.8/site-packages (from torch) (12.1.105)
Requirement already satisfied: networkx in /home/atalman/.local/lib/python3.8/site-packages (from torch) (3.1)
Requirement already satisfied: nvidia-cusparse-cu12==12.1.0.106 in /home/atalman/.local/lib/python3.8/site-packages (from torch) (12.1.0.106)
Requirement already satisfied: sympy in /home/atalman/.local/lib/python3.8/site-packages (from torch) (1.12)
Requirement already satisfied: nvidia-nvjitlink-cu12 in /home/atalman/.local/lib/python3.8/site-packages (from nvidia-cusolver-cu12==11.4.5.107->torch) (12.3.101)
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.8/dist-packages (from jinja2->torch) (2.1.2)
Requirement already satisfied: mpmath>=0.19 in /home/atalman/.local/lib/python3.8/site-packages (from sympy->torch) (1.3.0)

malfet commented 6 months ago

I was trying to use the torch-2.2.1 wheel, but it installs CUDA dependencies as well, which is not the case with the current torch-2.1.0 wheel we use. It is important that we don't install CUDA dependencies, as those interfere with other frameworks like JAX.

@mayankmalik-colab just want to clarify that those CUDA dependencies are not system-wide, i.e. they are installed in the Python site-packages folder and, unless one sets LD_PRELOAD in a very specific way, should not be visible from other packages. The 2.1.0 wheel was quite a similar story, i.e. those dependencies were dynamic libraries bundled with the wheel, and only when torch is imported are the respective dependencies loaded into the process's address space.
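
(To make that concrete, an editorial sketch, not from the original comment: one can verify from Python that the pip-installed CUDA libraries sit under site-packages rather than on a system path.)

import os
import sysconfig

# The small-wheel CUDA deps unpack into <site-packages>/nvidia/<lib>/lib
site_packages = sysconfig.get_paths()["purelib"]
nvidia_dir = os.path.join(site_packages, "nvidia")
if os.path.isdir(nvidia_dir):
    for name in sorted(os.listdir(nvidia_dir)):
        print(os.path.join(nvidia_dir, name))
else:
    print("no pip-installed NVIDIA libraries here (big-wheel or CPU install)")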

I wonder if you have a test that you run that used to work with torch-2.1.0 and JAX but fails with torch-2.2.1.

atalman commented 6 months ago

For reference, this is the issue we used for deprecating large wheels: https://github.com/pytorch/pytorch/issues/113972

Is this issue related to https://github.com/googlecolab/colabtools/issues/4345?

@mayankmalik-colab Could you please post more information about the conflict you are seeing. We are interested in the following:

  • What version of JAX and how it's installed
  • How torch is installed

So that we can try to reproduce this conflict in our environment. Are the installation scripts in OSS? Could you post a link or share some of the script with us?

kiukchung commented 6 months ago

Would torch-2.2.1 work with CUDA 12.4 without building from source, or are there specific kernels that wouldn't work?

I tried to install the latest JAX and torch in a Python 3.10 virtualenv. I installed jax first, then pytorch, to force pip to pull CUDA 12.4 (which jax depends on), and got this error:

$ pip install -U "jax[cuda12_pip]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
$ pip install torch==2.2.1

$ python -c "import torch"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/google/home/kiuk/.pyenv/versions/venv310/lib/python3.10/site-packages/torch/__init__.py", line 237, in <module>
    from torch._C import *  # noqa: F403
ImportError: /usr/local/google/home/kiuk/.pyenv/versions/venv310/lib/python3.10/site-packages/torch/lib/../../nvidia/cusparse/lib/libcusparse.so.12: undefined symbol: __nvJitLinkComplete_12_4, version libnvJitLink.so.12

Looking at the error more closely, it seemed to be due to the versions of cusparse and nvjitlink not matching up. It looks like JAX doesn't depend on cusparse, so cusparse-12.1.0.106 was installed with torch, but nvjitlink was left at 12.4.99 (from the jax install).

So I figured I'd just upgrade to cusparse-12.3.0.142 and see what happens, and I was able to get past the error above (see the sketch below). That said, I realize I'm not actually hitting any sparse kernels, so it's hard to say whether this actually works, and in any case it'll be better to build pytorch from source.
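
(For concreteness, the upgrade described above amounts to the following; hedged: the pin is simply the version reported to work in this environment.)

# Bring cusparse in line with the 12.4-era nvjitlink that jax pulled in
pip install -U nvidia-cusparse-cu12==12.3.0.142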

Here's the list of nvidia libs and versions that I had in my virtual environment to get past the initial errors from both jax and torch and successfully create CUDA tensors with each lib (again, I haven't actually run any ops, so it's hard to conclude that this works):

nvidia-cublas-cu12       12.4.2.65
nvidia-cuda-cupti-cu12   12.4.99
nvidia-cuda-nvcc-cu12    12.4.99
nvidia-cuda-nvrtc-cu12   12.1.105
nvidia-cuda-runtime-cu12 12.4.99
nvidia-cudnn-cu12        8.9.7.29
nvidia-cufft-cu12        11.2.0.44
nvidia-curand-cu12       10.3.2.106
nvidia-cusolver-cu12     11.6.0.99
nvidia-cusparse-cu12     12.3.0.142
nvidia-nccl-cu12         2.19.3
nvidia-nvjitlink-cu12    12.4.99
nvidia-nvtx-cu12         12.1.105

triton                   2.2.0

jax                      0.4.25
jaxlib                   0.4.25+cuda12.cudnn89

torch                    2.2.1

FWIW, internally in google3 we are building torch-2.2.1 with CUDA 12.3 (but we use clang, not nvcc, to compile CUDA).

cc @malfet , @atalman

mayankmalik-colab commented 6 months ago

For reference, this is the issue we used for deprecating large wheels: pytorch/pytorch#113972

Is this issue related to #4345?

@mayankmalik-colab Could you please post more information about the conflict you are seeing. We are interested in the following:

  • What version of JAX and how it's installed
  • How torch is installed

So that we can try to reproduce this conflict in our environment. Are the installation scripts in OSS? Could you post a link or share some of the script with us?

@atalman @malfet I got stuck on some other work, so I couldn't reply earlier.

Anyway, check the pointers below:

Run !pip install torch==2.2.1 and then:

import jax
print(jax.default_backend())

You would see cpu as the output, along with the warning: WARNING:jax._src.xla_bridge:CUDA backend failed to initialize: Found CUDA version 12010, but JAX was built against version 12020, which is newer. The copy of CUDA that is installed must be at least as new as the version against which JAX was built. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)

On the other hand, if you run:

!pip install torch==2.2.1

!pip uninstall -y nvidia-nvtx-cu12 nvidia-nvjitlink-cu12 nvidia-nccl-cu12 nvidia-curand-cu12 nvidia-cufft-cu12 nvidia-cuda-runtime-cu12 nvidia-cuda-nvrtc-cu12 nvidia-cuda-cupti-cu12 nvidia-cublas-cu12 nvidia-cusparse-cu12 nvidia-cudnn-cu12 nvidia-cusolver-cu12

import jax
print(jax.default_backend())

-> It would return gpu

also,

import torch
print(torch.cuda.is_available())

will return true. (We already have CUDA 12.2 installed via APT.) I ran a bunch of torch code and I was able to access the GPU just fine.

I was wondering if there is a way NOT to install the CUDA dependencies while installing torch, or could I uninstall those CUDA dependencies in our script? Any thoughts on a permanent solution?

malfet commented 6 months ago

I was wondering if there is a way NOT to install the CUDA dependencies while installing torch

!pip install --no-deps is the answer to your question.
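
(Editorial sketch of what that could look like; hedged: the dependency list below is read off the pip output earlier in this thread, not an official recipe.)

# Install torch without pulling in the nvidia-*-cu12 packages
!pip install --no-deps torch==2.2.1
# Then add back the non-CUDA dependencies torch still needs
# (triton, used by torch.compile, is also skipped and may need installing separately)
!pip install filelock typing-extensions sympy networkx jinja2 fsspec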

or could I uninstall those CUDA dependencies in our script?

If it passes some smoke tests, I don't see a problem, i.e. CUDA 12.2 should be binary compatible with 12.1, so if torch finds all the libraries it would most likely work (it would be nice to run some smoke tests though; I can provide you with a small list).

Any thoughts on a permanent solution?

Are you building JAX from source or installing it from pip for the colab container? If the former, why not do the same and build PyTorch from source as well? If the latter, then how does JAX find the CUDA libraries it depends on?

mayankmalik-colab commented 6 months ago

I was wondering if there is a way NOT to install the CUDA dependencies while installing torch

!pip install --no-deps is the answer to your question.

or could I uninstall those CUDA dependencies in our script?

If it passes some smoke tests, I don't see a problem, i.e. CUDA 12.2 should be binary compatible with 12.1, so if torch finds all the libraries it would most likely work (it would be nice to run some smoke tests though; I can provide you with a small list).

Any thoughts on a permanent solution?

Are you building JAX from source or installing it from pip for the colab container? If the former, why not do the same and build PyTorch from source as well? If the latter, then how does JAX find the CUDA libraries it depends on?

@malfet

mayankmalik-colab commented 6 months ago

@malfet @atalman - we upgraded torch and the other related packages. However, we had to remove the CUDA-related dependencies downloaded along with torch, so torch is using the system CUDA for now. We tested a few basic things and it seems to be working fine, but if you would like to test anything, feel free to do so.

atalman commented 6 months ago

@mayankmalik-colab Could you describe how the removal of the CUDA-related dependencies was done?

kiukchung commented 6 months ago

@mayankmalik-colab Could you describe how the removal of the CUDA-related dependencies was done?

python3 -m pip uninstall -y \
        nvidia-cublas-cu12 \
        nvidia-cuda-cupti-cu12 \
        nvidia-cuda-nvrtc-cu12 \
        nvidia-cuda-runtime-cu12 \
        nvidia-cudnn-cu12 \
        nvidia-cufft-cu12 \
        nvidia-curand-cu12 \
        nvidia-cusolver-cu12 \
        nvidia-cusparse-cu12 \
        nvidia-nccl-cu12 \
        nvidia-nvjitlink-cu12 \
        nvidia-nvtx-cu12

mayankmalik-colab commented 6 months ago

@mayankmalik-colab Could you describe how the removal of the CUDA-related dependencies was done?

python3 -m pip uninstall -y \
        nvidia-cublas-cu12 \
        nvidia-cuda-cupti-cu12 \
        nvidia-cuda-nvrtc-cu12 \
        nvidia-cuda-runtime-cu12 \
        nvidia-cudnn-cu12 \
        nvidia-cufft-cu12 \
        nvidia-curand-cu12 \
        nvidia-cusolver-cu12 \
        nvidia-cusparse-cu12 \
        nvidia-nccl-cu12 \
        nvidia-nvjitlink-cu12 \
        nvidia-nvtx-cu12

Thanks @kiukchung. Yes, that's how we did it. I know this is not ideal, but we had to do it for now.

atalman commented 6 months ago

Here are some basic tests to see if basic functionality is there: https://github.com/pytorch/builder/blob/main/test/smoke_test/smoke_test.py, run with:

MATRIX_GPU_ARCH_VERSION=12.1
MATRIX_GPU_ARCH_TYPE=cuda
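
(For a quicker sanity check before the full script, a minimal stand-in could look like this; editorial sketch, not the linked smoke_test.py.)

import torch

# Confirm the wheel can see a CUDA device at all
assert torch.cuda.is_available(), "CUDA not visible to torch"
print(torch.__version__, torch.version.cuda)

# Exercise one real GPU op end to end
x = torch.rand(64, 64, device="cuda")
y = x @ x.t()
torch.cuda.synchronize()
print("ok:", float(y.sum()))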

atalman commented 6 months ago

On the PyTorch side, we would need to:

  1. Run the smoke tests in a colab session and make sure they pass
  2. Recreate the colab setup in a local dev environment and see if we can automate testing for colab

PaliC commented 5 months ago

Validation of the fixes can be found at https://github.com/pytorch/pytorch/issues/123296