Closed vedantroy closed 1 year ago
Is there a guideline for what the matching version of torchvision should be for a given torch commit?
In general, the master of PyTorch is compatible with the main of TorchVision. Nightlies are compatible for the same day. You have
[pip3] torch==1.14.0.dev20221027+cu116
and
[conda] torchvision 0.15.0a0+0dceac0 pypi_0 pypi
I'm guessing this is a bug in the collection and you actually installed torchvision from source, right? Commit should be 0dceac025615a1c2df6ec1675d8f9d7757432a49, which was only merged a couple of hours ago. Meaning, the compatible PyTorch nightly version is torch==1.14.0.dev20221213.
Could you update that and see if the error persists?
@pmeier Thanks, will try it out. How did you figure out the commit for torch from this information by the way?
[conda] torch 1.14.0.dev20221027+cu116 pypi_0 pypi
does not contain the commit information for pytorch. And, how did you go from the pytorch commit to the compatible torchvision version?
And yes, I installed torchvision from source.
How did you figure out the commit for torch from this information by the way?
I didn't. 0dceac025615a1c2df6ec1675d8f9d7757432a49 is a torchvision commit. The information you get here is the date of the nightly, i.e. 20221027 -> Oct 27, 2022. With that you can go to the nightly branch of pytorch/pytorch and look up the commit. For the example above, this is pytorch/pytorch@21bef8e944c90cdf98c2ead4369410db252944e1.
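The date lookup described above can be sketched in a few lines of Python. This is only an illustration: the function name `nightly_date` is hypothetical, and it assumes the nightly version string carries a `.devYYYYMMDD` segment, as the versions quoted in this thread do.

```python
import re
from datetime import date


def nightly_date(version: str) -> date:
    """Return the build date encoded in a nightly version string.

    Assumes the '.devYYYYMMDD' layout seen in this thread,
    e.g. '1.14.0.dev20221027+cu116'.
    """
    match = re.search(r"\.dev(\d{4})(\d{2})(\d{2})", version)
    if match is None:
        raise ValueError(f"{version!r} does not look like a nightly version")
    year, month, day = (int(part) for part in match.groups())
    return date(year, month, day)


print(nightly_date("1.14.0.dev20221027+cu116"))  # 2022-10-27
```

The resulting date is what you would then look up by hand on the nightly branch of pytorch/pytorch; the sketch does not query GitHub.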
And, how did you go from the pytorch commit to the compatible torchvision version?
Installing from source gives you the first part of the commit hash that was built into the version, i.e. 0dceac0. If you append that to https://github.com/pytorch/vision/commit/, e.g. https://github.com/pytorch/vision/commit/0dceac0, you can find it on GH. Looking at the nightly branch for torchvision, you find it for Dec 13, 2022. Meaning, the compatible torch nightly is the one from the same day, i.e. torch==1.14.0.dev20221213.
Please note that this lookup from commit to nightly date is not guaranteed to work. Above we got lucky since the commit we were looking for was actually the last one that was included in that nightly. In general that does not need to be the case. So you often need some back and forth to find the correct date.
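As a rough sketch of the first half of that lookup, the snippet below turns the version of a source build into the GitHub commit URL. The helper name `source_build_commit_url` is made up for illustration, and it assumes the part after `+` is a short commit hash, which holds for the source builds in this thread but not for nightly wheels (there it is a tag like `cu116`).

```python
import re


def source_build_commit_url(version: str, repo: str = "pytorch/vision") -> str:
    """Build the GitHub commit URL for a source-built version string.

    Assumes the local version segment (after '+') is a short commit hash,
    e.g. '0.15.0a0+0dceac0'.
    """
    _, _, local = version.partition("+")
    if not re.fullmatch(r"[0-9a-f]{7,40}", local):
        raise ValueError(f"{version!r} carries no commit hash")
    return f"https://github.com/{repo}/commit/{local}"


print(source_build_commit_url("0.15.0a0+0dceac0"))
# https://github.com/pytorch/vision/commit/0dceac0
```

Mapping the commit to a nightly date still has to be done by browsing the nightly branch, with the caveat above that the commit need not be the last one of a nightly.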
Hm, there's no torch with version 1.14.0.dev20221213+cu116
, I can probably build without cuda support -- but -- I do need cuda.
Not sure what to do here. Will try to build with cuda+117 and see if that helps.
Update: no cuda+117 either.
My bad, they switched the versioning scheme for the upcoming 2.0 release. The nightly you are looking for is torch-2.0.0.dev20221213+cu116
. You can
pip install https://download.pytorch.org/whl/nightly/cu116/torch-2.0.0.dev20221213%2Bcu116-cp39-cp39-linux_x86_64.whl
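The wheel URL above follows a visible pattern, which the sketch below reproduces. It is an assumption drawn from this single URL rather than a documented scheme, and the function name and defaults (`cp39`, `linux_x86_64`) are illustrative only. Note that the `+` in the version has to be percent-encoded as `%2B`.

```python
from urllib.parse import quote


def nightly_wheel_url(version: str, python_tag: str = "cp39",
                      platform: str = "linux_x86_64") -> str:
    """Assemble a nightly wheel URL in the layout used in this thread."""
    _, _, cuda_tag = version.partition("+")
    if not cuda_tag:
        raise ValueError(f"{version!r} has no '+cuXXX' suffix")
    filename = f"torch-{version}-{python_tag}-{python_tag}-{platform}.whl"
    # quote() percent-encodes '+' as '%2B' and leaves '-', '.', '_' untouched.
    return f"https://download.pytorch.org/whl/nightly/{cuda_tag}/{quote(filename)}"


print(nightly_wheel_url("2.0.0.dev20221213+cu116"))
```

Whether a wheel actually exists at the resulting URL depends on the nightly still being hosted.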
Still getting the error. It also seems like packages like torch==1.14.0.dev20221027+cu116
have been removed from pypi?
New collect_env output:
Python version: 3.9.15 (main, Nov 24 2022, 14:31:59) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-56-generic-x86_64-with-glibc2.31
Is CUDA available: False
CUDA runtime version: 11.6.124
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.4.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Versions of relevant libraries:
[pip3] mypy-boto3-cloudformation==1.25.4
[pip3] mypy-boto3-dynamodb==1.25.0
[pip3] mypy-boto3-ec2==1.25.5
[pip3] mypy-boto3-lambda==1.25.0
[pip3] mypy-boto3-rds==1.25.1
[pip3] mypy-boto3-s3==1.25.0
[pip3] mypy-boto3-sqs==1.25.0
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.23.5
[pip3] pytorch-triton==2.0.0+0d7e753227
[pip3] torch==2.0.0.dev20221213+cu116
[pip3] torch-tb-profiler==0.4.0
[pip3] torchdata==0.6.0.dev20230101
[pip3] torchsnapshot-nightly==2022.10.29
[pip3] torchtriton==2.0.0+0d7e753227
[pip3] torchvision==0.15.0a0+edb3a80
[conda] numpy 1.23.5 pypi_0 pypi
[conda] pytorch-triton 2.0.0+0d7e753227 pypi_0 pypi
[conda] torch 2.0.0.dev20221213+cu116 pypi_0 pypi
[conda] torch-tb-profiler 0.4.0 pypi_0 pypi
[conda] torchdata 0.6.0.dev20230101 pypi_0 pypi
[conda] torchsnapshot-nightly 2022.10.29 pypi_0 pypi
[conda] torchtriton 2.0.0+0d7e753227 pypi_0 pypi
[conda] torchvision 0.15.0a0+edb3a80 pypi_0 pypi
Still getting the error.
Your env is still broken. The torchvision commit is from Oct 25, 2022 while the torch nightly is from Dec 13, 2022. Please get that sorted out, for example by a clean install, before we debug any further.
It also seems like packages like
torch==1.14.0.dev20221027+cu116
have been removed from pypi?
The nightly releases have never been on PyPI, but only on our index. This is why you have to use the --extra-index-url option when installing.
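For scripted setups, the install command with the extra index can be assembled as below. This is a sketch under the assumption that the index URL mirrors the CUDA tag in the version, as in the commands quoted elsewhere in this thread; the helper name is hypothetical.

```python
import sys


def nightly_install_command(version: str) -> list:
    """Return a pip command line that installs a torch nightly.

    The nightly index is passed via --extra-index-url because the
    nightlies are not published on PyPI.
    """
    _, _, cuda_tag = version.partition("+")
    if not cuda_tag:
        raise ValueError(f"{version!r} has no '+cuXXX' suffix")
    index = f"https://download.pytorch.org/whl/nightly/{cuda_tag}"
    return [sys.executable, "-m", "pip", "install", "--pre",
            f"torch=={version}", "--extra-index-url", index]


print(" ".join(nightly_install_command("2.0.0.dev20230120+cu118")))
```

The list can be handed to `subprocess.run(..., check=True)`; invoking pip through `sys.executable -m pip` keeps the install tied to the interpreter that runs the script.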
Here's a fixed version (I think):
[pip3] torch==2.0.0.dev20230120+cu116
[pip3] torch-tb-profiler==0.4.0
[pip3] torchdata==0.6.0.dev20230101
[pip3] torchsnapshot-nightly==2022.10.29
[pip3] torchvision==0.15.0a0+d2d448c
[conda] numpy 1.23.5 pypi_0 pypi
[conda] pytorch-triton 2.0.0+0d7e753227 pypi_0 pypi
[conda] torch 2.0.0.dev20230120+cu116 pypi_0 pypi
[conda] torch-tb-profiler 0.4.0 pypi_0 pypi
[conda] torchdata 0.6.0.dev20230101 pypi_0 pypi
[conda] torchsnapshot-nightly 2022.10.29 pypi_0 pypi
[conda] torchvision 0.15.0a0+d2d448c pypi_0 pypi
Pillow/Pillow-SIMD version: 9.0.0.post1
torchvision is using commit: https://github.com/pytorch/vision/commit/d2d448c71b4cb054d160000a0f63eecad7867bdb, which I believe is a commit on January 20th at 7:58 AM EST. Meanwhile, torch is using 20230120, which I think is "2023-01-20".
Yet, I'm still getting the same error.
@pmeier
Thanks for confirming. Could you look for the image.so file in the installed torchvision folder? It should be directly in there like lib/python3.X/site-packages/torchvision/image.so. Or if you used an editable install, torchvision/image.so in the repository. If it is there, could you post the output of ldd image.so?
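One way to locate the file without importing the package (and so without triggering the warning) is sketched below. The helper name is made up, and the glob pattern `image*.so` is an assumption meant to cover both a plain `image.so` and a name with a platform suffix.

```python
import importlib.util
from pathlib import Path


def find_extension(package: str, stem: str = "image") -> list:
    """List compiled '<stem>*.so' files directly inside a package folder.

    Uses find_spec, so a top-level package is located but not imported.
    """
    spec = importlib.util.find_spec(package)
    if spec is None or not spec.submodule_search_locations:
        return []
    found = []
    for location in spec.submodule_search_locations:
        found.extend(sorted(Path(location).glob(f"{stem}*.so")))
    return found


print(find_extension("torchvision"))
```

An empty list means either that the package is not importable from the current directory or that it was built without the extension. Because `find_spec` follows the normal import order, running this from inside the repository checks the local folder rather than the installed copy.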
Here it is:
linux-vdso.so.1 (0x00007fff693ed000)
libpng16.so.16 => /home/ray/anaconda3/lib/libpng16.so.16 (0x00007f7a0e3f1000)
libjpeg.so.8 => /home/ray/anaconda3/lib/libjpeg.so.8 (0x00007f7a0e345000)
libc10.so => not found
libtorch_cpu.so => not found
libstdc++.so.6 => /home/ray/anaconda3/lib/libstdc++.so.6 (0x00007f7a0e12b000)
libgcc_s.so.1 => /home/ray/anaconda3/lib/libgcc_s.so.1 (0x00007f7a0e112000)
libpthread.so.0 => /usr/lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f7a0e0ed000)
libc.so.6 => /usr/lib/x86_64-linux-gnu/libc.so.6 (0x00007f7a0defb000)
libz.so.1 => /home/ray/anaconda3/lib/./libz.so.1 (0x00007f7a0dedd000)
libm.so.6 => /usr/lib/x86_64-linux-gnu/libm.so.6 (0x00007f7a0dd8e000)
/lib64/ld-linux-x86-64.so.2 (0x00007f7a0e44d000)
So far I cannot reproduce your error. Here is what I did:
docker run -it ubuntu:20.04
Install Miniconda
conda create -n tv-7036 python=3.9 gcc_linux-64 gxx_linux-64 ninja libpng jpeg numpy
pip install https://download.pytorch.org/whl/nightly/cu116/torch-2.0.0.dev20230120%2Bcu116-cp39-cp39-linux_x86_64.whl --index-url https://download.pytorch.org/whl/nightly/cu116
git clone https://github.com/pytorch/vision and git checkout d2d448c71b4cb054d160000a0f63eecad7867bdb
python setup.py install
The setup completes without issues. If I now do python -c "import torchvision", I get
/vision/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension:
warn(f"Failed to load image Python extension: {e}")
/vision/torchvision/__init__.py:25: UserWarning: You are importing torchvision within its own root folder (/vision). This is not expected to work and may give errors. Please exit the torchvision project source and relaunch your python interpreter.
warnings.warn(message.format(os.getcwd()))
That is expected (Python prioritizes the local torchvision folder over what is installed) and the warning tells us what to do: change to a different directory and do it again. After that the command comes back clean. For comparison here is the output of ldd:
8c-py3.9-linux-x86_64.egg/torchvision# ldd image.so
linux-vdso.so.1 (0x00007ffc865c3000)
libpng16.so.16 => /root/miniconda3/envs/tv-7036/lib/libpng16.so.16 (0x00007f3063c48000)
libjpeg.so.9 => /root/miniconda3/envs/tv-7036/lib/libjpeg.so.9 (0x00007f3063c0a000)
libc10.so => not found
libtorch_cpu.so => not found
libstdc++.so.6 => /root/miniconda3/envs/tv-7036/lib/libstdc++.so.6 (0x00007f30639f3000)
libgcc_s.so.1 => /root/miniconda3/envs/tv-7036/lib/libgcc_s.so.1 (0x00007f30639d9000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f30637e5000)
libz.so.1 => /root/miniconda3/envs/tv-7036/lib/./libz.so.1 (0x00007f30637c7000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f3063678000)
/lib64/ld-linux-x86-64.so.2 (0x00007f3063ca1000)
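To compare two ldd listings like the ones in this thread, the unresolved entries can be pulled out with a short helper. This is a sketch with a hypothetical function name; it only parses text that was already captured and does not run ldd itself.

```python
def unresolved_libraries(ldd_output: str) -> list:
    """Return the library names that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        name, arrow, target = line.strip().partition(" => ")
        if arrow and target.strip() == "not found":
            missing.append(name)
    return missing


sample = """\
libpng16.so.16 => /root/miniconda3/envs/tv-7036/lib/libpng16.so.16 (0x00007f3063c48000)
libc10.so => not found
libtorch_cpu.so => not found
"""
print(unresolved_libraries(sample))  # ['libc10.so', 'libtorch_cpu.so']
```

Both listings in this thread show the same two unresolved libraries, libc10.so and libtorch_cpu.so, including the one from the environment where the import succeeds. On its own that result therefore does not separate the broken setup from the working one.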
Looking at your error message undefined symbol: _ZN3c106detail19maybe_wrap_dim_slowEllb, I still think this is a problem with your env. We don't have a maybe_wrap_dim_slow in our repository, but PyTorch has. The commit that dealt with that is roughly 3 months old, which coincides with your initial report that used binaries from late Oct 2022. Has the error message changed somehow since the original report? Otherwise, my guess is that you still have the old torch somewhere and it is shadowing the new one.
How did you install PyTorch? Did you use a clean environment?
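A quick way to check for a shadowing copy is to list every installed distribution named torch that the current interpreter can see. The sketch below uses the standard library's importlib.metadata; the function name is hypothetical.

```python
from importlib import metadata


def installed_copies(name: str) -> list:
    """Return (version, location) for every distribution called `name`
    that is visible on the current interpreter's sys.path."""
    copies = []
    for dist in metadata.distributions():
        dist_name = dist.metadata["Name"]
        if dist_name and dist_name.lower() == name.lower():
            copies.append((dist.version, str(dist.locate_file(""))))
    return copies


for version, location in installed_copies("torch"):
    print(version, location)
```

More than one entry, or an entry with an unexpected version, would point at a stale install. The check only covers sys.path of the interpreter that runs it, so libraries picked up through other routes (for example a different environment's lib directory) would not show up.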
@pmeier thanks for the very detailed instructions.
Here is a simple Dockerfile that reproduces the issue (my main Dockerfile is much too complex):
FROM nvidia/cuda:11.8.0-devel-ubuntu22.04
RUN apt-get update && apt-get install -y software-properties-common ninja-build git curl
RUN add-apt-repository ppa:deadsnakes/ppa
RUN DEBIAN_FRONTEND=noninteractive TZ=Etc/UTC apt-get -y install tzdata
RUN apt-get update && apt-get install -y python3.9 python3.9-distutils python3.9-dev
RUN curl -fsSL https://bootstrap.pypa.io/get-pip.py | python3.9 - && \
pip3.9 install --no-cache-dir --upgrade pip setuptools
RUN apt install zlib1g-dev
# I believe you can comment out the Triton installation & the error will still reproduce
RUN git clone https://github.com/openai/triton.git \
&& cd triton/python \
&& git checkout d3e753b5c00bbae855b283adf3d3a5d6d1547830 \
&& python3.9 -m pip install cmake \
&& python3.9 -m pip wheel --wheel-dir /tmp/dist . --verbose
RUN pip install torch==2.0.0.dev20230120+cu118 --pre --extra-index-url https://download.pytorch.org/whl/nightly/cu118
RUN pip install --no-deps git+https://github.com/pytorch/vision.git@d2d448c71b4cb054d160000a0f63eecad7867bdb
RUN pip install --no-deps $(ls /tmp/dist | grep triton | xargs -I {} echo /tmp/dist/{})
RUN ln -s $(which python3.9) /usr/bin/python
I'm using the torch nightly 20230120 with commit https://github.com/pytorch/vision/commit/d2d448c71b4cb054d160000a0f63eecad7867bdb in the Dockerfile. The error I get on load is:
/usr/local/lib/python3.9/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension:
warn(f"Failed to load image Python extension: {e}")
Although, this time the dim_slow message is not appearing.
import torchvision leads to ModuleNotFoundError: No module named 'numpy'. That comes as no surprise since you installed with --no-deps set.
pip install'ing torchvision is not officially supported. Please use python setup.py install.
There is no image.so in the installation. Meaning, you most likely didn't have libpng or libjpeg installed at build time and thus the setup simply built without the image extension. This also means the ldd output from https://github.com/pytorch/vision/issues/7036#issuecomment-1405940025 did not come from the container in question.
Here is a Dockerfile that strips out the unnecessary things from yours and builds the image extension just fine:
FROM nvidia/cuda:11.8.0-devel-ubuntu22.04
RUN apt-get update && apt-get install -y software-properties-common ninja-build git curl
RUN add-apt-repository ppa:deadsnakes/ppa
RUN DEBIAN_FRONTEND=noninteractive TZ=Etc/UTC apt-get -y install tzdata
RUN apt-get update && apt-get install -y python3.9 python3.9-distutils python3.9-dev
RUN apt-get update && apt-get install -y libjpeg-dev libpng-dev
RUN curl -fsSL https://bootstrap.pypa.io/get-pip.py | python3.9 - && \
pip3.9 install --no-cache-dir --upgrade pip setuptools
RUN pip install torch==2.0.0.dev20230120+cu118 --pre --extra-index-url https://download.pytorch.org/whl/nightly/cu118
RUN git clone https://github.com/pytorch/vision.git
WORKDIR /vision
RUN git checkout d2d448c71b4cb054d160000a0f63eecad7867bdb
# The version installed on the system is something like `1.build1`, which cannot be parsed by
# `pkg_resources.get_distribution` and ultimately fails the install
RUN pip install --upgrade distro-info==1.0
# I had to manually specify the library path, because otherwise torchvision would be installed in
# /usr/lib/python3.9/site-packages/, which is not recognized by the system interpreter
RUN python3.9 setup.py install --install-lib /usr/local/lib/python3.9/dist-packages/
WORKDIR /
RUN ln -s $(which python3.9) /usr/bin/python
Since the issue is either an environment problem or stems from the fact that you are using an unsupported installation command, I'm closing this. Please make sure to include all the relevant information from the get go to avoid us chasing ghosts while debugging.
Same here, with the latest torchvision version 0.16.2.
🐛 Describe the bug
gives the warning:
Versions
Collecting environment information...
PyTorch version: 1.14.0.dev20221027+cu116
Is debug build: False
CUDA used to build PyTorch: 11.6
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.5 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31
Python version: 3.9.15 (main, Nov 24 2022, 14:31:59) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-53-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA GeForce GTX 1080 Ti
GPU 1: NVIDIA GeForce GTX 1080 Ti
Nvidia driver version: 515.65.01
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.4.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Versions of relevant libraries:
[pip3] mypy-boto3-cloudformation==1.25.4
[pip3] mypy-boto3-dynamodb==1.25.0
[pip3] mypy-boto3-ec2==1.25.5
[pip3] mypy-boto3-lambda==1.25.0
[pip3] mypy-boto3-rds==1.25.1
[pip3] mypy-boto3-s3==1.25.0
[pip3] mypy-boto3-sqs==1.25.0
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.23.5
[pip3] torch==1.14.0.dev20221027+cu116
[pip3] torch-tb-profiler==0.4.0
[pip3] torchdata==0.6.0.dev20221027
[pip3] torchsnapshot-nightly==2022.10.29
[pip3] torchvision==0.15.0a0+0dceac0
[conda] numpy 1.23.5 pypi_0 pypi
[conda] torch 1.14.0.dev20221027+cu116 pypi_0 pypi
[conda] torch-tb-profiler 0.4.0 pypi_0 pypi
[conda] torchdata 0.6.0.dev20221027 pypi_0 pypi
[conda] torchsnapshot-nightly 2022.10.29 pypi_0 pypi
[conda] torchvision 0.15.0a0+0dceac0 pypi_0 pypi