Failed to load image extension

vedantroy commented 1 year ago

🐛 Describe the bug

import torchvision

gives the warning:

/home/ray/anaconda3/lib/python3.9/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: /home/ray/anaconda3/lib/python3.9/site-packages/torchvision/image.so: undefined symbol: _ZN3c106detail19maybe_wrap_dim_slowEllb

Versions

Collecting environment information...
PyTorch version: 1.14.0.dev20221027+cu116
Is debug build: False
CUDA used to build PyTorch: 11.6
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.5 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31

Python version: 3.9.15 (main, Nov 24 2022, 14:31:59) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-53-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA GeForce GTX 1080 Ti
GPU 1: NVIDIA GeForce GTX 1080 Ti

Nvidia driver version: 515.65.01
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.4.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] mypy-boto3-cloudformation==1.25.4
[pip3] mypy-boto3-dynamodb==1.25.0
[pip3] mypy-boto3-ec2==1.25.5
[pip3] mypy-boto3-lambda==1.25.0
[pip3] mypy-boto3-rds==1.25.1
[pip3] mypy-boto3-s3==1.25.0
[pip3] mypy-boto3-sqs==1.25.0
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.23.5
[pip3] torch==1.14.0.dev20221027+cu116
[pip3] torch-tb-profiler==0.4.0
[pip3] torchdata==0.6.0.dev20221027
[pip3] torchsnapshot-nightly==2022.10.29
[pip3] torchvision==0.15.0a0+0dceac0
[conda] numpy 1.23.5 pypi_0 pypi
[conda] torch 1.14.0.dev20221027+cu116 pypi_0 pypi
[conda] torch-tb-profiler 0.4.0 pypi_0 pypi
[conda] torchdata 0.6.0.dev20221027 pypi_0 pypi
[conda] torchsnapshot-nightly 2022.10.29 pypi_0 pypi
[conda] torchvision 0.15.0a0+0dceac0 pypi_0 pypi

vedantroy commented 1 year ago

Is there a guideline for what the matching version of torchvision should be for a given torch commit?

pmeier commented 1 year ago

In general, the master of PyTorch is compatible with the main of TorchVision. Nightlies are compatible for the same day. You have

[pip3] torch==1.14.0.dev20221027+cu116

and

[conda] torchvision 0.15.0a0+0dceac0 pypi_0 pypi

I'm guessing this is a bug in the collection and you actually installed torchvision from source, right? The commit should be 0dceac025615a1c2df6ec1675d8f9d7757432a49, which was only merged a couple of hours ago. That means the compatible PyTorch nightly version is torch==1.14.0.dev20221213.

Could you update that and see if the error persists?

vedantroy commented 1 year ago

@pmeier Thanks, will try it out. How did you figure out the commit for torch from this information by the way?

[conda] torch 1.14.0.dev20221027+cu116 pypi_0 pypi

does not contain the commit information for pytorch. And, how did you go from the pytorch commit to the compatible torchvision version?

And yes, I installed torchvision from source.

pmeier commented 1 year ago

How did you figure out the commit for torch from this information by the way?

I didn't. 0dceac025615a1c2df6ec1675d8f9d7757432a49 is a torchvision commit. The information you get here is the date of the nightly, i.e. 20221027 -> Oct 27, 2022. With that you can go to the nightly branch of pytorch/pytorch and look up the commit. For the example above, this is pytorch/pytorch@21bef8e944c90cdf98c2ead4369410db252944e1.
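As a rough illustration of the date lookup described above, the nightly date can be read straight out of the version string. This is a minimal sketch; the helper name `nightly_date` is hypothetical and it only assumes the `.devYYYYMMDD` pattern seen in the versions quoted in this thread.

```python
import re
from datetime import date, datetime


def nightly_date(version: str) -> date:
    """Return the build date embedded in a nightly version string.

    Assumes the '.devYYYYMMDD' pattern seen in this thread,
    e.g. '1.14.0.dev20221027+cu116'.
    """
    match = re.search(r"\.dev(\d{8})", version)
    if match is None:
        raise ValueError(f"no nightly date found in {version!r}")
    return datetime.strptime(match.group(1), "%Y%m%d").date()


print(nightly_date("1.14.0.dev20221027+cu116"))  # 2022-10-27
```

A version without a `.dev` date, such as a source build like 0.15.0a0+0dceac0, raises a ValueError instead of guessing.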

And, how did you go from the pytorch commit to the compatible torchvision version?

Installing from source gives you the first part of the commit hash that was built into the version, i.e. 0dceac0. If you append that to https://github.com/pytorch/vision/commit/, e.g. https://github.com/pytorch/vision/commit/0dceac0, you can find it on GH. Looking at the nightly branch for torchvision, you find it for Dec 13, 2022. Meaning, the compatible torch nightly is the one from the same day, i.e. torch==1.14.0.dev20221213.

Please note that this lookup from commit to nightly date is not guaranteed to work. Above we got lucky since the commit we were looking for was actually the last one that was included in that nightly. In general that does not need to be the case, so you often need some back and forth to find the correct date.
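The hash-to-URL step above can be sketched in a few lines. The helper name `commit_url` is hypothetical, and the sketch assumes the part after `+` in a source-build version is an abbreviated commit hash, as in 0.15.0a0+0dceac0. It deliberately rejects suffixes such as `cu116`, which are not commit hashes.

```python
import re


def commit_url(version: str, repo: str = "pytorch/vision") -> str:
    """Build a GitHub commit URL from a source-build version string."""
    _, _, local = version.partition("+")
    # Accept only something that looks like an abbreviated or full git hash.
    if not re.fullmatch(r"[0-9a-f]{7,40}", local):
        raise ValueError(f"no commit hash found in {version!r}")
    return f"https://github.com/{repo}/commit/{local}"


print(commit_url("0.15.0a0+0dceac0"))
# https://github.com/pytorch/vision/commit/0dceac0
```

This only produces the URL. Mapping the commit to a nightly date still requires looking at the nightly branch by hand, with the caveats mentioned above.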

vedantroy commented 1 year ago

Hm, there's no torch with version 1.14.0.dev20221213+cu116. I could probably build without CUDA support, but I do need CUDA.

Not sure what to do here. Will try the +cu117 build and see if that helps.

Update: no +cu117 build either.

pmeier commented 1 year ago

My bad, they switched the versioning scheme for the upcoming 2.0 release. The nightly you are looking for is torch-2.0.0.dev20221213+cu116. You can install it with

pip install https://download.pytorch.org/whl/nightly/cu116/torch-2.0.0.dev20221213%2Bcu116-cp39-cp39-linux_x86_64.whl
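For other dates or CUDA variants, the wheel URL can be assembled the same way. This sketch is inferred from the single URL above and is not an official naming scheme. The helper name `nightly_wheel_url` and its parameters are hypothetical, and a given combination may simply not exist on the index.

```python
from urllib.parse import quote


def nightly_wheel_url(version: str, cuda: str = "cu116",
                      py_tag: str = "cp39",
                      platform: str = "linux_x86_64") -> str:
    """Assemble a nightly wheel URL following the pattern of the URL above."""
    filename = f"torch-{version}+{cuda}-{py_tag}-{py_tag}-{platform}.whl"
    # quote() percent-encodes the '+' as %2B, matching the URL in the thread.
    return f"https://download.pytorch.org/whl/nightly/{cuda}/{quote(filename)}"


print(nightly_wheel_url("2.0.0.dev20221213"))
```

With the defaults, this reproduces the URL given above.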

vedantroy commented 1 year ago

Still getting the error. It also seems like packages like torch==1.14.0.dev20221027+cu116 have been removed from pypi?

New collect_env output:

Python version: 3.9.15 (main, Nov 24 2022, 14:31:59)  [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-56-generic-x86_64-with-glibc2.31
Is CUDA available: False
CUDA runtime version: 11.6.124
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.4.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] mypy-boto3-cloudformation==1.25.4
[pip3] mypy-boto3-dynamodb==1.25.0
[pip3] mypy-boto3-ec2==1.25.5
[pip3] mypy-boto3-lambda==1.25.0
[pip3] mypy-boto3-rds==1.25.1
[pip3] mypy-boto3-s3==1.25.0
[pip3] mypy-boto3-sqs==1.25.0
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.23.5
[pip3] pytorch-triton==2.0.0+0d7e753227
[pip3] torch==2.0.0.dev20221213+cu116
[pip3] torch-tb-profiler==0.4.0
[pip3] torchdata==0.6.0.dev20230101
[pip3] torchsnapshot-nightly==2022.10.29
[pip3] torchtriton==2.0.0+0d7e753227
[pip3] torchvision==0.15.0a0+edb3a80
[conda] numpy                     1.23.5                   pypi_0    pypi
[conda] pytorch-triton            2.0.0+0d7e753227          pypi_0    pypi
[conda] torch                     2.0.0.dev20221213+cu116          pypi_0    pypi
[conda] torch-tb-profiler         0.4.0                    pypi_0    pypi
[conda] torchdata                 0.6.0.dev20230101          pypi_0    pypi
[conda] torchsnapshot-nightly     2022.10.29               pypi_0    pypi
[conda] torchtriton               2.0.0+0d7e753227          pypi_0    pypi
[conda] torchvision               0.15.0a0+edb3a80          pypi_0    pypi

pmeier commented 1 year ago

Still getting the error.

Your env is still broken. The torchvision commit is from Oct 25, 2022, while the torch nightly is from Dec 13, 2022. Please get that sorted out, for example with a clean install, before we debug any further.

It also seems like packages like torch==1.14.0.dev20221027+cu116 have been removed from pypi?

The nightly releases have never been on PyPI, but only on our index. This is why you have to use the --extra-index-url option when installing.

vedantroy commented 1 year ago

Here's a fixed version (I think):

[pip3] torch==2.0.0.dev20230120+cu116
[pip3] torch-tb-profiler==0.4.0
[pip3] torchdata==0.6.0.dev20230101
[pip3] torchsnapshot-nightly==2022.10.29
[pip3] torchvision==0.15.0a0+d2d448c
[conda] numpy                     1.23.5                   pypi_0    pypi
[conda] pytorch-triton            2.0.0+0d7e753227          pypi_0    pypi
[conda] torch                     2.0.0.dev20230120+cu116          pypi_0    pypi
[conda] torch-tb-profiler         0.4.0                    pypi_0    pypi
[conda] torchdata                 0.6.0.dev20230101          pypi_0    pypi
[conda] torchsnapshot-nightly     2022.10.29               pypi_0    pypi
[conda] torchvision               0.15.0a0+d2d448c          pypi_0    pypi
Pillow/Pillow-SIMD version: 9.0.0.post1

torchvision is using commit: https://github.com/pytorch/vision/commit/d2d448c71b4cb054d160000a0f63eecad7867bdb, which I believe is a commit on January 20th at 7:58 AM EST. Meanwhile, torch is using 20230120, which I think is "2023-01-20".

Yet, I'm still getting the same error.

@pmeier

pmeier commented 1 year ago

Thanks for confirming. Could you look for the image.so file in the installed torchvision folder? It should be directly in there like lib/python3.X/site-packages/torchvision/image.so. Or if you used an editable install, torchvision/image.so in the repository. If it is there, could you post the output of ldd image.so?
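One way to locate the file without importing torchvision (and so without triggering the warning) is sketched below. The helper name `find_extension` is hypothetical. The `image*.so` glob is an assumption that also tolerates an ABI-tagged file name, and it will not match on platforms that use a different extension suffix.

```python
from importlib.util import find_spec
from pathlib import Path


def find_extension(package: str = "torchvision", pattern: str = "image*.so"):
    """Return files matching `pattern` directly inside the package folder."""
    spec = find_spec(package)  # locates the package without importing it
    if spec is None or not spec.submodule_search_locations:
        return []
    found = []
    for location in spec.submodule_search_locations:
        found.extend(sorted(Path(location).glob(pattern)))
    return found


# An empty list means the package or the extension file was not found.
print(find_extension())
```

Any path it prints can then be passed to ldd.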

vedantroy commented 1 year ago

Thanks for confirming. Could you look for the image.so file in the installed torchvision folder? It should be directly in there like lib/python3.X/site-packages/torchvision/image.so. Or if you used an editable install, torchvision/image.so in the repository. If it is there, could you post the output of ldd image.so?

Here it is:

        linux-vdso.so.1 (0x00007fff693ed000)
        libpng16.so.16 => /home/ray/anaconda3/lib/libpng16.so.16 (0x00007f7a0e3f1000)
        libjpeg.so.8 => /home/ray/anaconda3/lib/libjpeg.so.8 (0x00007f7a0e345000)
        libc10.so => not found
        libtorch_cpu.so => not found
        libstdc++.so.6 => /home/ray/anaconda3/lib/libstdc++.so.6 (0x00007f7a0e12b000)
        libgcc_s.so.1 => /home/ray/anaconda3/lib/libgcc_s.so.1 (0x00007f7a0e112000)
        libpthread.so.0 => /usr/lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f7a0e0ed000)
        libc.so.6 => /usr/lib/x86_64-linux-gnu/libc.so.6 (0x00007f7a0defb000)
        libz.so.1 => /home/ray/anaconda3/lib/./libz.so.1 (0x00007f7a0dedd000)
        libm.so.6 => /usr/lib/x86_64-linux-gnu/libm.so.6 (0x00007f7a0dd8e000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f7a0e44d000)

pmeier commented 1 year ago

So far I cannot reproduce your error. Here is what I did:

Use a fresh container docker run -it ubuntu:20.04
Download and install Miniconda
Create a fresh env with the dependencies in it: conda create -n tv-7036 python=3.9 gcc_linux-64 gxx_linux-64 ninja libpng jpeg numpy
Install the specific PyTorch nightly wheel through pip with the official instructions pip install https://download.pytorch.org/whl/nightly/cu116/torch-2.0.0.dev20230120%2Bcu116-cp39-cp39-linux_x86_64.whl --index-url https://download.pytorch.org/whl/nightly/cu116
git clone https://github.com/pytorch/vision and git checkout d2d448c71b4cb054d160000a0f63eecad7867bdb
python setup.py install

The setup completes without issues. If I now do python -c "import torchvision", I get

/vision/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: 
  warn(f"Failed to load image Python extension: {e}")
/vision/torchvision/__init__.py:25: UserWarning: You are importing torchvision within its own root folder (/vision). This is not expected to work and may give errors. Please exit the torchvision project source and relaunch your python interpreter.
  warnings.warn(message.format(os.getcwd()))

That is expected (Python prioritizes the local torchvision folder over what is installed) and the warning tells us what to do: change to a different directory and do it again. After that the command comes back clean. For comparison here is the output of ldd

8c-py3.9-linux-x86_64.egg/torchvision# ldd image.so
        linux-vdso.so.1 (0x00007ffc865c3000)
        libpng16.so.16 => /root/miniconda3/envs/tv-7036/lib/libpng16.so.16 (0x00007f3063c48000)
        libjpeg.so.9 => /root/miniconda3/envs/tv-7036/lib/libjpeg.so.9 (0x00007f3063c0a000)
        libc10.so => not found
        libtorch_cpu.so => not found
        libstdc++.so.6 => /root/miniconda3/envs/tv-7036/lib/libstdc++.so.6 (0x00007f30639f3000)
        libgcc_s.so.1 => /root/miniconda3/envs/tv-7036/lib/libgcc_s.so.1 (0x00007f30639d9000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f30637e5000)
        libz.so.1 => /root/miniconda3/envs/tv-7036/lib/./libz.so.1 (0x00007f30637c7000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f3063678000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f3063ca1000)

Looking at your error message undefined symbol: _ZN3c106detail19maybe_wrap_dim_slowEllb, I still think this is a problem with your env. We don't have a maybe_wrap_dim_slow in our repository, but PyTorch does. The commit that dealt with that is roughly 3 months old, which coincides with your initial report that used binaries from late Oct 2022. Has the error message changed somehow since the original report? Otherwise, my guess is that you still have the old torch somewhere and it is shadowing the new one.
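To see why the symbol points at PyTorch rather than torchvision, it helps to decode the mangled name. The sketch below handles only a tiny subset of the Itanium C++ ABI mangling (a nested name followed by builtin parameter types). The helper name `demangle_simple` is hypothetical, and a real tool such as c++filt is the better choice for anything beyond this one symbol.

```python
# Parameter codes for a few builtin types in the Itanium C++ ABI mangling.
BUILTIN_TYPES = {
    "v": "void", "b": "bool", "c": "char", "s": "short", "i": "int",
    "j": "unsigned int", "l": "long", "m": "unsigned long",
    "x": "long long", "y": "unsigned long long", "f": "float", "d": "double",
}


def demangle_simple(symbol: str) -> str:
    """Decode '_ZN<len><name>...E<builtin codes>' into a readable signature."""
    if not symbol.startswith("_ZN"):
        raise ValueError("only nested names (_ZN...E) are handled")
    pos, parts = 3, []
    while pos < len(symbol) and symbol[pos] != "E":
        start = pos
        while pos < len(symbol) and symbol[pos].isdigit():
            pos += 1
        if pos == start:
            raise ValueError(f"unsupported construct at offset {pos}")
        length = int(symbol[start:pos])  # each name is length-prefixed
        parts.append(symbol[pos:pos + length])
        pos += length
    if pos >= len(symbol):
        raise ValueError("missing terminating 'E'")
    codes = symbol[pos + 1:]
    if any(code not in BUILTIN_TYPES for code in codes):
        raise ValueError(f"unsupported parameter encoding: {codes!r}")
    params = [BUILTIN_TYPES[code] for code in codes]
    if params == ["void"]:  # a lone 'v' means an empty parameter list
        params = []
    return f"{'::'.join(parts)}({', '.join(params)})"


print(demangle_simple("_ZN3c106detail19maybe_wrap_dim_slowEllb"))
# c10::detail::maybe_wrap_dim_slow(long, long, bool)
```

The c10 namespace lines up with the libc10.so dependency in the ldd output, which is consistent with the symbol being expected from the installed torch rather than from torchvision.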

How did you install PyTorch? Did you use a clean environment?
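A quick way to check for a shadowing install is to list every torch folder visible on sys.path. This is a minimal sketch; the helper name `find_installs` is hypothetical. It ignores import hooks, editable installs, and .pth-based redirects, so an empty or single-entry result is a hint, not proof of a clean environment.

```python
import sys
from pathlib import Path


def find_installs(package: str = "torch", search_path=None):
    """List package folders found on the search path, in lookup order."""
    paths = sys.path if search_path is None else search_path
    seen = []
    for entry in paths:
        # An empty sys.path entry stands for the current directory.
        candidate = Path(entry or ".") / package / "__init__.py"
        if candidate.is_file():
            resolved = candidate.parent.resolve()
            if resolved not in seen:
                seen.append(resolved)
    return seen


# More than one entry suggests that an older copy may shadow a newer one.
for location in find_installs():
    print(location)
```

The first entry is roughly what a plain import would pick up. The same function can be run with package="torchvision" to spot the local-folder shadowing described above.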

vedantroy commented 1 year ago

@pmeier thanks for the very detailed instructions.

Here is a simple Dockerfile that reproduces the issue (my main Dockerfile is much too complex):

FROM nvidia/cuda:11.8.0-devel-ubuntu22.04

RUN apt-get update && apt-get install -y software-properties-common ninja-build git curl
RUN add-apt-repository ppa:deadsnakes/ppa
RUN DEBIAN_FRONTEND=noninteractive TZ=Etc/UTC apt-get -y install tzdata
RUN apt-get update && apt-get install -y python3.9 python3.9-distutils python3.9-dev
RUN curl -fsSL https://bootstrap.pypa.io/get-pip.py | python3.9 - && \
    pip3.9 install --no-cache-dir --upgrade pip setuptools
RUN apt install zlib1g-dev
# I believe you can comment out the Triton installation & the error will still reproduce
RUN git clone https://github.com/openai/triton.git \
    && cd triton/python \
    && git checkout d3e753b5c00bbae855b283adf3d3a5d6d1547830 \
    && python3.9 -m pip install cmake \ 
    && python3.9 -m pip wheel --wheel-dir /tmp/dist . --verbose

RUN pip install torch==2.0.0.dev20230120+cu118 --pre --extra-index-url https://download.pytorch.org/whl/nightly/cu118 
RUN pip install --no-deps git+https://github.com/pytorch/vision.git@d2d448c71b4cb054d160000a0f63eecad7867bdb
RUN pip install --no-deps $(ls /tmp/dist | grep triton | xargs -I {} echo /tmp/dist/{}) 

RUN ln -s $(which python3.9) /usr/bin/python

I'm installing version 20230120 with commit https://github.com/pytorch/vision/commit/d2d448c71b4cb054d160000a0f63eecad7867bdb in the Dockerfile.

The error I get on load is:

/usr/local/lib/python3.9/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: 
  warn(f"Failed to load image Python extension: {e}")

Although, this time the dim_slow message is not appearing.

pmeier commented 1 year ago

The Dockerfile you have posted is not complete. Trying to import torchvision leads to ModuleNotFoundError: No module named 'numpy'. That comes as no surprise since you are installing with --no-deps set.
pip install'ing torchvision is not officially supported. Please use python setup.py install
Looking into the installed source in your container reveals that there is no image.so. Meaning, you most likely didn't have libpng or libjpeg installed at build time and thus the setup simply built without the image extension. This also means, the ldd output from https://github.com/pytorch/vision/issues/7036#issuecomment-1405940025 did not come from the container in question.
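A rough pre-build check for the two libraries is sketched below, with a hypothetical helper name `missing_image_libs`. Note the limitation: ctypes.util.find_library only looks for the runtime shared libraries, while the build needs the development headers as well, so a clean result here does not guarantee that the image extension will be built.

```python
from ctypes.util import find_library


def missing_image_libs(names=("png", "jpeg")):
    """Return the library names for which no shared library could be located."""
    return [name for name in names if find_library(name) is None]


missing = missing_image_libs()
if missing:
    print("not found:", ", ".join(missing))
else:
    print("libpng and libjpeg shared libraries were found")
```

If anything is reported as missing, installing the corresponding development packages before building torchvision is the first thing to try.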

Here is a Dockerfile that strips out the unnecessary things from yours and builds the image extension just fine

FROM nvidia/cuda:11.8.0-devel-ubuntu22.04

RUN apt-get update && apt-get install -y software-properties-common ninja-build git curl
RUN add-apt-repository ppa:deadsnakes/ppa
RUN DEBIAN_FRONTEND=noninteractive TZ=Etc/UTC apt-get -y install tzdata
RUN apt-get update && apt-get install -y python3.9 python3.9-distutils python3.9-dev
RUN apt-get update && apt-get install -y libjpeg-dev libpng-dev
RUN curl -fsSL https://bootstrap.pypa.io/get-pip.py | python3.9 - && \
    pip3.9 install --no-cache-dir --upgrade pip setuptools

RUN pip install torch==2.0.0.dev20230120+cu118 --pre --extra-index-url https://download.pytorch.org/whl/nightly/cu118
RUN git clone https://github.com/pytorch/vision.git
WORKDIR /vision
RUN git checkout d2d448c71b4cb054d160000a0f63eecad7867bdb
# The version installed on the system is something like `1.build1`, which cannot be parsed by
# `pkg_resources.get_distribution` and ultimately fails the install
RUN pip install --upgrade distro-info==1.0
# I had to manually specify the library path, because otherwise torchvision would be installed in
# /usr/lib/python3.9/site-packages/, which is not recognized by the system interpreter
RUN python3.9 setup.py install --install-lib /usr/local/lib/python3.9/dist-packages/
WORKDIR /

RUN ln -s $(which python3.9) /usr/bin/python

Since the issue is either an environment problem or stems from the fact that you are using an unsupported installation command, I'm closing this. Please make sure to include all the relevant information from the get-go to avoid us chasing ghosts while debugging.

Romeo-CC commented 9 months ago

Same here, with the latest torchvision version, 0.16.2.

pytorch / vision

Failed to load image extension #7036
