torchvision[cuda10] installs default pytorch build, instead of cuda10 build

pytorch / vision

Datasets, Transforms and Models specific to Computer Vision

https://pytorch.org/vision

BSD 3-Clause "New" or "Revised" License

torchvision[cuda10] installs default pytorch build, instead of cuda10 build #949

Open barrh opened 5 years ago

barrh commented 5 years ago

torchvision and PyTorch must be built against the same environment, in particular the same CUDA version. Currently, when pip installs the cuda10 build of torchvision and torch is not already installed, it implicitly installs the default PyTorch build, which as of today is built against cuda9. The mismatch raises an error at run time.

For example, this single command works:

pip install torchvision[cuda10_package] torch[cuda10_package]

but this sequence does not:

pip install torchvision[cuda10_package]; pip install torch[cuda10_package]

and, similarly, neither does this one:

pip install torch[cuda10_package] torchvision; pip install --force torchvision[cuda10_package]

(Order matters.)

Can the requirements of each build be specified in such a way that pip will automatically look for the matching build of pytorch? Alternatively, is it possible for pip to independently decide the correct build based on environment variables?

fmassa commented 5 years ago

Good question. I think that with conda everything should work out fine, given that we have separated out the CUDA version into a different package.

I don't know how to solve the pip issue though. @soumith do you have an idea?

soumith commented 5 years ago

@barrh can you say what EXACT commands you are using in each case? It's not clear to me what you have tried, so getting clarity on that will help me understand what's going on.

barrh commented 5 years ago

@soumith The following installation scenarios will fail to run on a cuda10 system:

1. pip install https://download.pytorch.org/whl/cu100/torchvision-0.3.0-cp36-cp36m-linux_x86_64.whl; pip install https://download.pytorch.org/whl/cu100/torch-1.1.0-cp36-cp36m-linux_x86_64.whl

2. pip install https://download.pytorch.org/whl/cu100/torch-1.1.0-cp36-cp36m-linux_x86_64.whl torchvision; pip install --force https://download.pytorch.org/whl/cu100/torchvision-0.3.0-cp36-cp36m-linux_x86_64.whl
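For contrast with the failing sequences, the original report says that passing both packages to a single pip invocation works. A minimal Python sketch of that form follows; the helper names `pip_install_command` and `install_together` are hypothetical, the wheel URLs are the ones quoted in this comment, and whether the single invocation resolves correctly on a given system is taken from the report, not verified here.

```python
import subprocess
import sys

# Wheel URLs copied from the scenarios in this thread (cu100, Python 3.6, Linux).
TORCH_WHEEL = "https://download.pytorch.org/whl/cu100/torch-1.1.0-cp36-cp36m-linux_x86_64.whl"
TORCHVISION_WHEEL = "https://download.pytorch.org/whl/cu100/torchvision-0.3.0-cp36-cp36m-linux_x86_64.whl"


def pip_install_command(*requirements):
    """Build the argv for one pip invocation that installs all requirements together."""
    # Using sys.executable ensures pip runs for the current interpreter.
    return [sys.executable, "-m", "pip", "install", *requirements]


def install_together(*requirements):
    """Run a single pip install so pip sees every requirement at once (hypothetical helper)."""
    subprocess.run(pip_install_command(*requirements), check=True)
```

Calling `install_together(TORCH_WHEEL, TORCHVISION_WHEEL)` would run the one-command form; nothing is installed just by defining these functions.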

Edit: 2nd example.

soumith commented 5 years ago

Hmm, yeah, there isn't an easy pip fix for it, but there is a way to fix this. I opened an issue on PyTorch to address it, but it will take a bit of effort on my side: https://github.com/pytorch/pytorch/issues/19990

bblakeslee-maker commented 5 years ago

I ran into this issue over the weekend. It does have some consequences, as it causes some of the example PyTorch models (like the fast style transfer code sample) to fail to train. Unfortunately, I don't have the exact error message at hand, but it was a runtime error related to the CUBLAS library. It triggers on the features.bmm call in the gram_matrix function of the utils.py file (line 25).

However, I did find a temporary workaround. The official PyTorch site says to install torch first, then torchvision. If you reverse the install commands, then the incorrect version of torch will be installed with torchvision. Running the torch install command then overwrites the wrong torch version with the correct one. I haven't tested it thoroughly, but it does cause torch.version.cuda to report the correct version number (10.0.130) and the fast style transfer code starts to train.
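The check described above (reading `torch.version.cuda` after reinstalling) can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the comparison assumes the reported string has the `major.minor.patch` shape seen in this comment (10.0.130).

```python
def cuda_build_matches(reported, expected_major_minor):
    """Return True if a CUDA version string such as '10.0.130' starts with the expected 'major.minor'."""
    if reported is None:
        # CPU-only builds of torch report None for torch.version.cuda.
        return False
    return reported.split(".")[:2] == expected_major_minor.split(".")[:2]


def report_installed_torch(expected_major_minor="10.0"):
    """Print the CUDA version the installed torch was built against (hypothetical helper)."""
    try:
        import torch
    except ImportError:
        print("torch is not installed")
        return
    print("torch", torch.__version__, "built against CUDA", torch.version.cuda)
    print("matches expected:", cuda_build_matches(torch.version.cuda, expected_major_minor))
```

Running `report_installed_torch("10.0")` after the second install command shows whether the overwrite took effect.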

necromuralist commented 1 year ago

I just ran into (mostly) the same problem, although the versions are different since it's been three years. When I used the command given in PyTorch's instructions for Linux (Ubuntu 22.10), Stable (1.12.1), and CUDA 11.6:

pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116

it installed the right build of torchvision for CUDA 11.6 (0.13.1+cu116) but a torch build for the older CUDA 10.2 (1.12.1+cu102). @tyrian411's solution of re-installing torch by itself fixed it. Perhaps the documentation should be annotated in case people run into this again.
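A mismatch like the one above shows up in the local version tag (the part after `+`). The following Python sketch compares those tags across installed packages. The function names are hypothetical, and it assumes each build carries a `+cuXXX` style tag as in the versions quoted here; builds without a local tag are treated as having none.

```python
from importlib import metadata


def local_tag(version):
    """Return the part after '+' in a version like '1.12.1+cu102', or None if absent."""
    _, sep, tag = version.partition("+")
    return tag if sep else None


def find_mismatched_builds(versions):
    """Given {package: version}, return {package: tag} if the tags differ, else {}."""
    tags = {name: local_tag(version) for name, version in versions.items()}
    return tags if len(set(tags.values())) > 1 else {}


def installed_versions(packages=("torch", "torchvision", "torchaudio")):
    """Collect versions of the listed packages that are installed in this environment."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            pass  # skip packages that are not installed
    return found
```

`find_mismatched_builds(installed_versions())` returns an empty dict when every installed package shares the same tag, and the per-package tags otherwise.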