mistralai / mistral-inference

Official inference library for Mistral models
https://mistral.ai/
Apache License 2.0

The detected CUDA version (11.8) mismatches the version that was used to compile #76

Closed: juanmf closed this issue 9 months ago

juanmf commented 9 months ago
$ docker build deploy --build-arg MAX_JOBS=8
...
...

 => [4/8] RUN pip3 install "torch>=2.0.0"                                                                                                                                                            476.6s
 => ERROR [5/8] RUN git clone https://github.com/NVIDIA/apex &&     cd apex && git checkout 2386a912164b0c5cfcd8be7a2b890fbac5607c82 &&     sed -i '/check_cuda_torch_binary_vs_bare_metal(CUDA_HOME  15.9s 
-
...
14.87 No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
14.90 
14.90 Warning: Torch did not find available GPUs on this system.
14.90  If your intention is to cross-compile, this is not an error.
14.90 By default, Apex will cross-compile for Pascal (compute capabilities 6.0, 6.1, 6.2),
14.90 Volta (compute capability 7.0), Turing (compute capability 7.5),
14.90 and, if the CUDA version is >= 11.0, Ampere (compute capability 8.0).
14.90 If you wish to cross-compile for a single specific architecture,
14.90 export TORCH_CUDA_ARCH_LIST="compute capability" before running setup.py.
...
15.12   File "/usr/lib/python3.10/distutils/command/build_ext.py", line 340, in run
15.12     self.build_extensions()
15.12   File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 525, in build_extensions
15.12     _check_cuda_version(compiler_name, compiler_version)
15.12   File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 413, in _check_cuda_version
15.12     raise RuntimeError(CUDA_MISMATCH_MESSAGE.format(cuda_str_version, torch.version.cuda))
15.12 RuntimeError: 
15.12 The detected CUDA version (11.8) mismatches the version that was used to compile
15.12 PyTorch (12.1). Please make sure to use the same CUDA versions.

Am I supposed to install CUDA first? Does that make sense on a 2017 MacBook Pro?

Trying $ export TORCH_CUDA_ARCH_LIST="8.0" didn't help.
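
For what it's worth, the mismatch can be confirmed inside the build image by comparing the CUDA toolkit that apex's setup.py detects against the CUDA version the installed torch wheel was compiled with (a minimal sketch, assuming python3 and the base image's CUDA toolkit are on PATH):

$ nvcc --version | grep release                            # CUDA toolkit in the image, e.g. 11.8
$ python3 -c "import torch; print(torch.version.cuda)"     # CUDA version torch was built against, e.g. 12.1

apex refuses to build as soon as those two versions differ, which is exactly the RuntimeError above.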

juanmf commented 9 months ago

Changed the first line of the Dockerfile and it's looking good now. https://github.com/mistralai/mistral-src/blob/147c4e68279b90eb61b19bdea44e16f5539d5a5d/deploy/Dockerfile#L1 now reads: FROM --platform=amd64 nvcr.io/nvidia/cuda:12.1.0-devel-ubuntu22.04 as base

 => [internal] load metadata for nvcr.io/nvidia/cuda:12.1.0-devel-ubuntu22.04                                                                                                                          1.8s
 => [1/8] FROM nvcr.io/nvidia/cuda:12.1.0-devel-ubuntu22.04@sha256:e3a8f7b933e77ecee74731198a2a5483e965b585cea2660675cf4bb152237e9b                                                                  236.0s
 => => resolve nvcr.io/nvidia/cuda:12.1.0-devel-ubuntu22.04@sha256:e3a8f7b933e77ecee74731198a2a5483e965b585cea2660675cf4bb152237e9b   
...
 => [5/8] RUN git clone https://github.com/NVIDIA/apex &&     cd apex && git checkout 2386a912164b0c5cfcd8be7a2b890fbac5607c82 &&     sed -i '/check_cuda_torch_binary_vs_bare_metal(CUDA_HOME)/d'  1311.1s 

Running...

Looks like the combination of FROM --platform=amd64 nvcr.io/nvidia/cuda:11.8.0-devel-ubuntu22.04 as base and RUN pip3 install "torch>=2.0.0" is bound to keep causing this issue, since torch keeps releasing new versions and newer wheels are built against newer CUDA versions.
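
An alternative to chasing the base image might be to pin the torch wheel to the CUDA build that matches the base image instead (a sketch, assuming the original CUDA 11.8 base image is kept; download.pytorch.org/whl/cu118 is PyTorch's wheel index for CUDA 11.8 builds):

# install torch wheels built against CUDA 11.8 to match nvcr.io/nvidia/cuda:11.8.0-devel-ubuntu22.04
RUN pip3 install "torch>=2.0.0" --index-url https://download.pytorch.org/whl/cu118

With the 12.1 base image from the previous comment, the default torch wheels (built against CUDA 12.1 at the time) line up again, which is why that build succeeds.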

juanmf commented 9 months ago

That was it.