utterworks / fast-bert

Super easy library for BERT based NLP models
Apache License 2.0

Not finding CUDA when building docker image #287

Open Labulitiolle opened 3 years ago

Labulitiolle commented 3 years ago

https://github.com/utterworks/fast-bert/blob/439739cd821fae4c6f096ecff86c7d00f8be6004/container/Dockerfile#L66

When running the container/build_and_push.sh script, I get the following error:

> [13/19] RUN git clone https://github.com/NVIDIA/apex.git && cd apex && python setup.py install --cuda_ext --cpp_ext:                                                                         
#16 0.730 Cloning into 'apex'...                                                                                                                                                                
#16 6.454 No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'                                                                                                                           
#16 6.454 /opt/conda/lib/python3.7/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at  /pytorch/c10/cuda/CUDAFunctions.cpp:100.)                                                         
#16 6.454   return torch._C._cuda_getDeviceCount() > 0
#16 6.456 
#16 6.456 Warning: Torch did not find available GPUs on this system.
#16 6.456  If your intention is to cross-compile, this is not an error.
#16 6.456 By default, Apex will cross-compile for Pascal (compute capabilities 6.0, 6.1, 6.2),
#16 6.456 Volta (compute capability 7.0), Turing (compute capability 7.5),
#16 6.456 and, if the CUDA version is >= 11.0, Ampere (compute capability 8.0).
#16 6.456 If you wish to cross-compile for a single specific architecture,
#16 6.456 export TORCH_CUDA_ARCH_LIST="compute capability" before running setup.py.
#16 6.456 
#16 6.466 
#16 6.466 
#16 6.466 torch.__version__  = 1.7.1
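For what it's worth, the Apex warning above is saying the build falls back to cross-compiling because no GPU is visible during `docker build`. A minimal sketch of what it suggests, assuming the Dockerfile's Apex step (the `7.5` value is a hypothetical example for Turing; the right value depends on the GPU the image will actually run on):

```shell
# Pin the compute capability so Apex cross-compiles deterministically even
# though no GPU/driver is visible inside `docker build`.
# 7.5 = Turing; adjust for the target GPU (e.g. 7.0 for Volta, 8.0 for Ampere).
export TORCH_CUDA_ARCH_LIST="7.5"
echo "$TORCH_CUDA_ARCH_LIST"

# ...then run the existing Apex build step in the same shell/layer:
# git clone https://github.com/NVIDIA/apex.git && cd apex \
#     && python setup.py install --cuda_ext --cpp_ext
```

In a Dockerfile this would be an `ENV TORCH_CUDA_ARCH_LIST="7.5"` line before the Apex `RUN` step, so the warning itself should be harmless as long as the listed architectures cover the deployment GPU.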

... (skipping links)...

#12 1.206 Collecting torch
#12 1.207   Created temporary directory: /tmp/pip-unpack-3pmclieq
#12 1.209   Looking up "https://files.pythonhosted.org/packages/56/74/6fc9dee50f7c93d6b7d9644554bdc9692f3023fa5d1de779666e6bf8ae76/torch-1.8.1-cp37-cp37m-manylinux1_x86_64.whl" in the cache
#12 1.210   No cache entry available
#12 1.211   Starting new HTTPS connection (1): files.pythonhosted.org:443
#12 1.368   https://files.pythonhosted.org:443 "GET /packages/56/74/6fc9dee50f7c93d6b7d9644554bdc9692f3023fa5d1de779666e6bf8ae76/torch-1.8.1-cp37-cp37m-manylinux1_x86_64.whl HTTP/1.1" 200 804097215
#12 1.371   Downloading torch-1.8.1-cp37-cp37m-manylinux1_x86_64.whl (804.1 MB)
#12 74.25   Ignoring unknown cache-control directive: immutable
#12 74.25   Updating cache with response from "https://files.pythonhosted.org/packages/56/74/6fc9dee50f7c93d6b7d9644554bdc9692f3023fa5d1de779666e6bf8ae76/torch-1.8.1-cp37-cp37m-manylinux1_x86_64.whl"
#12 74.25   Caching due to etag
#12 80.17 Killed
------
executor failed running [/bin/sh -c pip install --trusted-host pypi.python.org -v --log /tmp/pip.log torch torchvision]: exit code: 137

Is there a version mismatch between torch and CUDA, or should a cache directory be defined?
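For context on the failure mode: exit code 137 is 128 + 9, i.e. the process received SIGKILL, which during a large download like the ~800 MB torch wheel usually means the kernel OOM killer (or the Docker memory limit) terminated pip rather than a torch/CUDA mismatch. This can be reproduced with a trivial self-kill:

```shell
# Exit code 137 = 128 + signal 9 (SIGKILL): the process was killed externally,
# commonly by the OOM killer when memory runs out mid-build.
sh -c 'kill -9 $$'
echo "exit code: $?"   # prints "exit code: 137"
```

If memory is the culprit, pip's real `--no-cache-dir` flag (which skips the wheel-caching step visible in the log) or raising the Docker build's memory allowance might be worth trying before chasing a version mismatch.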