microsoft / aerial_wildlife_detection

Tools for detecting wildlife in aerial images using active learning
MIT License

"libc10_cuda.so: cannot open shared object file: No such file or directory" #48

Closed MattSkiff closed 3 years ago

MattSkiff commented 3 years ago

Hi there,

I've run the dockerised installation and I hit this issue whenever I try to import a model.

Otherwise everything works fine. Does anyone have any ideas about what is going on here? Looking over the output from the docker install (and the makefile), it seems like it installs CUDA 11.0? But the models included with AIDE require CUDA 10.0?

The full error I am getting is:

      File "/home/aide/app/modules/ModelMarketplace/backend/marketplaceWorker.py", line 354, in _import_model_state_file
        modelClass = get_class_executable(modelLibrary)
      File "/home/aide/app/util/helpers.py", line 100, in get_class_executable
        execFile = importlib.import_module(classPath)
      File "/opt/conda/lib/python3.8/importlib/__init__.py", line 127, in import_module
        return _bootstrap._gcd_import(name[level:], package, level)
      File "", line 1014, in _gcd_import
      File "", line 991, in _find_and_load
      File "", line 975, in _find_and_load_unlocked
      File "", line 671, in _load_unlocked
      File "", line 783, in exec_module
      File "", line 219, in _call_with_frames_removed
      File "/home/aide/app/ai/models/detectron2/__init__.py", line 6, in <module>
        from .labels.torchvisionClassifier.torchvisionClassifier import GeneralizedTorchvisionClassifier
      File "/home/aide/app/ai/models/detectron2/labels/torchvisionClassifier/__init__.py", line 6, in <module>
        from . import meta
      File "/home/aide/app/ai/models/detectron2/labels/torchvisionClassifier/meta.py", line 7, in <module>
        from detectron2.data import MetadataCatalog
      File "/opt/conda/lib/python3.8/site-packages/detectron2/data/__init__.py", line 4, in <module>
        from .build import (
      File "/opt/conda/lib/python3.8/site-packages/detectron2/data/build.py", line 12, in <module>
        from detectron2.structures import BoxMode
      File "/opt/conda/lib/python3.8/site-packages/detectron2/structures/__init__.py", line 7, in <module>
        from .masks import BitMasks, PolygonMasks, polygons_to_bitmask
      File "/opt/conda/lib/python3.8/site-packages/detectron2/structures/masks.py", line 9, in <module>
        from detectron2.layers.roi_align import ROIAlign
      File "/opt/conda/lib/python3.8/site-packages/detectron2/layers/__init__.py", line 3, in <module>
        from .deform_conv import DeformConv, ModulatedDeformConv
      File "/opt/conda/lib/python3.8/site-packages/detectron2/layers/deform_conv.py", line 11, in <module>
        from detectron2 import _C
    ImportError: libc10_cuda.so: cannot open shared object file: No such file or directory

During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/opt/conda/lib/python3.8/site-packages/celery/app/trace.py", line 450, in trace_task
        R = retval = fun(*args, **kwargs)
      File "/opt/conda/lib/python3.8/site-packages/celery/app/trace.py", line 731, in __protected_call__
        return self.run(*args, **kwargs)
      File "/home/aide/app/modules/ModelMarketplace/backend/celery_interface.py", line 37, in import_model_uri
        return worker.importModelURI(project, username, modelURI, public, anonymous, forceReimport, namePolicy, customName)
      File "/home/aide/app/modules/ModelMarketplace/backend/marketplaceWorker.py", line 491, in importModelURI
        return self.importModelFile(project, username, modelState, modelURI, public, anonymous, namePolicy, customName)
      File "/home/aide/app/modules/ModelMarketplace/backend/marketplaceWorker.py", line 535, in importModelFile
        return self._import_model_state_file(project, fileName, modelState, stateDict, public, anonymous, namePolicy, customName)
      File "/home/aide/app/modules/ModelMarketplace/backend/marketplaceWorker.py", line 375, in _import_model_state_file
        raise Exception(f'Model from imported state could not be launched (message: "{str(e)}").')
    Exception: Model from imported state could not be launched (message: "libc10_cuda.so: cannot open shared object file: No such file or directory").

Does anyone have any ideas about how to resolve this error? I tried installing CUDA 10 directly into the Docker container, but then I hit driver version mismatch errors.

Many thanks, Matthew
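A quick way to narrow this down is to check whether the torch package installed in the container actually bundles `libc10_cuda.so`. The following is a minimal sketch, not part of AIDE: it assumes a pip or conda install in which CUDA builds of PyTorch keep their shared libraries in the package's `lib` folder, and the function name `describe_torch_install` is made up for illustration. It only uses the standard library and does not import torch itself.

```python
import importlib.metadata
import importlib.util
from pathlib import Path


def describe_torch_install(lib_name="libc10_cuda.so"):
    """Report the installed torch version and whether lib_name sits in <torch>/lib.

    Hypothetical helper for illustration; the <torch>/lib layout is an assumption.
    """
    spec = importlib.util.find_spec("torch")
    if spec is None or not spec.submodule_search_locations:
        # torch is not installed in this environment at all.
        return {"installed": False, "version": None, "lib_found": False}

    try:
        # For CUDA wheels this typically carries a local tag such as "+cu111".
        version = importlib.metadata.version("torch")
    except importlib.metadata.PackageNotFoundError:
        version = None

    lib_dir = Path(list(spec.submodule_search_locations)[0]) / "lib"
    return {
        "installed": True,
        "version": version,
        "lib_found": (lib_dir / lib_name).is_file(),
    }


if __name__ == "__main__":
    print(describe_torch_install())
```

If `lib_found` comes back `False` while detectron2 was compiled against a CUDA build, that would be consistent with the `ImportError` above: the compiled `detectron2._C` extension expects a library that the currently installed torch does not provide.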

bkellenb commented 3 years ago

Hello Matthew,

I recently also noticed issues after software updates of PyTorch & Co. and mostly resolved them by trial and error. One reason could be that pip automatically tries to upgrade the packages during installation (the `-U` flag on line 52 of the Dockerfile). One of the following two options might help:

  1. Specify the latest PyTorch version in the requirements.txt file (lines 17f; requires a CUDA 11-capable GPU):

    # old:
    torch>=1.6.0
    torchvision>=0.7.0

    # new:
    torch==1.9.0+cu111
    torchvision==0.10.0+cu111

  2. If option 1 does not work, you may try keeping the packages at the indicated versions by removing the `-U` flag. To do so, replace line 52 of the [Dockerfile](https://github.com/microsoft/aerial_wildlife_detection/blob/master/docker/Dockerfile#L52) as follows:

```Dockerfile
# old:
RUN pip install -U -r docker/requirements.txt

# new:
RUN pip install -r docker/requirements.txt
```

By the way: PyTorch should come with its own version of CUDA (and cuDNN) built-in; there should be no need to manually specify a CUDA Docker container. That said, the default Docker container (line 1 in the Dockerfile) already specifies CUDA 11.0; if in doubt it may be worth changing the starting container as well (e.g. to a base Ubuntu 20.04 LTS one).
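When pinning as in option 1, it can help to confirm that torch and torchvision carry the same local version tag (here `cu111`), since mixing builds is one plausible way to end up with missing CUDA libraries. The sketch below is only an illustration under that assumption; `local_tag` and `same_cuda_build` are hypothetical helper names, and the parsing handles only simple `name==version+tag` pins.

```python
def local_tag(requirement):
    """Return the local version tag of a 'name==version+tag' pin, or None."""
    _, _, version = requirement.strip().partition("==")
    _, plus, tag = version.partition("+")
    return tag if plus else None


def same_cuda_build(requirements):
    """True if every pin carries the same non-empty local tag (e.g. 'cu111')."""
    tags = {local_tag(r) for r in requirements}
    return len(tags) == 1 and None not in tags


pins = ["torch==1.9.0+cu111", "torchvision==0.10.0+cu111"]
print(same_cuda_build(pins))  # True

# Unpinned ranges carry no build tag, so nothing ties them to one CUDA build.
print(same_cuda_build(["torch>=1.6.0", "torchvision>=0.7.0"]))  # False
```

Note that wheels with a `+cu…` tag are usually served from PyTorch's own wheel index rather than PyPI, so pip may need to be pointed there for these pins to resolve.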

MattSkiff commented 3 years ago

Thank you Ben, this pointed me in the right direction.

For reference: applying the above shifted the error to a different missing CUDA file (libtorch_cuda_cu.so). I then fixed it by changing the base image to the newest version of PyTorch (pytorch/pytorch:1.9.0-cuda11.1-cudnn8-devel) and by upgrading detectron2 in the Docker image to 0.4, installing it directly from GitHub.

I also noticed my Python version is now 3.7.10 in the container, while previously it was 3.8 (which may have contributed to the issue - the .travis.yml file includes Python 3.5-3.7).

Update: after reinstalling the entire project again, I found that only Ben's first suggestion and updating detectron2 (by reinstalling it in the container directly from GitHub) were necessary. Somehow, this also fixed a separate issue I was having, which was that the screen to create a new account was not appearing.
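To confirm the fix inside the container, one option is to try the same imports that failed in the traceback and report any error instead of crashing. This is a rough sketch rather than anything shipped with AIDE; `try_import` is a hypothetical helper, and the module list is taken from the traceback above.

```python
import importlib


def try_import(module_name):
    """Import module_name and return (ok, message) instead of raising."""
    try:
        importlib.import_module(module_name)
    except ImportError as exc:
        # A missing shared object (e.g. libc10_cuda.so) surfaces as ImportError.
        return False, str(exc)
    return True, "ok"


if __name__ == "__main__":
    # detectron2._C is the compiled extension that failed to load originally.
    for name in ("torch", "torchvision", "detectron2", "detectron2._C"):
        ok, message = try_import(name)
        print(f"{name}: {'OK' if ok else 'FAILED'} ({message})")
```

If `detectron2._C` still fails after torch has been changed, detectron2 was probably built against the previous torch version and needs to be reinstalled, which matches what worked here.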