MattSkiff closed this issue 3 years ago
Hello Matthew,
I recently also noticed issues with software updates of PyTorch & Co. and mostly resolved them by trial and error.
One reason could be that pip tries to automatically upgrade the packages during installation (flag `-U` in line 52 of the Dockerfile).
One of the following two options might help:
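1. Pin PyTorch and torchvision to specific versions in the requirements file (docker/requirements.txt):
# old:
torch>=1.6.0
torchvision>=0.7.0
# new:
torch==1.9.0+cu111
torchvision==0.10.0+cu111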
2. If solution 1 does not work: you may try fixing the packages to the indicated versions by removing the `-U` flag. To do so, replace line 52 of the [Dockerfile](https://github.com/microsoft/aerial_wildlife_detection/blob/master/docker/Dockerfile#L52) as follows:
```Dockerfile
# old:
RUN pip install -U -r docker/requirements.txt
# new:
RUN pip install -r docker/requirements.txt
```
By the way: PyTorch should come with its own version of CUDA (and cuDNN) built-in; there should be no need to manually specify a CUDA Docker container. That said, the default Docker container (line 1 in the Dockerfile) already specifies CUDA 11.0; if in doubt it may be worth changing the starting container as well (e.g. to a base Ubuntu 20.04 LTS one).
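To see which versions actually ended up in the container after a rebuild, a quick check along these lines may help. This is only a minimal sketch: the helper names `module_version` and `torch_cuda_summary` are made up for illustration, and it relies only on the `__version__` attribute, `torch.version.cuda` and `torch.cuda.is_available()`.

```python
import importlib


def module_version(name):
    """Return the module's __version__ string, or None if it cannot be imported."""
    try:
        module = importlib.import_module(name)
    except (ImportError, OSError):
        return None
    return getattr(module, "__version__", "unknown")


def torch_cuda_summary():
    """Return the CUDA build and availability reported by torch, or None without torch."""
    try:
        import torch
    except (ImportError, OSError):
        return None
    return {
        # CUDA version the wheel was built against (None for CPU-only builds)
        "cuda_build": torch.version.cuda,
        # True only if a usable GPU and driver are visible inside the container
        "cuda_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    for name in ("torch", "torchvision"):
        print(name, module_version(name))
    print(torch_cuda_summary())
```

If the version printed for torch does not match the pinned one, the `-U` flag (or a later install step) has probably replaced it.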
Thank you Ben, this pointed me in the right direction.
For reference: the steps above shifted the error to a different missing CUDA file (libtorch_cuda_cu.so). I fixed that by changing the base image to the newest PyTorch image (pytorch/pytorch:1.9.0-cuda11.1-cudnn8-devel) and by upgrading detectron2 in the Docker image to 0.4, installing it directly from GitHub.
I also noticed my Python version is now 3.7.10 in the container, while previously it was 3.8 (which may have contributed to the issue - the .travis.yml file includes Python 3.5-3.7).
Update: after reinstalling the entire project from scratch, I found that only two things were necessary: Ben's first suggestion, and updating detectron2 by reinstalling it in the container directly from GitHub. Somehow, this also fixed a separate issue I was having, where the screen to create a new account was not appearing.
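For anyone repeating this, one rough way to confirm the reinstall worked is to try importing the compiled detectron2 extension from inside the container, since that is the import the traceback in this issue fails on. A minimal sketch (the helper name `try_import` is just for illustration):

```python
import importlib


def try_import(module_name):
    """Import module_name; return (True, None) on success or (False, error message)."""
    try:
        importlib.import_module(module_name)
    except ImportError as exc:
        return False, str(exc)
    return True, None


if __name__ == "__main__":
    # detectron2._C is the compiled extension that needs the CUDA shared libraries.
    for name in ("torch", "torchvision", "detectron2", "detectron2._C"):
        ok, error = try_import(name)
        print(f"{name}: {'ok' if ok else 'FAILED: ' + error}")
```

If `detectron2._C` still fails with a missing `.so` file, detectron2 was most likely built against a different PyTorch/CUDA combination than the one now installed.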
Hi there,
I've run the dockerised installation and I hit this issue whenever I try to import a model.
Otherwise everything works fine. Does anyone have any ideas about what is going on here? Looking over the output from the docker install (and the makefile), it seems like it installs CUDA 11.0? But the models included with AIDE require CUDA 10.0?
The full error I am getting is:
  File "/home/aide/app/modules/ModelMarketplace/backend/marketplaceWorker.py", line 354, in _import_model_state_file
    modelClass = get_class_executable(modelLibrary)
  File "/home/aide/app/util/helpers.py", line 100, in get_class_executable
    execFile = importlib.import_module(classPath)
  File "/opt/conda/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 783, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/home/aide/app/ai/models/detectron2/__init__.py", line 6, in <module>
    from .labels.torchvisionClassifier.torchvisionClassifier import GeneralizedTorchvisionClassifier
  File "/home/aide/app/ai/models/detectron2/labels/torchvisionClassifier/__init__.py", line 6, in <module>
    from . import meta
  File "/home/aide/app/ai/models/detectron2/labels/torchvisionClassifier/meta.py", line 7, in <module>
    from detectron2.data import MetadataCatalog
  File "/opt/conda/lib/python3.8/site-packages/detectron2/data/__init__.py", line 4, in <module>
    from .build import (
  File "/opt/conda/lib/python3.8/site-packages/detectron2/data/build.py", line 12, in <module>
    from detectron2.structures import BoxMode
  File "/opt/conda/lib/python3.8/site-packages/detectron2/structures/__init__.py", line 7, in <module>
    from .masks import BitMasks, PolygonMasks, polygons_to_bitmask
  File "/opt/conda/lib/python3.8/site-packages/detectron2/structures/masks.py", line 9, in <module>
    from detectron2.layers.roi_align import ROIAlign
  File "/opt/conda/lib/python3.8/site-packages/detectron2/layers/__init__.py", line 3, in <module>
    from .deform_conv import DeformConv, ModulatedDeformConv
  File "/opt/conda/lib/python3.8/site-packages/detectron2/layers/deform_conv.py", line 11, in <module>
    from detectron2 import _C
ImportError: libc10_cuda.so: cannot open shared object file: No such file or directory
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/celery/app/trace.py", line 450, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/celery/app/trace.py", line 731, in __protected_call__
    return self.run(*args, **kwargs)
  File "/home/aide/app/modules/ModelMarketplace/backend/celery_interface.py", line 37, in import_model_uri
    return worker.importModelURI(project, username, modelURI, public, anonymous, forceReimport, namePolicy, customName)
  File "/home/aide/app/modules/ModelMarketplace/backend/marketplaceWorker.py", line 491, in importModelURI
    return self.importModelFile(project, username, modelState, modelURI, public, anonymous, namePolicy, customName)
  File "/home/aide/app/modules/ModelMarketplace/backend/marketplaceWorker.py", line 535, in importModelFile
    return self._import_model_state_file(project, fileName, modelState, stateDict, public, anonymous, namePolicy, customName)
  File "/home/aide/app/modules/ModelMarketplace/backend/marketplaceWorker.py", line 375, in _import_model_state_file
    raise Exception(f'Model from imported state could not be launched (message: "{str(e)}").')
Exception: Model from imported state could not be launched (message: "libc10_cuda.so: cannot open shared object file: No such file or directory").
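One way to narrow down a "cannot open shared object file" error like this is to check whether the library exists anywhere in the container at all. The snippet below is a minimal sketch, not part of AIDE: `find_library` is a made-up helper, `/opt/conda/lib` is taken from the paths in the traceback, and `/usr/local/cuda` is only an assumed location for a system-wide CUDA install.

```python
from pathlib import Path


def find_library(filename, roots):
    """Return sorted paths of files named `filename` under any of the root directories."""
    matches = []
    for root in roots:
        root = Path(root)
        if root.is_dir():  # silently skip roots that do not exist
            matches.extend(p for p in root.rglob(filename) if p.is_file())
    return sorted(matches)


if __name__ == "__main__":
    # Search roots are assumptions; adjust them to the actual container layout.
    hits = find_library("libc10_cuda.so", ["/opt/conda/lib", "/usr/local/cuda"])
    if hits:
        for path in hits:
            print(path)
    else:
        print("libc10_cuda.so not found; the installed torch may be a CPU-only build")
```

If nothing is found, the installed PyTorch build probably does not ship the CUDA libraries that detectron2 was compiled against.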
Does anyone have any ideas about how to resolve this error? I tried installing CUDA10 directly into the docker container, but I hit driver version mismatch errors.
Many thanks, Matthew