Open · frafra opened 3 years ago
Hello Francesco,
Thanks for opening the issue. I tried to reproduce it but could not trigger the problem in my tests.
ai.models.detectron2.AlexNet is a prediction model, not a ranker (AL criterion). It is strange that this gets registered under rankers, because the respective file clearly lists it correctly as a prediction model.
Is this a new Docker installation, or do you perhaps have an existing volume and project where, for some reason, an incorrect configuration entry has been made in the past?
Hi :) I updated my message. I wonder if that "illegal instruction" could be related to the lack of AVX support in my CPU, actually.
Disabling models allows me to run the container: https://github.com/microsoft/aerial_wildlife_detection/blob/f4d2862f3d46a8dff96cda4ae9cc339241b5c495/modules/AIController/backend/middleware.py#L38
Disabling models allows me to run the container:
This will bypass the import of available AI models. You will be able to use AIDE as a labeling tool, but not for training models (none will be visible). It might well be that PyTorch has a conflict on your machine (see below); in this case the reason why disabling models works is that no (Py-) Torch modules are imported this way.
I wonder if that "illegal instruction" could be related to the lack of AVX support in my CPU, actually.
This could be the culprit indeed. Browsing through PyTorch's issue tracker I see multiple mentions of this problem, with AVX2 instructions repeatedly finding their way into the code base.
I do not have a CPU at hand that lacks AVX (or AVX2) support to try things out, but perhaps you can get it to work with a different version of PyTorch? Detectron2 lists 1.6.0 as the minimal PyTorch version, but perhaps a newer one resolves this issue? This should be straightforward to check in a fresh Conda or virtualenv environment by simply calling import torch from within a Python session.
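In addition, a quick way to test the AVX hypothesis before reinstalling anything is to read the CPU flags directly. The following is a minimal sketch (Linux only, not part of AIDE) that reports whether the CPU advertises AVX/AVX2, the instruction sets PyTorch wheels are commonly built with:

```python
# Report whether this CPU advertises AVX/AVX2 support (Linux only).
# A PyTorch wheel compiled with AVX2 dies with "Illegal instruction"
# (SIGILL, shell exit code 132) on CPUs lacking these flags.
def cpu_flags():
    flags = set()
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                flags.update(line.split(":", 1)[1].split())
    return flags

if __name__ == "__main__":
    flags = cpu_flags()
    print("AVX: ", "avx" in flags)
    print("AVX2:", "avx2" in flags)
```

If AVX2 is reported as False, a PyTorch build compiled without AVX2 (or an older/newer official wheel) would be the thing to try.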
Hi and nice to meet you.
FYI, I ran into the same issue on a local VirtualBox Ubuntu 18.04 instance. It does not occur on my other two VM instances: an Ubuntu 18.04 instance provided by AllianceCan.ca (OpenStack), which serves as my lab box #2, and a GCP instance, which is the production box. My GCP and VBox instances have 4 GB of memory; my AllianceCan.ca instance has 3 GB.
Fault in util/helpers.py:get_class_executable, at execFile = importlib.import_module(classPath). classPath is ai.models.detectron2; executableName is AlexNet.
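For context, the failing helper boils down to a dynamic import. Below is a simplified sketch of what util/helpers.py:get_class_executable appears to do, reconstructed from the stack trace above (the exact logic in the repository may differ):

```python
import importlib

def get_class_executable(path):
    # Split "ai.models.detectron2.AlexNet" into a module path and a class
    # name, then import the module dynamically. With a PyTorch build that
    # uses unsupported CPU instructions, import_module() is where the
    # process dies with SIGILL, before any Python-level error is raised.
    class_path, executable_name = path.rsplit(".", 1)
    exec_file = importlib.import_module(class_path)
    return getattr(exec_file, executable_name)
```

Importing ai.models.detectron2 transitively imports torch, which is why the crash surfaces here even though the helper itself is harmless.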
I did not investigate much, since this does not block me: it works in my other instances (GCP and AllianceCan.ca), and we do not use this module yet anyway; we only use the annotation interface. I disabled the AIController module (AIDE_MODULES).
I tried wrapping the statement in try / except, but no exception is thrown; the answer is probably inside the 'ai.models.detectron2.AlexNet' package.
The actual call that crashes is return _bootstrap._gcd_import(name[level:], package, level) inside def import_module(name, package=None):, but I cannot tell which package is responsible, probably the standard Python library's import machinery.
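A try/except cannot work here: SIGILL is a fatal signal, so the interpreter is killed before any Python exception exists. One hedged workaround (a sketch, not AIDE code) is to probe the import in a separate process and inspect the exit status, since a fatal signal shows up as 128 + signal number in the shell (132 = 128 + SIGILL):

```python
import subprocess
import sys

def probe_import(module_name):
    # Run "import <module>" in a fresh interpreter; a crash from an
    # illegal instruction then kills only the child process.
    # Return code 0 means the import succeeded; a negative return code
    # from subprocess means the child died from that signal
    # (e.g. -4 = SIGILL, which a shell would report as 132).
    result = subprocess.run(
        [sys.executable, "-c", f"import {module_name}"],
        capture_output=True,
    )
    return result.returncode
```

Something like probe_import("torch") could let AIDE detect an unusable PyTorch build at startup and fall back to labeling-only mode instead of crashing.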
Reproduced in Docker on the VBox instance using either base image:
FROM pytorch/pytorch:1.7.1-cuda11.0-cudnn8-devel
FROM pytorch/pytorch:1.9.0-cuda11.1-cudnn8-devel
It also reproduces in dev mode with PyCharm (remote debug to the local VBox instance), or directly from the command line on the VBox in a Conda venv.
ssh://vince@127.0.0.1:7722/home/vince/anaconda3/envs/aide/bin/python3 -u /home/vince/.pycharm_helpers/pydev/pydevd.py --multiprocess --qt-support=auto --client 127.0.0.1 --port 44833 --file setup/assemble_server.py --launch=1 --check_v1=0 --migrate_db=0 --force_migrate=0 --verbose=1
I forked main/master version 2 around June 2021 and did not sync with the latest changes (too many changes and conflicts). I have attached some screenshots showing the code, stack trace, and AIDE startup screen, with exit code 132.
Hope that helps a bit! If you need anything, let me know.
Best regards,
Vincent
I tested with the latest main branch (v2.0) on my Ubuntu 18.04 LTS VirtualBox VM (no CUDA/GPU) in remote debug mode with PyCharm (not Docker), and got the same issue: when the AIController module is enabled, the error occurs.
I tested with the v3.0 branch and installed the non-CUDA builds of torch and torchvision (torch==1.12.1 and torchvision==0.13.1 instead of torch==1.12.1+cu113 and torchvision==0.13.1+cu113) on the same VM, in remote debug mode with PyCharm (not Docker), and there is no issue; it starts fine.
I just ran
cd docker && docker-compose up
on master. The software seems to crash here: https://github.com/microsoft/aerial_wildlife_detection/blob/08954b5cf7855a0bdfcc0d02b5fc2ba1a25726c1/util/helpers.py#L100
path is set to ai.models.detectron2.AlexNet.