microsoft / aerial_wildlife_detection

Tools for detecting wildlife in aerial images using active learning
MIT License

AIDE.sh: line 10: 1647 Illegal instruction (core dumped) #44

Open · frafra opened this issue 3 years ago

frafra commented 3 years ago

I just ran `cd docker && docker-compose up` on master.

aide_app_1  | Synchronizing state of redis-server.service with SysV service script with /lib/systemd/systemd-sysv-install.
aide_app_1  | Executing: /lib/systemd/systemd-sysv-install enable redis-server
aide_app_1  | Starting redis-server: /etc/init.d/redis-server: 51: ulimit: error setting limit (Operation not permitted)
aide_app_1  | redis-server.
aide_app_1  | =============================
aide_app_1  | Setup of database IS STARTING
aide_app_1  | =============================
aide_app_1  |  * Restarting PostgreSQL 10 database server
aide_app_1  |    ...done.
aide_app_1  | CREATE ROLE
aide_app_1  | CREATE DATABASE
aide_app_1  | GRANT
aide_app_1  | CREATE EXTENSION
aide_app_1  | GRANT
aide_app_1  | Synchronizing state of postgresql.service with SysV service script with /lib/systemd/systemd-sysv-install.
aide_app_1  | Executing: /lib/systemd/systemd-sysv-install enable postgresql
aide_app_1  |  * Starting PostgreSQL 10 database server
aide_app_1  |    ...done.
aide_app_1  | ==============================
aide_app_1  | Setup of database IS COMPLETED
aide_app_1  | ==============================
aide_app_1  | 
aide_app_1  | ==========================
aide_app_1  | RABBITMQ SETUP IS STARTING
aide_app_1  | ==========================
aide_app_1  |  * Starting RabbitMQ Messaging Server rabbitmq-server
aide_app_1  |    ...done.
aide_app_1  | Creating user "aide"
aide_app_1  | Creating vhost "aide_vhost"
aide_app_1  | Setting permissions for user "aide" in vhost "aide_vhost"
aide_app_1  | Synchronizing state of rabbitmq-server.service with SysV service script with /lib/systemd/systemd-sysv-install.
aide_app_1  | Executing: /lib/systemd/systemd-sysv-install enable rabbitmq-server
aide_app_1  | ===========================
aide_app_1  | RABBITMQ SETUP IS COMPLETED
aide_app_1  | ===========================
aide_app_1  | 
aide_app_1  | sysctl: setting key "net.ipv4.tcp_keepalive_time": Read-only file system
aide_app_1  | sysctl: setting key "net.ipv4.tcp_keepalive_intvl": Read-only file system
aide_app_1  | sysctl: setting key "net.ipv4.tcp_keepalive_probes": Read-only file system
aide_app_1  |  
aide_app_1  |  -------------- aide@aide_app_host v5.1.0 (sun-harmonics)
aide_app_1  | --- ***** ----- 
aide_app_1  | -- ******* ---- Linux-4.15.0-143-generic-x86_64-with-glibc2.10 2021-05-27 11:53:51
aide_app_1  | - *** --- * --- 
aide_app_1  | - ** ---------- [config]
aide_app_1  | - ** ---------- .> app:         AIDE:0x7fdb3c06abe0
aide_app_1  | - ** ---------- .> transport:   amqp://aide:**@localhost:5672/aide_vhost
aide_app_1  | - ** ---------- .> results:     redis://localhost:6379/0
aide_app_1  | - *** --- * --- .> concurrency: 4 (prefork)
aide_app_1  | -- ******* ---- .> task events: OFF (enable -E to monitor tasks in this worker)
aide_app_1  | --- ***** ----- 
aide_app_1  |  -------------- [queues]
aide_app_1  |                 .> AIController     exchange=celery(direct) key=celery
aide_app_1  |                 .> AIWorker         exchange=celery(direct) key=celery
aide_app_1  |                 .> FileServer       exchange=celery(direct) key=celery
aide_app_1  |                 .> ModelMarketplace exchange=celery(direct) key=celery
aide_app_1  |                 .> aide@aide_app_host exchange=celery(direct) key=celery
aide_app_1  |                 .> bcast.a900bb2d-be20-474a-b1a2-b345bed7ece5 exchange=aide_broadcast(fanout) key=celery
aide_app_1  | 
aide_app_1  | AIDE.sh: line 10:  1647 Illegal instruction     (core dumped) python setup/assemble_server.py --migrate_db 1
aide_app_1  | Pre-flight checks failed; aborting launch of AIDE.
docker_aide_app_1 exited with code 0

The software seems to crash here: https://github.com/microsoft/aerial_wildlife_detection/blob/08954b5cf7855a0bdfcc0d02b5fc2ba1a25726c1/util/helpers.py#L100

path is set to ai.models.detectron2.AlexNet.
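
For anyone trying to narrow this down outside of AIDE, here is a minimal sketch of the failing call, assuming get_class_executable essentially boils down to importlib.import_module on the module path plus a getattr for the class name (my reading of the linked code, not a verbatim copy of helpers.py):

```python
# Minimal sketch; run from the AIDE repository root with its dependencies installed.
# Assumption: util/helpers.py:get_class_executable roughly does import_module + getattr.
# The suspicion in this thread is that the (Py)Torch import pulled in by
# ai.models.detectron2 triggers the "Illegal instruction" on CPUs without AVX.
import importlib

path = 'ai.models.detectron2.AlexNet'          # value reported in the log above
module_path, class_name = path.rsplit('.', 1)  # 'ai.models.detectron2', 'AlexNet'
module = importlib.import_module(module_path)  # the process appears to die here
model_class = getattr(module, class_name)
print(model_class)
```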

bkellenb commented 3 years ago

Hello Francesco,

Thanks for opening the issue. I tried to reproduce it but could not find any problem in my tests. ai.models.detectron2.AlexNet is a prediction model, not a ranker (AL criterion). It is strange that it gets registered under rankers, because the respective file correctly lists it as a prediction model.

Is this a new Docker installation, or do you perhaps have an existing volume and project where, for some reason, an incorrect configuration entry has been made in the past?

frafra commented 3 years ago

Hi :) I updated my message. I wonder if that "illegal instruction" could be related to the lack of AVX support in my CPU, actually.

frafra commented 3 years ago

Disabling models allows me to run the container: https://github.com/microsoft/aerial_wildlife_detection/blob/f4d2862f3d46a8dff96cda4ae9cc339241b5c495/modules/AIController/backend/middleware.py#L38

bkellenb commented 3 years ago

> Disabling models allows me to run the container

This will bypass the import of the available AI models. You will be able to use AIDE as a labeling tool, but not for training models (none will be visible). It may well be that PyTorch has a conflict on your machine (see below); in that case, the reason disabling models works is that no (Py)Torch modules get imported at all.

> I wonder if that "illegal instruction" could be related to the lack of AVX support in my CPU, actually.

This could indeed be the culprit. Browsing through PyTorch's issue tracker, I see multiple mentions of this problem, with AVX2 instructions repeatedly finding their way into the code base.

I do not have a CPU at hand that lacks AVX (or AVX2) support to try things out, but perhaps you can get it to work with a different version of PyTorch? Detectron2 lists 1.6.0 as the minimum PyTorch version, but perhaps a newer one resolves this issue. This should be straightforward to check in a fresh Conda or virtualenv environment by simply calling `import torch` from within a Python session.
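
Something along these lines could be used to check both points at once; a rough sketch, assuming a Linux host where /proc/cpuinfo is readable and torch is installed in the active environment:

```python
# Rough check (assumes Linux and a torch install in the current environment):
# 1) does the CPU advertise AVX/AVX2?  2) does importing torch survive?
# If torch was built with AVX(2) kernels and the CPU lacks them, the import
# itself may be killed with an "Illegal instruction" (SIGILL).
with open('/proc/cpuinfo') as f:
    flags = set()
    for line in f:
        if line.startswith('flags'):
            flags.update(line.split(':', 1)[1].split())
            break

print('AVX supported: ', 'avx' in flags)
print('AVX2 supported:', 'avx2' in flags)

import torch  # a crash here would support the AVX hypothesis
print('torch', torch.__version__, 'imported fine')
```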

vince7lf commented 2 years ago

Hi and nice to meet you.

FYI, I ran into the same issue on a local VirtualBox Ubuntu 18.04 instance. It does not occur on my other two VM instances: an Ubuntu 18.04 instance provided by AllianceCan.ca (OpenStack), which serves as my second lab box, and a GCP instance, which is the production box. My GCP and VirtualBox instances have 4 GB of memory; my AllianceCan.ca instance has 3 GB.

The fault is in util/helpers.py, get_class_executable, at `execFile = importlib.import_module(classPath)`, where classPath is ai.models.detectron2 and executableName is AlexNet.

I did not investigate much, since this does not block me as long as it works in my other instances (GCP and AllianceCan.ca). We do not use this module yet anyway; we are only using the annotation interface. I disabled the AIController module (via AIDE_MODULES).

I tried wrapping the statement in a try/except block, but no exception is thrown, so there is nothing to catch. The answer probably lies in the package 'ai.models.detectron2.AlexNet'.

The actual function that crashes is `return _bootstrap._gcd_import(name[level:], package, level)` inside `def import_module(name, package=None):`, but I cannot tell which package is responsible; that function itself is part of the standard Python library.
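
For what it's worth, a small sketch of why try/except cannot help here: an illegal instruction is a SIGILL signal delivered to the whole process rather than a Python exception. Running the suspect import in a child process makes the signal visible without killing the parent (a shell reports this as exit status 132 = 128 + SIGILL):

```python
# Sketch: demonstrate that the crash is a signal (SIGILL), not a catchable exception.
# Run from the AIDE repository root so that ai.models.detectron2 is importable.
import signal
import subprocess
import sys

proc = subprocess.run([sys.executable, '-c', 'import ai.models.detectron2'])
if proc.returncode == -signal.SIGILL:
    print('child crashed with SIGILL (illegal instruction)')
else:
    print('child exited with code', proc.returncode)
```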

Reproduced in Docker on the VirtualBox instance using either of these base images:
FROM pytorch/pytorch:1.7.1-cuda11.0-cudnn8-devel
FROM pytorch/pytorch:1.9.0-cuda11.1-cudnn8-devel

It also reproduces simply in dev mode with PyCharm (remote debugging into the local VirtualBox instance) or directly on the VirtualBox command line from a conda environment.

ssh://vince@127.0.0.1:7722/home/vince/anaconda3/envs/aide/bin/python3 -u /home/vince/.pycharm_helpers/pydev/pydevd.py --multiprocess --qt-support=auto --client 127.0.0.1 --port 44833 --file setup/assemble_server.py --launch=1 --check_v1=0 --migrate_db=0 --force_migrate=0 --verbose=1

I forked main/master (version 2) around June 2021 and have not synced with the latest changes (too many changes and conflicts). I have attached some screenshots showing the code, the stack trace, and the AIDE startup screen, with exit code 132.

Hope that helps a bit! If you need anything, let me know.

Best regards,

Vincent

(Screenshots attached: 2022-08-04 08_36_02, 2022-08-04 08_35_36, 2022-08-04 08_35_03, 2022-08-04 08_27_09)

vince7lf commented 1 year ago

I tested with the latest main branch (v2.0) on my Ubuntu 18.04 LTS VirtualBox VM (no CUDA/GPU) in remote debug mode with PyCharm (not Docker), and I get the same issue: when the AIController module is enabled, the error occurs.

vince7lf commented 1 year ago

I tested with the v3.0 branch and installed the non-CUDA build of torch and torchvision (torch==1.12.1 torchvision==0.13.1 instead of torch==1.12.1+cu113 torchvision==0.13.1+cu113) on the same Ubuntu 18.04 LTS VirtualBox VM (no CUDA/GPU), in remote debug mode with PyCharm (not Docker), and there is no issue; it starts up fine.