microsoft / aerial_wildlife_detection

Tools for detecting wildlife in aerial images using active learning
MIT License

Manually stop training without using UI #50

Closed MattSkiff closed 3 years ago

MattSkiff commented 3 years ago

Is it possible to stop training without using the UI? The workflow manager shows no training in progress, yet the project's terminal output shows that the model is still training. I would like to force AIDE to stop training the model, but when I kill or stop the container (either via docker or from inside the running container using e.g. ./AIDE.sh stop), the training process keeps restarting.

Do you know how this can be done?

Cheers, Matthew

MattSkiff commented 3 years ago

I figured this out but thought I would include it in the issues page in case this happens to anyone else (I hope that's ok!).

To solve this, I entered the docker image, force-killed celery only (i.e. pkill -f celery, as per the documentation), and then committed the docker image, i.e. docker commit image_name aide_app (yes, this is not good practice). On the next launch of the docker image, this prevented the training loop from restarting.

bkellenb commented 3 years ago

Hi Matthew,

The fundamental issue behind this bug is that Celery does not always seem to listen to the revoke command (which is issued when a training job is to be stopped). I suspect this is due to the long-running nature of the deep learning models. Unfortunately I have yet to find a solution to definitively stop running jobs, even if forcefully.

MattSkiff commented 3 years ago

Thank you Ben, I don't have the experience to contribute on that front I'm afraid (only as a user).

Regarding my initial solution: the docker commit image_name aide_app step and the restart of the container are not necessary. The launch_celery.sh script will reconnect the worker, so the UI can be used to launch further workflows.
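
For anyone who lands here with the same problem, the workaround described in this thread can be sketched roughly as below. This is only an illustration of the two steps mentioned above (pkill -f celery, then launch_celery.sh), meant to be run inside the AIDE container. The CELERY_LAUNCHER variable, the DRY_RUN switch, and the run and stop_training helpers are hypothetical names introduced for this sketch, and the default path to launch_celery.sh is an assumption that will likely need adjusting for your installation.

```shell
#!/bin/sh
# Sketch: stop a stuck AIDE training job by restarting the Celery worker.
# Intended to be run inside the AIDE container.
# ASSUMPTION: the path below is a guess; point CELERY_LAUNCHER at wherever
# launch_celery.sh lives in your installation.
CELERY_LAUNCHER="${CELERY_LAUNCHER:-./launch_celery.sh}"

# Hypothetical helper: print the command when DRY_RUN=1, otherwise run it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Hypothetical helper wrapping the two steps from the thread.
stop_training() {
    # Kill every process whose command line contains "celery".
    # pkill exits non-zero when nothing matched; that is not an error here.
    run pkill -f celery || true
    # Relaunch the worker so the UI can start new workflows again.
    run sh "$CELERY_LAUNCHER"
}
```

Source the file and call stop_training with DRY_RUN=1 set first to see which commands would run before executing them for real. Note that pkill -f celery matches on the full command line, so it stops every Celery process in the container, not just the training job.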