[Closed] MattSkiff closed this issue 3 years ago
I figured this out but thought I would include it in the issues page in case this happens to anyone else (I hope that's ok!).
To solve this, I entered the Docker container, force-killed Celery only (i.e. pkill -f celery
as per the documentation) and then committed the container, i.e. docker commit image_name aide_app
(yes, I know this is not good practice). On the next launch of the Docker image, this prevented the training loop from restarting.
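For anyone else trying this, here is a minimal sketch of the force-kill step, using a harmless stand-in process rather than a real Celery worker (the dummy `sleep` and the `[s]` regex trick are purely for safe illustration; inside the AIDE container the pattern would simply be `celery`):

```shell
# Stand-in demonstration of the force-kill step. In the real container the
# pattern would be 'celery'; here we use a harmless dummy process instead.
sh -c 'exec sleep 300' &               # stand-in for a stuck worker process
worker_pid=$!
pkill -f '[s]leep 300'                 # -f matches against the full command line;
                                       # the [s] regex trick avoids self-matching
wait "$worker_pid" 2>/dev/null || true # reap the killed process
echo "stand-in worker killed"
```

The `-f` flag is what makes `pkill -f celery` catch every worker process, since it matches the whole command line rather than just the process name.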
Hi Matthew,
The fundamental issue behind this bug is that Celery does not always seem to listen to the revoke
command (which is issued when a training job is to be stopped). I suspect this is due to the long-running nature of the deep learning models. Unfortunately, I have yet to find a solution that definitively stops running jobs, even forcefully.
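Until a proper fix exists, one blunt approach (a hedged sketch, not AIDE's official mechanism) is to escalate from a polite SIGTERM to a SIGKILL on the worker processes:

```shell
# Hedged sketch: when revoke is ignored, escalate signals on the worker.
# '[c]elery' matches any process whose command line contains "celery"
# without matching this script itself; `|| true` covers the no-match case.
pkill -TERM -f '[c]elery' || true   # ask workers to shut down cleanly
sleep 2
pkill -KILL -f '[c]elery' || true   # SIGKILL cannot be caught or ignored
echo "celery processes signalled"
```

SIGKILL is a last resort: the worker gets no chance to clean up, which is exactly why a restarted container may then need the worker relaunched by hand.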
Thank you Ben, I don't have the experience to contribute on that front I'm afraid (only as a user).
Regarding the initial solution: the docker commit image_name aide_app
and restart of the container are not necessary; the launch_celery.sh
script will reconnect the worker, after which the UI can be used to launch further workflows.
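For reference, a sketch of that recovery path (the container name `aide_app` and the script location are assumptions for illustration; adjust them to your deployment):

```shell
# Hedged sketch: reconnect the Celery worker in place rather than committing
# the image. The container name and script path below are assumptions.
if command -v docker >/dev/null 2>&1; then
    docker exec aide_app ./launch_celery.sh && status="worker relaunched" \
        || status="adjust the container name or script path"
else
    status="no docker here; run ./launch_celery.sh inside the container"
fi
echo "$status"
```

Because only the worker process is restarted, the container's state (and the commit step) is left alone entirely.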
Is it possible to stop training without using the UI? The workflows manager shows no training in progress, yet the terminal output from the project shows the model is still training. I would like to force AIDE to stop training the model, but when I kill or stop the image (either via Docker or inside the running image using, e.g.,
./AIDE.sh stop
) the training process keeps restarting. Do you know how this can be done?
Cheers, Matthew