Closed robinsonkwame closed 7 years ago
I also often find myself having to remove containers to get experiments to launch again (`docker-compose rm` is a safer way because it will only remove NEXT containers). I'd like to get to the bottom of this but I don't know where to begin.
Maybe someone should make a test file that attempts to submit 50 experiments and verifies they all launched.
In my experience the problem appears when you haven't removed the containers in a few days. I've had plenty of times when I've started 20+ experiments in quick succession without issue.
I think the trick will be replicating it and then turning on some very verbose outputs to pinpoint the problem. That and removing the containers one at a time to try to see which one is causing it. I'll try to figure it out the next time it happens to me.
Note: I observed this behavior just now when re-launching 7 experiments that were automatically backed up from yesterday. I deleted each container one by one and only saw a fix when I deleted the `local_rabbitmqredis_1` container. It's hard to say if `local_rabbitmqredis_1` caused the issue, because I was not able to delete it first (although the other containers were recreated). Yesterday I deleted the mongodb-related containers and also saw the problem go away.
Interesting - can you try this with the `CELERY_OFF` flag set to true in `Constants.py`?
I've been running into this a lot, and I've figured out what's going on, but not why it happens. For some unknown reason, when launching NEXT, the Celery workers and NEXT don't establish a working connection. Since initExp requires a synchronous call to Celery, and we poll for the job result, initExp will hang indefinitely waiting for Celery.
If NEXT launched successfully, you don't need to worry about this. It's immediately apparent when Celery has mislaunched, because any API request will hang.
Liam
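This is not NEXT's actual code, but a minimal sketch of how a synchronous poll like the one in initExp could be bounded so it fails loudly instead of hanging forever; `get_result` stands in for whatever call checks the Celery job's status:

```python
import time

def poll_with_timeout(get_result, timeout=30.0, interval=0.5):
    """Poll get_result() until it returns something other than None,
    raising TimeoutError once the deadline passes instead of hanging
    indefinitely (the failure mode described above)."""
    deadline = time.monotonic() + timeout
    while True:
        result = get_result()
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("no result before deadline; is Celery connected to the broker?")
        time.sleep(interval)
```

With a cap like this, a mislaunched Celery would surface as an immediate, explicit error on the first API request rather than a silent hang.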
If it's a Celery problem, maybe it'll go away if we update to the most recent version of Celery?
re: Celery, it looks like there have been worker-hanging problems as recently as 2016; https://github.com/celery/celery/issues/1847 is possibly related.
celery/celery#1960 seems to suggest increasing the number of workers.
Facing the issue this morning. Played with the number of workers/prefetch/etc. and tried various `inspect` commands and still nothing. `inspect` shows the queues to be empty.
Everything runs fine with `CELERY_ON = False`.
Tried a bunch of things to see if I could isolate it to rabbitmq but couldn't. I'm going to switch to my celery 4.0 branch and see if I can reproduce it.
Is there any chance it is the celery version? Also did you try turning Celery off? I am kind of curious to know if that does anything.
@lalitkumarj yeah I tried it with celery off and was able to launch an experiment.
Okay so I rebuilt using celery 4.0 and was still seeing the issue. Then I removed/recreated the rabbitmqredis and that fixed it! Next time it happens I'll try reproducing this, but this definitely helps narrow down the problem if true...
Thanks for the links to the Celery issues, I'll play a bit with this today as well. Can we just freeze the celery version (as it used to be)?
@lalitkumarj it is frozen https://github.com/nextml/NEXT/blob/master/next/base_docker_image/requirements.txt#L7
It happened again, and removing/recreating the `rabbitmqredis` container fixed it again. I would say it has to be that we're getting caught in one of the following `while` loops:
https://github.com/nextml/NEXT/blob/master/next/broker/broker.py#L115-L133
https://github.com/nextml/NEXT/blob/master/next/broker/broker.py#L176-L194
These are literally the only two places that the rabbitmqredis is used.
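A hedged sketch (illustrative, not the broker's actual logic) of one way to cap such a reconnect loop: give up after a bounded number of attempts rather than spinning forever against a dead or stale broker address:

```python
import time

def connect_with_retries(connect, max_attempts=20, delay=0.5):
    """Try connect() up to max_attempts times, sleeping between tries.
    Raising after the cap makes a broker that never answers visible
    as an error instead of an unbounded busy loop."""
    last_exc = None
    for _ in range(max_attempts):
        try:
            return connect()
        except ConnectionError as exc:
            last_exc = exc
            time.sleep(delay)
    raise RuntimeError("gave up connecting after %d attempts" % max_attempts) from last_exc
```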
Debugging this again this morning. It's not the while loops themselves causing the problem. I think I've nailed down the cause.
The problem is that every so often (I guess) Docker changes the container's id (and thus the hostname). `rabbitmqredis` caches the hostname in its database:
https://github.com/nextml/NEXT/blob/master/next/broker/broker.py#L189
So every so often the hostname changes, but because of the cache, the broker keeps trying to hit the old hostname.
The obvious solution is to set the key to expire; grabbing the hostname from disk is not terribly expensive. The problem is that the `__get_domain_for_job` function fails when the `minionworker` itself runs it, because the worker doesn't have `MINIONWORKER` anywhere in its `/etc/hosts`.
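One way to realize the expiring-key idea is a small TTL cache. This dict-based sketch is illustrative only (the real fix would presumably use Redis key expiry in `rabbitmqredis`), with `resolve` standing in for the cheap from-disk hostname lookup:

```python
import time

class ExpiringHostname:
    """Cache the resolved hostname for ttl seconds, then re-resolve, so a
    container whose hostname changed is picked up once the entry expires,
    instead of the stale cached value being served forever."""

    def __init__(self, resolve, ttl=60.0):
        self._resolve = resolve
        self._ttl = ttl
        self._value = None
        self._expires = 0.0

    def get(self):
        now = time.monotonic()
        if self._value is None or now >= self._expires:
            self._value = self._resolve()
            self._expires = now + self._ttl
        return self._value
```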
I think the solution is to put this at the end:

```python
if self.hostname is None:
    import socket
    self.hostname = socket.gethostname()
```

This seems to grab the correct hostname when the minionworker is calling the function.
I'll put together a pull request that does all this - we will want to test it a bunch to make sure `socket.gethostname()` is indeed doing what we want (e.g. is this something where different versions of Docker would screw us over?)
Also, I have no idea about the whole "multiple worker nodes" setup described in the docstring of that function. Is this still supported? Is this something that `socket.gethostname()` will break?
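For context, here is a self-contained sketch of the fallback being described. The hosts-file parsing and the `MINIONWORKER` marker follow the discussion above but are illustrative, not NEXT's exact implementation:

```python
import socket

def resolve_worker_hostname(hosts_path="/etc/hosts", marker="MINIONWORKER"):
    """Return the address listed for the worker in the hosts file, falling
    back to socket.gethostname() when no such entry exists (as happens
    when the minionworker container runs this lookup on itself)."""
    try:
        with open(hosts_path) as f:
            for line in f:
                if marker in line:
                    # hosts-file lines are "<address> <name> [aliases...]"
                    return line.split()[0]
    except OSError:
        pass
    return socket.gethostname()
```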
Branch that fixes this here: https://github.com/nextml/NEXT/compare/fix_minionworker_hostname
Tests pass on my up-to-date images at work. I can't test the normal ones here though so I'm holding off on the pull request.
Under a local NextML, with 6 unfinished experiments (using the PoolBasedBinaryClassification app), I observed an initialization hang (it never completed) when attempting to launch the strange fruit experiment. No errors were observed in the NextML output. The problem went away once I did `docker rm $(docker ps -a -q)`.