nextml / NEXT

NEXT is a machine learning system that runs in the cloud and makes it easy to develop, evaluate, and apply active learning in the real-world. Ask better questions. Get better results. Faster. Automated.
http://nextml.org
Apache License 2.0

Initialization hang when launching an experiment under assistant/init #193

Closed robinsonkwame closed 7 years ago

robinsonkwame commented 7 years ago

Under a local NextML deployment with 6 unfinished experiments (using the PoolBasedBinaryClassification app), I observed an initialization hang (it never completed) when attempting to launch the strange fruit experiment. No errors were observed in the NextML output.

The problem went away once I ran docker rm $(docker ps -a -q).

dconathan commented 7 years ago

I also often find myself having to remove containers to get experiments to launch again (docker-compose rm is a safer way because it only removes NEXT containers). I'd like to get to the bottom of this, but I don't know where to begin.

robinsonkwame commented 7 years ago

Maybe someone should make a test file that attempts to submit 50 experiments and verifies they all launched.
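
For reference, a rough sketch of what such a test might look like, assuming hypothetical launch_experiment() and experiment_is_ready() helpers that wrap the actual NEXT client calls (those names are placeholders, not real NEXT APIs):

# Hypothetical stress test: launch many experiments and check that each one
# finishes initialization within a deadline instead of hanging.
# launch_experiment() and experiment_is_ready() are placeholder helpers.
import time

N_EXPERIMENTS = 50
TIMEOUT_S = 300

def test_launch_many_experiments(launch_experiment, experiment_is_ready):
    exp_ids = [launch_experiment() for _ in range(N_EXPERIMENTS)]
    pending = set(exp_ids)
    deadline = time.time() + TIMEOUT_S
    while pending and time.time() < deadline:
        pending = {e for e in pending if not experiment_is_ready(e)}
        time.sleep(1)
    assert not pending, "experiments never initialized: %s" % sorted(pending)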

dconathan commented 7 years ago

In my experience, the problem shows up when you haven't removed the containers in a few days. I've had plenty of times when I've started 20+ experiments in quick succession without issue.

I think the trick will be replicating it and then turning on some very verbose output to pinpoint the problem. That, and removing the containers one at a time to see which one is causing it. I'll try to figure it out the next time it happens to me.

robinsonkwame commented 7 years ago

Note: I observed this behavior just now when re-launching 7 experiments that were automatically backed up from yesterday. I deleted each container one by one and only saw a fix when I deleted the local_rabbitmqredis_1 container. It's hard to say whether local_rabbitmqredis_1 caused the issue, because I was not able to delete it first (although the other containers were recreated). Yesterday I deleted the mongodb-related containers and also saw the problem go away.

lalitkumarj commented 7 years ago

Interesting - can you try this with the CELERY_OFF flag set to true in Constants.py?

erinzm commented 7 years ago

I've been running into this a lot, and I've figured out what's going on, but not why it happens. For some unknown reason, when launching NEXT, the Celery workers and NEXT don't establish a working connection. Since initExp requires a synchronous call to Celery, and we poll for the job result, initExp will hang indefinitely waiting for Celery.

If NEXT launched successfully, you don't need to worry about this. It's immediately apparent when Celery has mislaunched, as any API request will hang.

Liam
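
To make the failure mode concrete, here is a sketch (not NEXT's actual broker code; the broker URL, backend, and task are assumptions) of why a synchronous Celery call hangs when no worker is connected, and how a timeout would surface the problem as an error instead:

# Sketch only -- not NEXT's code. Broker/backend URLs and the task are assumptions.
from celery import Celery
from celery.exceptions import TimeoutError

app = Celery('demo', broker='amqp://rabbitmq//', backend='redis://redis/0')

@app.task
def init_exp(args):
    return {'exp_uid': 'demo'}

def launch(args):
    result = init_exp.apply_async(args=[args])
    try:
        # Without timeout=..., this .get() blocks indefinitely if the worker
        # never picks up the task -- exactly the observed initExp hang.
        return result.get(timeout=30)
    except TimeoutError:
        raise RuntimeError('Celery worker never answered; check the broker/worker connection')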

dconathan commented 7 years ago

If it's a Celery problem, maybe it'll go away if we update to the most recent version of Celery?

robinsonkwame commented 7 years ago

Re: Celery, it looks like there have been worker-hanging problems as recently as 2016; see https://github.com/celery/celery/issues/1847, which may be related.

dconathan commented 7 years ago

celery/celery#1960 seems to suggest increasing the number of workers.
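
A sketch of the kind of change that issue suggests (app name, broker URL, and the specific numbers are assumptions, not NEXT's actual configuration); the uppercase names are the celery 3.x-style settings:

# Sketch only: more worker processes and no prefetching, so one stuck task
# cannot hold back the rest of the queue.
from celery import Celery

app = Celery('demo', broker='amqp://rabbitmq//')
app.conf.update(
    CELERYD_CONCURRENCY=8,           # number of worker processes
    CELERYD_PREFETCH_MULTIPLIER=1,   # fetch one task per process at a time
)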

dconathan commented 7 years ago

Facing the issue again this morning. I played with the number of workers, prefetch settings, etc. and tried various inspect commands, but still nothing. inspect shows the queues as empty.

Everything runs fine with CELERY_ON = False
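
For anyone reproducing this, the inspect checks mentioned above look roughly like this through Celery's Python API (the broker URL and app name are assumptions):

# Diagnostic sketch: if ping() returns None, no worker is reachable at all,
# which matches the "initExp hangs forever" symptom.
from celery import Celery

app = Celery('demo', broker='amqp://rabbitmq//')
insp = app.control.inspect(timeout=5)

print('ping:    ', insp.ping())      # None => no worker answered
print('active:  ', insp.active())    # tasks currently executing
print('reserved:', insp.reserved())  # tasks prefetched but not yet running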

dconathan commented 7 years ago

Tried a bunch of things to see if I could isolate it to rabbitmq but couldn't. I'm going to switch to my celery 4.0 branch and see if I can reproduce it.

lalitkumarj commented 7 years ago

Is there any chance it is the celery version? Also did you try turning Celery off? I am kind of curious to know if that does anything.

dconathan commented 7 years ago

@lalitkumarj yeah I tried it with celery off and was able to launch an experiment.

dconathan commented 7 years ago

Okay so I rebuilt using celery 4.0 and was still seeing the issue. Then I removed/recreated the rabbitmqredis and that fixed it! Next time it happens I'll try reproducing this, but this definitely helps narrow down the problem if true...

lalitkumarj commented 7 years ago

Thanks for the links to the Celery issues, I'll play a bit with this today as well. Can we just freeze the celery version (as it used to be)?

dconathan commented 7 years ago

@lalitkumarj it is frozen https://github.com/nextml/NEXT/blob/master/next/base_docker_image/requirements.txt#L7

dconathan commented 7 years ago

It happened again and removing/recreating the rabbitmqredis container fixed it again. I would say it has to be that we're getting caught in one of the following while loops:

https://github.com/nextml/NEXT/blob/master/next/broker/broker.py#L115-L133

https://github.com/nextml/NEXT/blob/master/next/broker/broker.py#L176-L194

These are literally the only two places where rabbitmqredis is used.
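
For context, the pattern in those loops is roughly the sketch below (paraphrased, with placeholder callables, not the exact broker.py code). If the hostname cached in rabbitmqredis is stale, the reachability check fails on every iteration, and a loop without a bound like max_tries will spin forever:

# Paraphrased polling loop; get_cached_hostname and worker_is_reachable are
# placeholders standing in for the redis lookup and the connection check.
import time

def wait_for_worker(get_cached_hostname, worker_is_reachable, max_tries=60):
    for _ in range(max_tries):
        hostname = get_cached_hostname()   # may return a stale container hostname
        if hostname and worker_is_reachable(hostname):
            return hostname
        time.sleep(1)
    raise RuntimeError('worker never became reachable; cached hostname may be stale')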

dconathan commented 7 years ago

Debugging this again this morning. It's not the while loops themselves causing the problem. I think I nailed down the cause.

The problem is that every so often (I guess) docker changes the docker id for containers (and thus the hostname). rabbitmqredis caches the hostname in its database:

https://github.com/nextml/NEXT/blob/master/next/broker/broker.py#L189

So every so often the hostname changes, but because of the cache, the broker keeps trying to hit the old hostname.

The obvious solution is to set the key to expire. Grabbing the hostname from disk is not terribly expensive. The problem is that the __get_domain_for_job function fails when the minionworker itself runs it, because the worker doesn't have MINIONWORKER anywhere in its /etc/hosts.

I think the solution is to put this at the end:

# Fall back to the container's own hostname when /etc/hosts has no MINIONWORKER entry
if self.hostname is None:
    import socket
    self.hostname = socket.gethostname()

Which seems to grab the correct hostname when the minionworker is calling this function.

I'll put together a pull request that does all this - we will want to test this a bunch to make sure socket.gethostname() is indeed doing what we want (e.g. is this something where different versions of docker would screw us over?)
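
For the key-expiry idea, a minimal sketch of how it could fit together (the key name, TTL, redis connection details, and the /etc/hosts scan are assumptions, not the actual broker.py implementation):

# Sketch: cache the worker hostname with a TTL so a recreated container
# (new docker id, new hostname) is picked up once the cached entry expires.
import socket
import redis

r = redis.StrictRedis(host='rabbitmqredis', port=6379, db=0)

def hostname_from_etc_hosts(alias='MINIONWORKER'):
    # Placeholder for the lookup __get_domain_for_job performs: scan /etc/hosts
    # for a line mentioning the worker alias.
    try:
        with open('/etc/hosts') as f:
            for line in f:
                if alias in line:
                    return line.split()[-1]
    except IOError:
        pass
    return None

def get_worker_hostname(ttl_seconds=300):
    hostname = r.get('MINIONWORKER_HOSTNAME')
    if hostname is None:
        hostname = hostname_from_etc_hosts()
        if hostname is None:
            # Same fallback as above: the minionworker's own /etc/hosts has no
            # MINIONWORKER entry, so use its own hostname.
            hostname = socket.gethostname()
        r.set('MINIONWORKER_HOSTNAME', hostname, ex=ttl_seconds)  # entry expires after the TTL
    return hostname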

dconathan commented 7 years ago

Also, I have no idea about the whole "multiple worker nodes" setup described in the docstring of that function. Is that still supported? Is it something that socket.gethostname() will break?

dconathan commented 7 years ago

Branch that fixes this here: https://github.com/nextml/NEXT/compare/fix_minionworker_hostname

Tests pass on my up-to-date images at work. I can't test the normal ones here though so I'm holding off on the pull request.