panoptes / POCS

An Observatory Control System for the PANOPTES citizen-science project designed to help find transiting exoplanets! :telescope: :stars:
https://pocs.readthedocs.io/en/latest/
MIT License
77 stars 49 forks source link

test_messaging.py hangs (sometimes) #131

Closed jamessynge closed 6 years ago

jamessynge commented 6 years ago

Tests have been passing in general, but on this run test_messaging.py hung overnight.

$ pytest ========================================================== test session starts =========================================================== platform linux -- Python 3.6.1, pytest-3.0.7, py-1.4.33, pluggy-0.4.0 rootdir: /home/james/git/panoptes/POCS, inifile: Downloading http://maia.usno.navy.mil/ser7/finals2000A.all |====================================================================================================| 3.1M/3.1M (100.00%) 0s collected 264 items

pocs/tests/test_base_scheduler.py ............... pocs/tests/test_camera.py ........s..ssssss..................ssssssssssssssssssssssssss pocs/tests/test_codestyle.py F pocs/tests/test_config.py ............. pocs/tests/test_constraints.py .............. pocs/tests/test_database.py ... pocs/tests/test_dispatch_scheduler.py ........ pocs/tests/test_dome_simulator.py ...F pocs/tests/test_field.py .... pocs/tests/test_focuser.py .......sssssss.. pocs/tests/test_images.py .......... pocs/tests/test_ioptron.py . pocs/tests/test_messaging.py ...

This is the tail of panoptes.log, log records from messaging.py:

MainProcess(6322) MainThread 20171125032826 UTC DEBUG messaging.py: 80:create_subscriber Creating subscriber. Port: 54321 Channel: MainProcess(6322) MainThread 20171125032826 UTC DEBUG messaging.py: 57:create_publisher Creating publisher. Binding to port 12345 Process-2(7093) MainThread 20171125032826 UTC DEBUG messaging.py: 80:create_subscriber Creating subscriber. Port: 12345 Channel: Process-2(7093) MainThread 20171125032826 UTC DEBUG messaging.py: 57:create_publisher Creating publisher. Binding to port 54321 MainProcess(6322) MainThread 20171125032828 UTC DEBUG messaging.py: 80:create_subscriber Creating subscriber. Port: 54321 Channel: MainProcess(6322) MainThread 20171125032828 UTC DEBUG messaging.py: 57:create_publisher Creating publisher. Binding to port 12345 Process-3(7104) MainThread 20171125032828 UTC DEBUG messaging.py: 80:create_subscriber Creating subscriber. Port: 12345 Channel: Process-3(7104) MainThread 20171125032828 UTC DEBUG messaging.py: 57:create_publisher Creating publisher. Binding to port 54321 MainProcess(6322) MainThread 20171125032830 UTC DEBUG messaging.py: 80:create_subscriber Creating subscriber. Port: 54321 Channel: MainProcess(6322) MainThread 20171125032830 UTC DEBUG messaging.py: 57:create_publisher Creating publisher. Binding to port 12345 Process-4(7113) MainThread 20171125032830 UTC DEBUG messaging.py: 80:create_subscriber Creating subscriber. Port: 12345 Channel: Process-4(7113) MainThread 20171125032830 UTC DEBUG messaging.py: 57:create_publisher Creating publisher. Binding to port 54321

jamessynge commented 6 years ago

Can't repro after many attempts.

wtgee commented 6 years ago

This suddenly is happening to me again. Can't quite identify when but am re-opening issue.

jamessynge commented 6 years ago

I'm guessing the test uses a fixed port. I'd there a way to switch to using a socketpair, or an ephemeral port?

On Dec 7, 2017 12:37 AM, "Wilfred Tyler Gee" notifications@github.com wrote:

This suddenly is happening to me again. Can't quite identify when but am re-opening issue.

— You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHub https://github.com/panoptes/POCS/issues/131#issuecomment-349868427, or mute the thread https://github.com/notifications/unsubscribe-auth/ABYmU8ERRKiTJ-M8ynx6hfZabT7pjHyYks5s93mZgaJpZM4Qqd_z .

wtgee commented 6 years ago

Yes, it does use a fixed port and I agree that is the problem. I created the above Issue to address.

jamessynge commented 6 years ago

FYI, I can reproduce the failures ... irregularly. For example, test_send_datetime hangs on this line:

    msg_type, msg_obj = sub.receive_message()

I don't yet understand how PanMessaging works or is used, so will leave it to you (Wilfred) to debug.

wtgee commented 6 years ago

Thanks for info. I'll try to get to this soon.

jamessynge commented 6 years ago

FYI, this has hung several times in the last day during my testing. I'll see if I can add any more detail about why.

jamessynge commented 6 years ago

I notice that the failure pattern in all 3 tests within test_messaging.py is the same, i.e. they all fail when trying to receive the message sent on the previous line. This makes me wonder about zeromq's behavior w.r.t. store-and-forward or lack thereof. For example, is it the case that sometimes the receive_message call happens too late, and as a result the message has already been sent to the bit bucket? I'm going to test by adding a delay before all of those calls.

wtgee commented 6 years ago

Also, from manual:

There is one more important thing to know about PUB-SUB sockets: you do not know precisely when a subscriber starts to get messages. Even if you start a subscriber, wait a while, and then start the publisher, the subscriber will always miss the first messages that the publisher sends. This is because as the subscriber connects to the publisher (something that takes a small but non-zero time), the publisher may already be sending messages out.

Could be coming into play.

jamessynge commented 6 years ago

Wow! Given that "the subscriber will always miss the first" statement, how can we use this?

jamessynge commented 6 years ago

Maybe this is why you had included a sleep in the pub fixture?

wtgee commented 6 years ago

I want to say it is mitigated by the forwarder component, which is a little bit different from the default PUB-SUB pattern but I'd have to go through the manual again.

wtgee commented 6 years ago

Maybe this is why you had included a sleep in the pub fixture?

Right, I remember now. I was just looking for a sleep in messaging.py. The manual says you can do that although it is an inelegant solution and is probably what is causing our problems.

I think for the main messaging I assume the first message is never going to be critical in the long run, which is probably a bad assumption.