icecube / skymap_scanner

A distributed system that performs a likelihood scan of event directions for IceCube real-time alerts using CPU cluster(s) and queue-based message passing.

Tests sometimes hang but then complete upon rerun #216

tianluyuan closed this issue 1 year ago

tianluyuan commented 1 year ago

It seems that the millipede tests can sometimes hang, or at least run indefinitely. For an example, see here

I think either the server crashed or the client didn't return any results. The former is more likely, as there isn't a consistent progress update. What's strange is that the same test can be rerun and will then complete as expected.

tianluyuan commented 1 year ago

https://github.com/icecube/skymap_scanner/actions/runs/6186677592/job/16794960197#step:6:1

It looks like the server is not crashing, but it keeps running without registering the result returned by the client. The clients also do not seem to receive or return any additional pixels.
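One way to turn a silent hang like this into a quick test failure is to bound the awaited step with a timeout. The following is only a minimal sketch: `send_result` and `send_with_timeout` are hypothetical stand-ins for a client's send step, not the actual ewms-pilot or mqclient API, and the delays are made up for illustration.

```python
import asyncio


async def send_result(delay):
    """Hypothetical stand-in for a send step; sleeps to simulate a slow broker."""
    await asyncio.sleep(delay)
    return "sent"


async def send_with_timeout(delay, timeout):
    """Return the send result, or None if the send exceeds the timeout."""
    try:
        # wait_for cancels the inner coroutine once the timeout expires.
        return await asyncio.wait_for(send_result(delay), timeout=timeout)
    except asyncio.TimeoutError:
        return None


# A send that finishes in time, and one that takes longer than allowed.
fast = asyncio.run(send_with_timeout(delay=0.01, timeout=1.0))
slow = asyncio.run(send_with_timeout(delay=1.0, timeout=0.05))
print(fast, slow)
```

With a bound like this, a stuck send would surface as an explicit failure instead of a job that runs until the CI limit.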

tianluyuan commented 1 year ago

This is the partial log of a test that completes:

2023-09-14 14:20:40 fv-az221-946 ewms-pilot[12] INFO TASK FINISHED -- attempting to send result message...
2023-09-14 14:20:40 fv-az221-946 mqclient[12] INFO Sending Message: 442 bytes
2023-09-14 14:20:40 fv-az221-946 ewms-pilot[12] INFO Now, attempting to ack original message...
2023-09-14 14:20:40 fv-az221-946 ewms-pilot[12] INFO 1 Tasks Finished
2023-09-14 14:20:40 fv-az221-946 mqclient[12] INFO Received Message: 77

This one doesn’t:

2023-09-14 14:20:48 fv-az1128-620 ewms-pilot[12] INFO TASK FINISHED -- attempting to send result message...
2023-09-14 14:20:48 fv-az1128-620 mqclient[12] INFO Sending Message: 442 bytes
2023-09-14 14:46:35 fv-az1128-620 mqclient.rabbitmq[12] INFO [message_generator()] No messages in idle timeout window.
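The difference between the two logs is the long silence after "Sending Message" in the second one. A small script can flag gaps like this automatically. This is a rough sketch, assuming every log line starts with a `YYYY-MM-DD HH:MM:SS` timestamp as in the excerpts above. The `find_stalls` helper and the 5-minute threshold are illustrative choices, not part of skymap_scanner.

```python
import re
from datetime import datetime, timedelta

# Matches the leading "YYYY-MM-DD HH:MM:SS" timestamp of each log line.
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) ")


def find_stalls(lines, threshold=timedelta(minutes=5)):
    """Return (gap, previous_line, next_line) for each gap longer than threshold."""
    stalls = []
    prev_time, prev_line = None, None
    for line in lines:
        match = TIMESTAMP.match(line)
        if not match:
            continue  # skip lines without a timestamp
        now = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        if prev_time is not None and now - prev_time > threshold:
            stalls.append((now - prev_time, prev_line, line))
        prev_time, prev_line = now, line
    return stalls


# The three lines from the hanging run quoted above.
hanging_log = [
    "2023-09-14 14:20:48 fv-az1128-620 ewms-pilot[12] INFO TASK FINISHED -- attempting to send result message...",
    "2023-09-14 14:20:48 fv-az1128-620 mqclient[12] INFO Sending Message: 442 bytes",
    "2023-09-14 14:46:35 fv-az1128-620 mqclient.rabbitmq[12] INFO [message_generator()] No messages in idle timeout window.",
]

for gap, before, after in find_stalls(hanging_log):
    print(f"stalled for {gap} after: {before}")
```

For the hanging run, this reports a single stall right after the "Sending Message: 442 bytes" line, which matches the missing "attempting to ack original message" step seen in the completing run.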

This links to the full failed log

tianluyuan commented 1 year ago

Closed by #211