myzhan / locust4j

Locust4j is a load generator for Locust, written in Java.
MIT License

Distributed workers stop reporting request stats #28

Closed lolwhitaker closed 3 years ago

lolwhitaker commented 3 years ago

Hi. I am using locust4j 1.0.12 with locust master 1.4.3. My master is on a remote host from the workers. I've frequently noticed that when starting and stopping tests via the UI, the workers stop reporting metrics to the master. This does not happen every time; I have seen it restart successfully too. But I can get into this state pretty easily by just:

  1. Bring up distributed worker
  2. Start test in UI, let it run a few seconds.
  3. Stop test in UI
  4. Start new test in UI

It does not seem to be load related, as it happens even with very light load. Restarting the worker restores the stats, but the issue can be reproduced again. Heartbeats appear unaffected: the worker reports its state and CPU stats without issue.

On another note, when I restart the worker, the worker count increases in the UI despite the Workers tab reflecting that the old worker is missing (it also continues to report CPU stats for the missing worker). Presumably this is all because I am still using the same host when I restart the worker.

Let me know if I can provide additional details. I have been using tcpdump to inspect the traffic on port 5557, and it's pretty clear that the heartbeat is consistently sent but the stat packets stop being sent. I can also see that the workers are actually executing the tasks; they are just not reporting the stats for them.

lolwhitaker commented 3 years ago

I'll also note that when this happens, my worker quickly runs out of memory:

[screenshot attached]
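The memory growth is consistent with a reporter thread that has stopped draining a queue of stats entries while task threads keep filling it. This is a hypothetical illustration (the class and queue here are not from locust4j): an unbounded queue grows until the JVM OOMs, whereas a bounded queue with offer() sheds entries instead.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class StatsQueueDemo {
    public static void main(String[] args) {
        // Hypothetical sketch: if the consumer (the stats reporter) stops
        // draining, producers keep enqueueing and an unbounded queue grows
        // without limit. A small bounded queue makes the failure visible:
        BlockingQueue<String> bounded = new ArrayBlockingQueue<>(2);
        System.out.println(bounded.offer("req-1")); // accepted
        System.out.println(bounded.offer("req-2")); // accepted
        System.out.println(bounded.offer("req-3")); // rejected: queue full
    }
}
```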

lolwhitaker commented 3 years ago

fwiw, I have not yet been able to reproduce this running with a local master and worker on the same host.

myzhan commented 3 years ago

@lolwhitaker Thanks for reporting. When you reproduce this issue, can you get me a jstack sample? And since you used tcpdump to confirm there are no stat packets, I think the state of the runner may be wrong. See also: https://github.com/myzhan/locust4j/blob/master/src/main/java/com/github/myzhan/locust4j/runtime/Runner.java#L385
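This would explain the symptoms: if stats are only reported while the runner believes it is running, a missed or mishandled state transition silently stops stat packets while an unconditional heartbeat keeps going. A minimal sketch of that kind of guard (the enum and method names here are illustrative, not locust4j's actual API):

```java
import java.util.concurrent.atomic.AtomicReference;

public class RunnerStateDemo {
    enum State { READY, RUNNING, STOPPED }

    // Hypothetical state flag checked by the stats timer before each report.
    static final AtomicReference<State> state = new AtomicReference<>(State.READY);

    static boolean shouldReportStats() {
        return state.get() == State.RUNNING;
    }

    public static void main(String[] args) {
        state.set(State.RUNNING);
        System.out.println(shouldReportStats()); // stats flow while RUNNING

        // If a stop arrives and the next start message is missed or
        // mishandled, the runner stays STOPPED: the stats timer goes
        // quiet even though tasks may still be executing.
        state.set(State.STOPPED);
        System.out.println(shouldReportStats());
    }
}
```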

lolwhitaker commented 3 years ago

Yeah, I think you are right; the state is probably the issue.

I think I understand at least part of my issue now. Locust is only set up to handle SIGTERM, as far as I can tell: https://github.com/locustio/locust/blob/b000d5f229e6a0bf7b6f5472667b29e925371538/locust/main.py#L417-L422

If I kill -9 the master process and then restart it, the locust workers never report back in to the new master. Using kill -15 locally, I do see that they properly deregister:

18:02:55.269 [locust4j-stats#0#stats-timer] stats.Stats [DEBUG] - Sleeping
18:02:58.228 [Thread-9receive-from-client] runtime.Runner [INFO] - Received message: stop
18:02:58.230 [Thread-9receive-from-client] runtime.Runner [DEBUG] - Recv stop message from master, all the workers are stopped
18:02:58.231 [Thread-9receive-from-client] runtime.Runner [INFO] - Received message: quit
18:02:58.231 [Thread-9receive-from-client] runtime.Runner [DEBUG] - Got quit message from master, shutting down...
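The distinction matters because a JVM (or Python) process gets a chance to clean up on SIGTERM (kill -15) but not on SIGKILL (kill -9). On the Java side, the analogous mechanism is a shutdown hook; this is a generic sketch, not locust4j's actual shutdown path:

```java
public class GracefulShutdownDemo {
    public static void main(String[] args) {
        // Shutdown hooks run on normal exit and on SIGTERM, but never on
        // SIGKILL -- so a kill -9'd master never gets to tell its workers
        // to stop and deregister. (The message below is illustrative.)
        Runtime.getRuntime().addShutdownHook(new Thread(() ->
                System.out.println("shutdown hook: deregistering from master")));
        System.out.println("process running; SIGTERM triggers the hook, SIGKILL skips it");
    }
}
```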

I am using AWS to bring up and down my master and worker processes. As I understand it, when I terminate an instance in AWS, systemd gets a chance to gracefully shut down the running processes (which sends a SIGTERM). When the new master comes up it is on a different host, but behind the same ELB via the same DNS name in Route 53. The workers appear to register successfully with the new master to send heartbeats, but maybe some other state flag is preventing the stats? I haven't yet confirmed exactly what happens on the AWS side from the worker's perspective. I will need to enable some debug flags on the worker and dig into this more to understand exactly what it receives. It is probably specific to how I am running things.

myzhan commented 3 years ago

You can add some logs to trace the state of the locust4j runner and the messages sent by the master.
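One way to do that is to log every message received from the master together with the runner state before and after it is handled, so a bad transition shows up directly in the worker log. A minimal sketch, assuming illustrative message names and a simplified state machine (not locust4j's actual message handler):

```java
public class StateTraceDemo {
    enum State { READY, RUNNING, STOPPED }
    static State state = State.READY;

    // Hypothetical trace point: print the message type and the state
    // transition it caused, making a stuck or skipped transition obvious.
    static void onMessage(String type) {
        State before = state;
        switch (type) {
            case "spawn": state = State.RUNNING; break;
            case "stop":  state = State.STOPPED; break;
            default: break; // heartbeat etc. leave state unchanged
        }
        System.out.printf("msg=%s state %s -> %s%n", type, before, state);
    }

    public static void main(String[] args) {
        onMessage("spawn");
        onMessage("stop");
        onMessage("spawn"); // a missed transition here would leave STOPPED
    }
}
```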

lolwhitaker commented 3 years ago

I never got to the bottom of this and don't think I ever will at this point. I am avoiding killing the master manually with kill -9. I think there might still be some edge case with the states lurking, but it's probably a pretty hairy scenario to reach, and it would be better tracked in a simpler use case if someone is able to repro.