I see no race condition from the thread stack. If some workers are missing on the web page, you can check the logs of the locust master; maybe the workers don't send heartbeat requests as fast as expected because of the k8s node performance, and then they are removed by the master.
I checked the master; it does not consider a worker alive because the problematic workers never send 'client_ready' or 'heartbeat' messages. They never send them because the worker's main thread is stuck in a loop inside this.rpcClient.send(new Message("client_ready", null, this.nodeID)); more specifically in zmq.Signaler.recv (see how much CPU time it has accumulated there).
Here are some captures. Good situation:
00000000 ff 00 00 00 00 00 00 00 01 7f ........ ..
00000000 ff 00 00 00 00 00 00 00 40 7f 03 ........ @..
0000000A 03 01 4e 55 4c 4c 00 00 00 00 00 00 00 00 00 00 ..NULL.. ........
0000001A 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ........ ........
0000002A 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ........ ........
0000003A 00 00 00 00 00 00 ......
0000000B 00 4e 55 4c 4c 00 00 00 00 00 00 00 00 00 00 00 .NULL... ........
0000001B 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ........ ........
0000002B 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ........ ........
0000003B 00 00 00 00 00 .....
00000040 04 29 05 52 45 41 44 59 0b 53 6f 63 6b 65 74 2d .).READY .Socket-
00000050 54 79 70 65 00 00 00 06 52 4f 55 54 45 52 08 49 Type.... ROUTER.I
00000060 64 65 6e 74 69 74 79 00 00 00 00 dentity. ...
00000040 04 68 05 52 45 41 44 59 0b 53 6f 63 6b 65 74 2d .h.READY .Socket-
00000050 54 79 70 65 00 00 00 06 44 45 41 4c 45 52 08 49 Type.... DEALER.I
00000060 64 65 6e 74 69 74 79 00 00 00 3f 6c 6f 63 75 73 dentity. ..?locus
00000070 74 2d 77 6f 72 6b 65 72 2d 36 37 36 38 39 36 39 t-worker -6768969
00000080 63 62 37 2d 74 36 32 70 6b 5f 38 63 39 30 61 39 cb7-t62p k_8c90a9
00000090 39 33 31 65 64 34 36 37 62 62 61 61 65 62 31 66 931ed467 bbaaeb1f
000000A0 37 33 38 62 36 38 62 30 34 36 738b68b0 46
000000AA 00 50 93 ac 63 6c 69 65 6e 74 5f 72 65 61 64 79 .P..clie nt_ready
000000BA c0 d9 3f 6c 6f 63 75 73 74 2d 77 6f 72 6b 65 72 ..?locus t-worker
000000CA 2d 36 37 36 38 39 36 39 63 62 37 2d 74 36 32 70 -6768969 cb7-t62p
000000DA 6b 5f 38 63 39 30 61 39 39 33 31 65 64 34 36 37 k_8c90a9 931ed467
000000EA 62 62 61 61 65 62 31 66 37 33 38 62 36 38 62 30 bbaaeb1f 738b68b0
000000FA 34 36 46
000000FC 00 74 93 a9 68 65 61 72 74 62 65 61 74 82 a5 73 .t..hear tbeat..s
0000010C 74 61 74 65 a5 72 65 61 64 79 b1 63 75 72 72 65 tate.rea dy.curre
0000011C 6e 74 5f 63 70 75 5f 75 73 61 67 65 cb 00 00 00 nt_cpu_u sage....
0000012C 00 00 00 00 00 d9 3f 6c 6f 63 75 73 74 2d 77 6f ......?l ocust-wo
0000013C 72 6b 65 72 2d 36 37 36 38 39 36 39 63 62 37 2d rker-676 8969cb7-
0000014C 74 36 32 70 6b 5f 38 63 39 30 61 39 39 33 31 65 t62pk_8c 90a9931e
0000015C 64 34 36 37 62 62 61 61 65 62 31 66 37 33 38 62 d467bbaa eb1f738b
0000016C 36 38 62 30 34 36 68b046
00000172 00 74 93 a9 68 65 61 72 74 62 65 61 74 82 a5 73 .t..hear tbeat..s
00000182 74 61 74 65 a5 72 65 61 64 79 b1 63 75 72 72 65 tate.rea dy.curre
00000192 6e 74 5f 63 70 75 5f 75 73 61 67 65 cb 40 10 29 nt_cpu_u sage.@.)
000001A2 5f ad 40 a5 7f d9 3f 6c 6f 63 75 73 74 2d 77 6f _.@...?l ocust-wo
000001B2 72 6b 65 72 2d 36 37 36 38 39 36 39 63 62 37 2d rker-676 8969cb7-
000001C2 74 36 32 70 6b 5f 38 63 39 30 61 39 39 33 31 65 t62pk_8c 90a9931e
000001D2 64 34 36 37 62 62 61 61 65 62 31 66 37 33 38 62 d467bbaa eb1f738b
000001E2 36 38 62 30 34 36 68b046
They establish a connection (ZMTP signature, version, NULL mechanism), exchange identities via the READY command (ROUTER/DEALER), and the worker starts sending heartbeats from that point on.
Bad situation:
00000000 ff 00 00 00 00 00 00 00 01 7f ........ ..
00000000 ff 00 00 00 00 00 00 00 40 7f 03 ........ @..
0000000A 03 01 4e 55 4c 4c 00 00 00 00 00 00 00 00 00 00 ..NULL.. ........
0000001A 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ........ ........
0000002A 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ........ ........
0000003A 00 00 00 00 00 00 ......
0000000B 00 4e 55 4c 4c 00 00 00 00 00 00 00 00 00 00 00 .NULL... ........
0000001B 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ........ ........
0000002B 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ........ ........
0000003B 00 00 00 00 00 .....
00000040 04 29 05 52 45 41 44 59 0b 53 6f 63 6b 65 74 2d .).READY .Socket-
00000050 54 79 70 65 00 00 00 06 52 4f 55 54 45 52 08 49 Type.... ROUTER.I
00000060 64 65 6e 74 69 74 79 00 00 00 00 dentity. ...
00000040 04 68 05 52 45 41 44 59 0b 53 6f 63 6b 65 74 2d .h.READY .Socket-
00000050 54 79 70 65 00 00 00 06 44 45 41 4c 45 52 08 49 Type.... DEALER.I
00000060 64 65 6e 74 69 74 79 00 00 00 3f 6c 6f 63 75 73 dentity. ..?locus
00000070 74 2d 77 6f 72 6b 65 72 2d 36 37 36 38 39 36 39 t-worker -6768969
00000080 63 62 37 2d 63 36 6a 71 74 5f 39 36 65 35 30 36 cb7-c6jq t_96e506
00000090 36 35 38 36 66 36 30 32 37 37 30 36 66 64 64 63 6586f602 7706fddc
000000A0 37 66 30 65 39 30 62 63 32 33 7f0e90bc 23
and that's it: the worker never sends client_ready or heartbeat.
My guess is that the server's message with Socket-Type ROUTER / Identity gets processed by two threads simultaneously: the Receiver thread and the main thread. Usually this never happens on fast enough machines, because the Receiver thread manages to process it before the main thread falls into the zmq.SocketBase.processCommands check. But on a busy machine the main thread might accidentally go into that check and find there is something to read from the socket, while the Receiver thread flushes all the buffers in the meantime.
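If that guess is right, the problematic pattern boils down to one DEALER socket used by two threads at once. Here is a minimal self-contained sketch of that pattern (the endpoint and identity are made up for illustration; this is not the actual locust4j code):

// Sketch of the suspect pattern: a shared DEALER socket. ZeroMQ sockets
// are not thread-safe; both send() and recv() run
// zmq.SocketBase.processCommands(), so the two threads can race on the
// socket's internal command mailbox, and one of them may block forever.
import org.zeromq.ZMQ;

public class SharedSocketRace {
    public static void main(String[] args) {
        ZMQ.Context context = ZMQ.context(1);
        final ZMQ.Socket socket = context.socket(ZMQ.DEALER);
        socket.setIdentity("worker-1".getBytes());
        socket.connect("tcp://locust-master:5557");

        // Receiver thread: blocks in recv(), which also drains the
        // command mailbox via processCommands().
        new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                socket.recv(0); // unsafe: concurrent use of the socket
            }
        }, "receive-from-client").start();

        // Main thread: send() on the same socket may enter
        // processCommands() at the same moment and get stuck in
        // zmq.Signaler.recv, waiting on a signal the receiver consumed.
        socket.send("client_ready".getBytes(), 0);
    }
}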
When you reproduce this issue, are you sure the locust master is reachable? I think a dealer socket connects to the master synchronously, and that would block the main thread from sending client_ready. I have also added some debug logging in https://github.com/myzhan/locust4j/commit/166e152a45dcd118648c06c80d3d7e80de8e5235 ; you can pull the master branch and build it.
BTW, it would be better if you could provide a pcap file.
It always gets past the connection point, so I see the 'connected to ...' message in all cases; no connectivity issues.
I patched the jeromq lib a bit around processCommands with additional logging.
Good case:
2021-02-26 11:17:19.768 [main] DEBUG kek - hello world
2021-02-26 11:17:22.176 [main] DEBUG Socketbase - Entering processCommands
2021-02-26 11:17:22.177 [main] DEBUG Socketbase - exiting processCommands
2021-02-26 11:17:22.258 [main] DEBUG com.github.myzhan.locust4j.rpc.ZeromqClient - Locust4j is connected to master(locust-master:5557)
2021-02-26 11:17:22.267 [Thread-2receive-from-client] DEBUG com.github.myzhan.locust4j.runtime.Runner - Receiver started
2021-02-26 11:17:22.269 [Thread-2receive-from-client] DEBUG Socketbase - Entering processCommands
2021-02-26 11:17:22.306 [main] DEBUG Socketbase - Entering processCommands
2021-02-26 11:17:22.307 [main] DEBUG Socketbase - exiting processCommands
2021-02-26 11:17:22.456 [Thread-2receive-from-client] DEBUG Socketbase - Processing cmd Cmd[Dealer[1], 3, OWN, SessionBase-Dealer[1]]
2021-02-26 11:17:22.457 [Thread-2receive-from-client] DEBUG Socketbase - exiting processCommands
2021-02-26 11:17:22.458 [Thread-2receive-from-client] DEBUG Socketbase - Entering processCommands
2021-02-26 11:17:23.495 [Thread-4heartbeat] DEBUG Socketbase - Entering processCommands
2021-02-26 11:17:23.497 [Thread-4heartbeat] DEBUG Socketbase - exiting processCommands
2021-02-26 11:17:24.507 [Thread-4heartbeat] DEBUG Socketbase - Entering processCommands
2021-02-26 11:17:24.507 [Thread-4heartbeat] DEBUG Socketbase - exiting processCommands
2021-02-26 11:17:25.515 [Thread-4heartbeat] DEBUG Socketbase - Entering processCommands
2021-02-26 11:17:25.515 [Thread-4heartbeat] DEBUG Socketbase - exiting processCommands
2021-02-26 11:17:26.518 [Thread-4heartbeat] DEBUG Socketbase - Entering processCommands
2021-02-26 11:17:26.518 [Thread-4heartbeat] DEBUG Socketbase - exiting processCommands
2021-02-26 11:17:27.522 [Thread-4heartbeat] DEBUG Socketbase - Entering processCommands
2021-02-26 11:17:27.522 [Thread-4heartbeat] DEBUG Socketbase - exiting processCommands
2021-02-26 11:17:28.535 [Thread-4heartbeat] DEBUG Socketbase - Entering processCommands
2021-02-26 11:17:28.536 [Thread-4heartbeat] DEBUG Socketbase - exiting processCommands
2021-02-26 11:17:29.538 [Thread-4heartbeat] DEBUG Socketbase - Entering processCommands
2021-02-26 11:17:29.538 [Thread-4heartbeat] DEBUG Socketbase - exiting processCommands
...
Bad case:
2021-02-26 11:17:19.742 [main] DEBUG kek - hello world
2021-02-26 11:17:22.277 [main] DEBUG Socketbase - Entering processCommands
2021-02-26 11:17:22.277 [main] DEBUG Socketbase - exiting processCommands
2021-02-26 11:17:22.326 [main] DEBUG com.github.myzhan.locust4j.rpc.ZeromqClient - Locust4j is connected to master(locust-master:5557)
2021-02-26 11:17:22.427 [Thread-2receive-from-client] DEBUG com.github.myzhan.locust4j.runtime.Runner - Receiver started
2021-02-26 11:17:22.472 [Thread-2receive-from-client] DEBUG Socketbase - Entering processCommands
2021-02-26 11:17:22.471 [main] DEBUG Socketbase - Entering processCommands
2021-02-26 11:17:22.547 [Thread-2receive-from-client] DEBUG Socketbase - Processing cmd Cmd[Dealer[1], 3, OWN, SessionBase-Dealer[1]]
2021-02-26 11:17:22.548 [Thread-2receive-from-client] DEBUG Socketbase - exiting processCommands
2021-02-26 11:17:22.548 [Thread-2receive-from-client] DEBUG Socketbase - Entering processCommands
The main and receiver threads enter processCommands almost simultaneously, but sometimes the main thread is not lucky enough.
Here is the full stack trace for all threads:
# jstack -l -e 1
2021-02-26 11:23:43
Full thread dump OpenJDK 64-Bit Server VM (14.0.2+12 mixed mode, sharing):
Threads class SMR info:
_java_thread_list=0x00007fcf14002880, length=17, elements={
0x00007fcf44028800, 0x00007fcf44062000, 0x00007fcf44064000, 0x00007fcf4406b000,
0x00007fcf4406d000, 0x00007fcf4406f000, 0x00007fcf44071000, 0x00007fcf44073000,
0x00007fcf440b5800, 0x00007fcf440be800, 0x00007fcf442b4000, 0x00007fcf442bb800,
0x00007fcf442bd000, 0x00007fcf442de000, 0x00007fcf442ed800, 0x00007fcf44309800,
0x00007fcf14001000
}
"main" #1 prio=5 os_prio=0 cpu=349343.71ms elapsed=384.21s allocated=13869K defined_classes=1083 tid=0x00007fcf44028800 nid=0xd runnable [0x00007fcf49970000]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.FileDispatcherImpl.read0(java.base@14.0.2/Native Method)
at sun.nio.ch.FileDispatcherImpl.read(java.base@14.0.2/FileDispatcherImpl.java:48)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(java.base@14.0.2/IOUtil.java:276)
at sun.nio.ch.IOUtil.read(java.base@14.0.2/IOUtil.java:245)
at sun.nio.ch.IOUtil.read(java.base@14.0.2/IOUtil.java:223)
at sun.nio.ch.SourceChannelImpl.read(java.base@14.0.2/SourceChannelImpl.java:277)
at zmq.Signaler.recv(Signaler.java:159)
at zmq.Mailbox.recv(Mailbox.java:97)
at zmq.SocketBase.processCommands(SocketBase.java:935)
at zmq.SocketBase.send(SocketBase.java:699)
at org.zeromq.ZMQ$Socket.send(ZMQ.java:2588)
at org.zeromq.ZMQ$Socket.send(ZMQ.java:2577)
at com.github.myzhan.locust4j.rpc.ZeromqClient.send(ZeromqClient.java:48)
at com.github.myzhan.locust4j.runtime.Runner.getReady(Runner.java:340)
at com.github.myzhan.locust4j.Locust.run(Locust.java:183)
- locked <0x00000000a0144b10> (a com.github.myzhan.locust4j.Locust)
at com.github.myzhan.locust4j.Locust.run(Locust.java:158)
at com.sampleworker.App.main(App.java:84)
Locked ownable synchronizers:
- <0x00000000a0e07580> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
"Reference Handler" #2 daemon prio=10 os_prio=0 cpu=0.18ms elapsed=384.18s allocated=0B defined_classes=0 tid=0x00007fcf44062000 nid=0xf waiting on condition [0x00007fcf2bbe1000]
java.lang.Thread.State: RUNNABLE
at java.lang.ref.Reference.waitForReferencePendingList(java.base@14.0.2/Native Method)
at java.lang.ref.Reference.processPendingReferences(java.base@14.0.2/Reference.java:241)
at java.lang.ref.Reference$ReferenceHandler.run(java.base@14.0.2/Reference.java:213)
Locked ownable synchronizers:
- None
"Finalizer" #3 daemon prio=8 os_prio=0 cpu=0.25ms elapsed=384.18s allocated=0B defined_classes=0 tid=0x00007fcf44064000 nid=0x10 in Object.wait() [0x00007fcf2bae0000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(java.base@14.0.2/Native Method)
- waiting on <0x000000009fe0b228> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(java.base@14.0.2/ReferenceQueue.java:155)
- locked <0x000000009fe0b228> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(java.base@14.0.2/ReferenceQueue.java:176)
at java.lang.ref.Finalizer$FinalizerThread.run(java.base@14.0.2/Finalizer.java:170)
Locked ownable synchronizers:
- None
"Signal Dispatcher" #4 daemon prio=9 os_prio=0 cpu=0.48ms elapsed=384.17s allocated=528B defined_classes=0 tid=0x00007fcf4406b000 nid=0x11 runnable [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
Locked ownable synchronizers:
- None
"Service Thread" #5 daemon prio=9 os_prio=0 cpu=0.19ms elapsed=384.17s allocated=0B defined_classes=0 tid=0x00007fcf4406d000 nid=0x12 runnable [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
Locked ownable synchronizers:
- None
"C2 CompilerThread0" #6 daemon prio=9 os_prio=0 cpu=474.39ms elapsed=384.17s allocated=0B defined_classes=0 tid=0x00007fcf4406f000 nid=0x13 waiting on condition [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
No compile task
Locked ownable synchronizers:
- None
"C1 CompilerThread0" #7 daemon prio=9 os_prio=0 cpu=511.58ms elapsed=384.17s allocated=7768B defined_classes=0 tid=0x00007fcf44071000 nid=0x14 waiting on condition [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
No compile task
Locked ownable synchronizers:
- None
"Sweeper thread" #8 daemon prio=9 os_prio=0 cpu=8.89ms elapsed=384.16s allocated=0B defined_classes=0 tid=0x00007fcf44073000 nid=0x15 runnable [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
Locked ownable synchronizers:
- None
"Notification Thread" #9 daemon prio=9 os_prio=0 cpu=0.10ms elapsed=384.07s allocated=0B defined_classes=0 tid=0x00007fcf440b5800 nid=0x16 runnable [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
Locked ownable synchronizers:
- None
"Common-Cleaner" #10 daemon prio=8 os_prio=0 cpu=0.73ms elapsed=384.06s allocated=0B defined_classes=0 tid=0x00007fcf440be800 nid=0x18 in Object.wait() [0x00007fcf2aff1000]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(java.base@14.0.2/Native Method)
- waiting on <0x000000009ff15a18> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(java.base@14.0.2/ReferenceQueue.java:155)
- locked <0x000000009ff15a18> (a java.lang.ref.ReferenceQueue$Lock)
at jdk.internal.ref.CleanerImpl.run(java.base@14.0.2/CleanerImpl.java:148)
at java.lang.Thread.run(java.base@14.0.2/Thread.java:832)
at jdk.internal.misc.InnocuousThread.run(java.base@14.0.2/InnocuousThread.java:134)
Locked ownable synchronizers:
- None
"HttpClient-1-SelectorManager" #11 daemon prio=5 os_prio=0 cpu=16.06ms elapsed=381.44s allocated=19248B defined_classes=3 tid=0x00007fcf442b4000 nid=0x19 runnable [0x00007fcf2a66e000]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPoll.wait(java.base@14.0.2/Native Method)
at sun.nio.ch.EPollSelectorImpl.doSelect(java.base@14.0.2/EPollSelectorImpl.java:120)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(java.base@14.0.2/SelectorImpl.java:129)
- locked <0x00000000a09eb2a0> (a sun.nio.ch.Util$2)
- locked <0x00000000a09eb020> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(java.base@14.0.2/SelectorImpl.java:141)
at jdk.internal.net.http.HttpClientImpl$SelectorManager.run(java.net.http@14.0.2/HttpClientImpl.java:878)
Locked ownable synchronizers:
- None
"locust4j-stats#0#stats-timer" #12 prio=5 os_prio=0 cpu=16.04ms elapsed=381.43s allocated=17720B defined_classes=1 tid=0x00007fcf442bb800 nid=0x1a waiting on condition [0x00007fcf2a56d000]
java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(java.base@14.0.2/Native Method)
at com.github.myzhan.locust4j.stats.Stats$StatsTimer.run(Stats.java:256)
at java.util.concurrent.Executors$RunnableAdapter.call(java.base@14.0.2/Executors.java:515)
at java.util.concurrent.FutureTask.run(java.base@14.0.2/FutureTask.java:264)
at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@14.0.2/ThreadPoolExecutor.java:1130)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@14.0.2/ThreadPoolExecutor.java:630)
at java.lang.Thread.run(java.base@14.0.2/Thread.java:832)
Locked ownable synchronizers:
- <0x00000000a0a0ad38> (a java.util.concurrent.ThreadPoolExecutor$Worker)
"locust4j-stats#1#stats" #13 prio=5 os_prio=0 cpu=15.67ms elapsed=381.42s allocated=168K defined_classes=1 tid=0x00007fcf442bd000 nid=0x1b in Object.wait() [0x00007fcf2a46c000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(java.base@14.0.2/Native Method)
- waiting on <0x00000000a09fa918> (a java.lang.Object)
at java.lang.Object.wait(java.base@14.0.2/Object.java:321)
at com.github.myzhan.locust4j.stats.Stats.sleep(Stats.java:110)
- locked <0x00000000a09fa918> (a java.lang.Object)
at com.github.myzhan.locust4j.stats.Stats.run(Stats.java:159)
at java.util.concurrent.Executors$RunnableAdapter.call(java.base@14.0.2/Executors.java:515)
at java.util.concurrent.FutureTask.run(java.base@14.0.2/FutureTask.java:264)
at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@14.0.2/ThreadPoolExecutor.java:1130)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@14.0.2/ThreadPoolExecutor.java:630)
at java.lang.Thread.run(java.base@14.0.2/Thread.java:832)
Locked ownable synchronizers:
- <0x00000000a0a0e2a8> (a java.util.concurrent.ThreadPoolExecutor$Worker)
"reaper-1" #14 daemon prio=5 os_prio=0 cpu=3.63ms elapsed=381.33s allocated=9552B defined_classes=14 tid=0x00007fcf442de000 nid=0x1c runnable [0x00007fcf2a36b000]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPoll.wait(java.base@14.0.2/Native Method)
at sun.nio.ch.EPollSelectorImpl.doSelect(java.base@14.0.2/EPollSelectorImpl.java:120)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(java.base@14.0.2/SelectorImpl.java:129)
- locked <0x00000000a0c66538> (a sun.nio.ch.Util$2)
- locked <0x00000000a0c663c0> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(java.base@14.0.2/SelectorImpl.java:141)
at zmq.poll.Poller.run(Poller.java:234)
at java.lang.Thread.run(java.base@14.0.2/Thread.java:832)
Locked ownable synchronizers:
- None
"iothread-2" #15 daemon prio=5 os_prio=0 cpu=32.22ms elapsed=381.32s allocated=397K defined_classes=65 tid=0x00007fcf442ed800 nid=0x1d runnable [0x00007fcf2a26a000]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPoll.wait(java.base@14.0.2/Native Method)
at sun.nio.ch.EPollSelectorImpl.doSelect(java.base@14.0.2/EPollSelectorImpl.java:120)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(java.base@14.0.2/SelectorImpl.java:129)
- locked <0x00000000a0c6d448> (a sun.nio.ch.Util$2)
- locked <0x00000000a0c6d2d0> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(java.base@14.0.2/SelectorImpl.java:141)
at zmq.poll.Poller.run(Poller.java:234)
at java.lang.Thread.run(java.base@14.0.2/Thread.java:832)
Locked ownable synchronizers:
- None
"Thread-2receive-from-client" #16 prio=5 os_prio=0 cpu=22.51ms elapsed=381.17s allocated=355K defined_classes=24 tid=0x00007fcf44309800 nid=0x1e runnable [0x00007fcf29f66000]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPoll.wait(java.base@14.0.2/Native Method)
at sun.nio.ch.EPollSelectorImpl.doSelect(java.base@14.0.2/EPollSelectorImpl.java:120)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(java.base@14.0.2/SelectorImpl.java:129)
- locked <0x00000000a0e07828> (a sun.nio.ch.Util$2)
- locked <0x00000000a0e076b0> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(java.base@14.0.2/SelectorImpl.java:141)
at zmq.Signaler.waitEvent(Signaler.java:130)
at zmq.Mailbox.recv(Mailbox.java:90)
at zmq.SocketBase.processCommands(SocketBase.java:910)
at zmq.SocketBase.recv(SocketBase.java:831)
at org.zeromq.ZMQ$Socket.recv(ZMQ.java:2646)
at org.zeromq.ZMQ$Socket.recv(ZMQ.java:2634)
at com.github.myzhan.locust4j.rpc.ZeromqClient.recv(ZeromqClient.java:41)
at com.github.myzhan.locust4j.runtime.Runner$Receiver.run(Runner.java:362)
at java.util.concurrent.Executors$RunnableAdapter.call(java.base@14.0.2/Executors.java:515)
at java.util.concurrent.FutureTask.run(java.base@14.0.2/FutureTask.java:264)
at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@14.0.2/ThreadPoolExecutor.java:1130)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@14.0.2/ThreadPoolExecutor.java:630)
at java.lang.Thread.run(java.base@14.0.2/Thread.java:832)
Locked ownable synchronizers:
- <0x00000000a0e6bf68> (a java.util.concurrent.ThreadPoolExecutor$Worker)
"Attach Listener" #17 daemon prio=9 os_prio=0 cpu=0.39ms elapsed=0.11s allocated=0B defined_classes=0 tid=0x00007fcf14001000 nid=0x4e waiting on condition [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
Locked ownable synchronizers:
- None
"VM Thread" os_prio=0 cpu=119.42ms elapsed=384.19s tid=0x00007fcf4405f000 nid=0xe runnable
"VM Periodic Task Thread" os_prio=0 cpu=272.66ms elapsed=384.07s tid=0x00007fcf440b7800 nid=0x17 waiting on condition
JNI global refs: 14, weak refs: 0
Attaching a capture from an old session: workers.zip
I see. Which version of jeromq are you using? It looks like a jeromq issue, and I have never met it myself. Maybe we can write a workaround in locust4j.
I use the default jeromq 0.4.3; the issue is also reproducible with the latest jeromq. I tested this workaround: start the Receiver thread only after the client_ready message is sent (sketched below). It works for me every time, though I did not have a chance to test it fully.
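For reference, here is roughly what that reordering looks like. This is an illustrative sketch with names modeled on the thread names and stack traces above (RpcClient, getReady, Receiver, Heartbeat), not the exact locust4j internals:

// The point is purely ordering: the main thread completes its send()
// while it is still the socket's only user, and the Receiver/Heartbeat
// threads touch the socket only after the handshake has gone out.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class RunnerSketch {
    private final ExecutorService executor = Executors.newCachedThreadPool();
    private final RpcClient rpcClient; // hypothetical wrapper around the DEALER socket
    private final String nodeID;

    RunnerSketch(RpcClient rpcClient, String nodeID) {
        this.rpcClient = rpcClient;
        this.nodeID = nodeID;
    }

    void getReady() {
        // 1. Main thread is still the only user of the socket here.
        rpcClient.send("client_ready", nodeID);

        // 2. Start the Receiver and Heartbeat loops only after
        //    client_ready went out, so the handshake is never processed
        //    by two threads at once.
        executor.submit(() -> { /* receiver loop: rpcClient.recv() ... */ });
        executor.submit(() -> { /* heartbeat loop: rpcClient.send(...) ... */ });
    }

    interface RpcClient {
        void send(String type, String nodeID);
        byte[] recv();
    }
}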
Also, I checked the zmq docs and found this:
Don’t share ZeroMQ sockets between threads. ZeroMQ sockets are not threadsafe. Technically it’s possible to migrate a socket from one thread to another but it demands skill. The only place where it’s remotely sane to share sockets between threads are in language bindings that need to do magic like garbage collection on sockets.
@moovoo I can't reproduce this issue, but I have committed the workaround, please give it a try.
@myzhan I ran the latest version with both the locust UI and headless modes a couple of dozen times: it is no longer reproducible; all workers register and send heartbeats every time.
@moovoo OK, I will release it later. Thanks for your time.
1.0.14 has been released
locust4j: 1.0.12
master: 1.2 or 1.4 (looks like it doesn't really matter)
scenario:
expected results: the number of available workers equals the number of deployed workers
actual results: sometimes a few workers are missing on the page, while they actually run and consume 100% CPU
It happens from time to time; it looks like it depends on the k8s node performance (too slow or too few cores available), which causes a race condition during startup.
Stack trace of the main thread stuck in that cycle forever (other threads are waiting):
Basically the worker can't send its ready status to the master, even though it actually established a connection.
Speculation: I'm not entirely sure, but my guess is that the race condition happens here:
Probably we need to start the receiver thread after the client_ready cmd, or use some sort of lock on the client (sketched below).
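For the second option, here is a hedged sketch of what a lock on the client could look like (an illustrative class, not the workaround that was actually committed): every socket call goes through one ReentrantLock, and recv uses a timeout so the receiver thread releases the lock periodically instead of holding it inside a blocking recv forever.

import java.util.concurrent.locks.ReentrantLock;
import org.zeromq.ZMQ;

public class LockedZeromqClient {
    private final ZMQ.Socket socket;
    private final ReentrantLock lock = new ReentrantLock();

    public LockedZeromqClient(ZMQ.Socket socket) {
        this.socket = socket;
        // Without a timeout, recv(0) would block while holding the lock
        // and send() could never run.
        this.socket.setReceiveTimeOut(100); // milliseconds
    }

    public void send(byte[] payload) {
        lock.lock();
        try {
            socket.send(payload, 0);
        } finally {
            lock.unlock();
        }
    }

    public byte[] recv() {
        while (!Thread.currentThread().isInterrupted()) {
            lock.lock();
            try {
                byte[] data = socket.recv(0); // returns null on timeout
                if (data != null) {
                    return data;
                }
            } finally {
                lock.unlock();
            }
        }
        return null;
    }
}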