I see no race condition from the thread stack. If some workers are missing on the web page, you can check the logs of the locust master; maybe the workers don't send heartbeat requests as fast as expected because of the k8s node performance, and then they are removed by the master.
I checked the master; it does not consider a worker alive because the problematic workers never send 'client_ready' or 'heartbeat' messages. They never send them because the worker's main thread is stuck in a loop inside this.rpcClient.send(new Message("client_ready", null, this.nodeID)); more specifically in zmq.Signaler.recv (see how much CPU time it has accumulated there).
Here are some captures. Good situation:
00000000 ff 00 00 00 00 00 00 00 01 7f ........ ..
00000000 ff 00 00 00 00 00 00 00 40 7f 03 ........ @..
0000000A 03 01 4e 55 4c 4c 00 00 00 00 00 00 00 00 00 00 ..NULL.. ........
0000001A 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ........ ........
0000002A 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ........ ........
0000003A 00 00 00 00 00 00 ......
0000000B 00 4e 55 4c 4c 00 00 00 00 00 00 00 00 00 00 00 .NULL... ........
0000001B 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ........ ........
0000002B 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ........ ........
0000003B 00 00 00 00 00 .....
00000040 04 29 05 52 45 41 44 59 0b 53 6f 63 6b 65 74 2d .).READY .Socket-
00000050 54 79 70 65 00 00 00 06 52 4f 55 54 45 52 08 49 Type.... ROUTER.I
00000060 64 65 6e 74 69 74 79 00 00 00 00 dentity. ...
00000040 04 68 05 52 45 41 44 59 0b 53 6f 63 6b 65 74 2d .h.READY .Socket-
00000050 54 79 70 65 00 00 00 06 44 45 41 4c 45 52 08 49 Type.... DEALER.I
00000060 64 65 6e 74 69 74 79 00 00 00 3f 6c 6f 63 75 73 dentity. ..?locus
00000070 74 2d 77 6f 72 6b 65 72 2d 36 37 36 38 39 36 39 t-worker -6768969
00000080 63 62 37 2d 74 36 32 70 6b 5f 38 63 39 30 61 39 cb7-t62p k_8c90a9
00000090 39 33 31 65 64 34 36 37 62 62 61 61 65 62 31 66 931ed467 bbaaeb1f
000000A0 37 33 38 62 36 38 62 30 34 36 738b68b0 46
000000AA 00 50 93 ac 63 6c 69 65 6e 74 5f 72 65 61 64 79 .P..clie nt_ready
000000BA c0 d9 3f 6c 6f 63 75 73 74 2d 77 6f 72 6b 65 72 ..?locus t-worker
000000CA 2d 36 37 36 38 39 36 39 63 62 37 2d 74 36 32 70 -6768969 cb7-t62p
000000DA 6b 5f 38 63 39 30 61 39 39 33 31 65 64 34 36 37 k_8c90a9 931ed467
000000EA 62 62 61 61 65 62 31 66 37 33 38 62 36 38 62 30 bbaaeb1f 738b68b0
000000FA 34 36 46
000000FC 00 74 93 a9 68 65 61 72 74 62 65 61 74 82 a5 73 .t..hear tbeat..s
0000010C 74 61 74 65 a5 72 65 61 64 79 b1 63 75 72 72 65 tate.rea dy.curre
0000011C 6e 74 5f 63 70 75 5f 75 73 61 67 65 cb 00 00 00 nt_cpu_u sage....
0000012C 00 00 00 00 00 d9 3f 6c 6f 63 75 73 74 2d 77 6f ......?l ocust-wo
0000013C 72 6b 65 72 2d 36 37 36 38 39 36 39 63 62 37 2d rker-676 8969cb7-
0000014C 74 36 32 70 6b 5f 38 63 39 30 61 39 39 33 31 65 t62pk_8c 90a9931e
0000015C 64 34 36 37 62 62 61 61 65 62 31 66 37 33 38 62 d467bbaa eb1f738b
0000016C 36 38 62 30 34 36 68b046
00000172 00 74 93 a9 68 65 61 72 74 62 65 61 74 82 a5 73 .t..hear tbeat..s
00000182 74 61 74 65 a5 72 65 61 64 79 b1 63 75 72 72 65 tate.rea dy.curre
00000192 6e 74 5f 63 70 75 5f 75 73 61 67 65 cb 40 10 29 nt_cpu_u sage.@.)
000001A2 5f ad 40 a5 7f d9 3f 6c 6f 63 75 73 74 2d 77 6f _.@...?l ocust-wo
000001B2 72 6b 65 72 2d 36 37 36 38 39 36 39 63 62 37 2d rker-676 8969cb7-
000001C2 74 36 32 70 6b 5f 38 63 39 30 61 39 39 33 31 65 t62pk_8c 90a9931e
000001D2 64 34 36 37 62 62 61 61 65 62 31 66 37 33 38 62 d467bbaa eb1f738b
000001E2 36 38 62 30 34 36 68b046
They establish a connection (ZMTP signature, version, NULL mechanism), exchange identities via the READY command (ROUTER/DEALER), and the worker starts sending heartbeats from that point on.
Bad situation:
00000000 ff 00 00 00 00 00 00 00 01 7f ........ ..
00000000 ff 00 00 00 00 00 00 00 40 7f 03 ........ @..
0000000A 03 01 4e 55 4c 4c 00 00 00 00 00 00 00 00 00 00 ..NULL.. ........
0000001A 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ........ ........
0000002A 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ........ ........
0000003A 00 00 00 00 00 00 ......
0000000B 00 4e 55 4c 4c 00 00 00 00 00 00 00 00 00 00 00 .NULL... ........
0000001B 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ........ ........
0000002B 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ........ ........
0000003B 00 00 00 00 00 .....
00000040 04 29 05 52 45 41 44 59 0b 53 6f 63 6b 65 74 2d .).READY .Socket-
00000050 54 79 70 65 00 00 00 06 52 4f 55 54 45 52 08 49 Type.... ROUTER.I
00000060 64 65 6e 74 69 74 79 00 00 00 00 dentity. ...
00000040 04 68 05 52 45 41 44 59 0b 53 6f 63 6b 65 74 2d .h.READY .Socket-
00000050 54 79 70 65 00 00 00 06 44 45 41 4c 45 52 08 49 Type.... DEALER.I
00000060 64 65 6e 74 69 74 79 00 00 00 3f 6c 6f 63 75 73 dentity. ..?locus
00000070 74 2d 77 6f 72 6b 65 72 2d 36 37 36 38 39 36 39 t-worker -6768969
00000080 63 62 37 2d 63 36 6a 71 74 5f 39 36 65 35 30 36 cb7-c6jq t_96e506
00000090 36 35 38 36 66 36 30 32 37 37 30 36 66 64 64 63 6586f602 7706fddc
000000A0 37 66 30 65 39 30 62 63 32 33 7f0e90bc 23
and that's it: the worker never sends client_ready or heartbeat.
My guess is that the server's message with Socket-Type ROUTER / Identity gets processed by two threads simultaneously: the Receiver thread and the main thread. Usually this never happens on fast enough machines, because the Receiver thread manages to process it before the main thread falls into the zmq.SocketBase.processCommands check. But on a busy machine the main thread might accidentally go into that check and find there is something to read from the socket, while the Receiver thread flushes all the buffers in the meantime.
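If that guess is right, the problematic pattern boils down to one DEALER socket used by two threads at once. Here is a minimal self-contained sketch of that pattern (the endpoint and identity are made up for illustration; this is not the actual locust4j code):

// Sketch of the suspect pattern: a shared DEALER socket. ZeroMQ sockets
// are not thread-safe; both send() and recv() run
// zmq.SocketBase.processCommands(), so the two threads can race on the
// socket's internal command mailbox, and one of them may block forever.
import org.zeromq.ZMQ;

public class SharedSocketRace {
    public static void main(String[] args) {
        ZMQ.Context context = ZMQ.context(1);
        final ZMQ.Socket socket = context.socket(ZMQ.DEALER);
        socket.setIdentity("worker-1".getBytes());
        socket.connect("tcp://locust-master:5557");

        // Receiver thread: blocks in recv(), which also drains the
        // command mailbox via processCommands().
        new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                socket.recv(0); // unsafe: concurrent use of the socket
            }
        }, "receive-from-client").start();

        // Main thread: send() on the same socket may enter
        // processCommands() at the same moment and get stuck in
        // zmq.Signaler.recv, waiting on a signal the receiver consumed.
        socket.send("client_ready".getBytes(), 0);
    }
}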
When you reproduce this issue, are you sure the locust master is reachable? I think a dealer socket connects to the master synchronously, and that would block the main thread from sending client_ready. I have also added some debug logging in https://github.com/myzhan/locust4j/commit/166e152a45dcd118648c06c80d3d7e80de8e5235 ; you can pull the master branch and build it.
BTW, it would be better if you could provide a pcap file.
It always gets past the connection point, so I see the 'connected to ...' message in all cases; no connectivity issues.
I patched the jeromq lib a bit around processCommands with additional logging.
Good case:
2021-02-26 11:17:19.768 [main] DEBUG kek - hello world
2021-02-26 11:17:22.176 [main] DEBUG Socketbase - Entering processCommands
2021-02-26 11:17:22.177 [main] DEBUG Socketbase - exiting processCommands
2021-02-26 11:17:22.258 [main] DEBUG com.github.myzhan.locust4j.rpc.ZeromqClient - Locust4j is connected to master(locust-master:5557)
2021-02-26 11:17:22.267 [Thread-2receive-from-client] DEBUG com.github.myzhan.locust4j.runtime.Runner - Receiver started
2021-02-26 11:17:22.269 [Thread-2receive-from-client] DEBUG Socketbase - Entering processCommands
2021-02-26 11:17:22.306 [main] DEBUG Socketbase - Entering processCommands
2021-02-26 11:17:22.307 [main] DEBUG Socketbase - exiting processCommands
2021-02-26 11:17:22.456 [Thread-2receive-from-client] DEBUG Socketbase - Processing cmd Cmd[Dealer[1], 3, OWN, SessionBase-Dealer[1]]
2021-02-26 11:17:22.457 [Thread-2receive-from-client] DEBUG Socketbase - exiting processCommands
2021-02-26 11:17:22.458 [Thread-2receive-from-client] DEBUG Socketbase - Entering processCommands
2021-02-26 11:17:23.495 [Thread-4heartbeat] DEBUG Socketbase - Entering processCommands
2021-02-26 11:17:23.497 [Thread-4heartbeat] DEBUG Socketbase - exiting processCommands
2021-02-26 11:17:24.507 [Thread-4heartbeat] DEBUG Socketbase - Entering processCommands
2021-02-26 11:17:24.507 [Thread-4heartbeat] DEBUG Socketbase - exiting processCommands
2021-02-26 11:17:25.515 [Thread-4heartbeat] DEBUG Socketbase - Entering processCommands
2021-02-26 11:17:25.515 [Thread-4heartbeat] DEBUG Socketbase - exiting processCommands
2021-02-26 11:17:26.518 [Thread-4heartbeat] DEBUG Socketbase - Entering processCommands
2021-02-26 11:17:26.518 [Thread-4heartbeat] DEBUG Socketbase - exiting processCommands
2021-02-26 11:17:27.522 [Thread-4heartbeat] DEBUG Socketbase - Entering processCommands
2021-02-26 11:17:27.522 [Thread-4heartbeat] DEBUG Socketbase - exiting processCommands
2021-02-26 11:17:28.535 [Thread-4heartbeat] DEBUG Socketbase - Entering processCommands
2021-02-26 11:17:28.536 [Thread-4heartbeat] DEBUG Socketbase - exiting processCommands
2021-02-26 11:17:29.538 [Thread-4heartbeat] DEBUG Socketbase - Entering processCommands
2021-02-26 11:17:29.538 [Thread-4heartbeat] DEBUG Socketbase - exiting processCommands
...
Bad case:
2021-02-26 11:17:19.742 [main] DEBUG kek - hello world
2021-02-26 11:17:22.277 [main] DEBUG Socketbase - Entering processCommands
2021-02-26 11:17:22.277 [main] DEBUG Socketbase - exiting processCommands
2021-02-26 11:17:22.326 [main] DEBUG com.github.myzhan.locust4j.rpc.ZeromqClient - Locust4j is connected to master(locust-master:5557)
2021-02-26 11:17:22.427 [Thread-2receive-from-client] DEBUG com.github.myzhan.locust4j.runtime.Runner - Receiver started
2021-02-26 11:17:22.472 [Thread-2receive-from-client] DEBUG Socketbase - Entering processCommands
2021-02-26 11:17:22.471 [main] DEBUG Socketbase - Entering processCommands
2021-02-26 11:17:22.547 [Thread-2receive-from-client] DEBUG Socketbase - Processing cmd Cmd[Dealer[1], 3, OWN, SessionBase-Dealer[1]]
2021-02-26 11:17:22.548 [Thread-2receive-from-client] DEBUG Socketbase - exiting processCommands
2021-02-26 11:17:22.548 [Thread-2receive-from-client] DEBUG Socketbase - Entering processCommands
The main and receiver threads enter processCommands almost simultaneously, but sometimes the main thread is not lucky enough.
Here is the full stack trace for all threads:
# jstack -l -e 1
2021-02-26 11:23:43
Full thread dump OpenJDK 64-Bit Server VM (14.0.2+12 mixed mode, sharing):
Threads class SMR info:
_java_thread_list=0x00007fcf14002880, length=17, elements={
0x00007fcf44028800, 0x00007fcf44062000, 0x00007fcf44064000, 0x00007fcf4406b000,
0x00007fcf4406d000, 0x00007fcf4406f000, 0x00007fcf44071000, 0x00007fcf44073000,
0x00007fcf440b5800, 0x00007fcf440be800, 0x00007fcf442b4000, 0x00007fcf442bb800,
0x00007fcf442bd000, 0x00007fcf442de000, 0x00007fcf442ed800, 0x00007fcf44309800,
0x00007fcf14001000
}
"main" #1 prio=5 os_prio=0 cpu=349343.71ms elapsed=384.21s allocated=13869K defined_classes=1083 tid=0x00007fcf44028800 nid=0xd runnable [0x00007fcf49970000]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.FileDispatcherImpl.read0(java.base@14.0.2/Native Method)
at sun.nio.ch.FileDispatcherImpl.read(java.base@14.0.2/FileDispatcherImpl.java:48)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(java.base@14.0.2/IOUtil.java:276)
at sun.nio.ch.IOUtil.read(java.base@14.0.2/IOUtil.java:245)
at sun.nio.ch.IOUtil.read(java.base@14.0.2/IOUtil.java:223)
at sun.nio.ch.SourceChannelImpl.read(java.base@14.0.2/SourceChannelImpl.java:277)
at zmq.Signaler.recv(Signaler.java:159)
at zmq.Mailbox.recv(Mailbox.java:97)
at zmq.SocketBase.processCommands(SocketBase.java:935)
at zmq.SocketBase.send(SocketBase.java:699)
at org.zeromq.ZMQ$Socket.send(ZMQ.java:2588)
at org.zeromq.ZMQ$Socket.send(ZMQ.java:2577)
at com.github.myzhan.locust4j.rpc.ZeromqClient.send(ZeromqClient.java:48)
at com.github.myzhan.locust4j.runtime.Runner.getReady(Runner.java:340)
at com.github.myzhan.locust4j.Locust.run(Locust.java:183)
- locked <0x00000000a0144b10> (a com.github.myzhan.locust4j.Locust)
at com.github.myzhan.locust4j.Locust.run(Locust.java:158)
at com.sampleworker.App.main(App.java:84)
Locked ownable synchronizers:
- <0x00000000a0e07580> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
"Reference Handler" #2 daemon prio=10 os_prio=0 cpu=0.18ms elapsed=384.18s allocated=0B defined_classes=0 tid=0x00007fcf44062000 nid=0xf waiting on condition [0x00007fcf2bbe1000]
java.lang.Thread.State: RUNNABLE
at java.lang.ref.Reference.waitForReferencePendingList(java.base@14.0.2/Native Method)
at java.lang.ref.Reference.processPendingReferences(java.base@14.0.2/Reference.java:241)
at java.lang.ref.Reference$ReferenceHandler.run(java.base@14.0.2/Reference.java:213)
Locked ownable synchronizers:
- None
"Finalizer" #3 daemon prio=8 os_prio=0 cpu=0.25ms elapsed=384.18s allocated=0B defined_classes=0 tid=0x00007fcf44064000 nid=0x10 in Object.wait() [0x00007fcf2bae0000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(java.base@14.0.2/Native Method)
- waiting on <0x000000009fe0b228> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(java.base@14.0.2/ReferenceQueue.java:155)
- locked <0x000000009fe0b228> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(java.base@14.0.2/ReferenceQueue.java:176)
at java.lang.ref.Finalizer$FinalizerThread.run(java.base@14.0.2/Finalizer.java:170)
Locked ownable synchronizers:
- None
"Signal Dispatcher" #4 daemon prio=9 os_prio=0 cpu=0.48ms elapsed=384.17s allocated=528B defined_classes=0 tid=0x00007fcf4406b000 nid=0x11 runnable [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
Locked ownable synchronizers:
- None
"Service Thread" #5 daemon prio=9 os_prio=0 cpu=0.19ms elapsed=384.17s allocated=0B defined_classes=0 tid=0x00007fcf4406d000 nid=0x12 runnable [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
Locked ownable synchronizers:
- None
"C2 CompilerThread0" #6 daemon prio=9 os_prio=0 cpu=474.39ms elapsed=384.17s allocated=0B defined_classes=0 tid=0x00007fcf4406f000 nid=0x13 waiting on condition [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
No compile task
Locked ownable synchronizers:
- None
"C1 CompilerThread0" #7 daemon prio=9 os_prio=0 cpu=511.58ms elapsed=384.17s allocated=7768B defined_classes=0 tid=0x00007fcf44071000 nid=0x14 waiting on condition [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
No compile task
Locked ownable synchronizers:
- None
"Sweeper thread" #8 daemon prio=9 os_prio=0 cpu=8.89ms elapsed=384.16s allocated=0B defined_classes=0 tid=0x00007fcf44073000 nid=0x15 runnable [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
Locked ownable synchronizers:
- None
"Notification Thread" #9 daemon prio=9 os_prio=0 cpu=0.10ms elapsed=384.07s allocated=0B defined_classes=0 tid=0x00007fcf440b5800 nid=0x16 runnable [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
Locked ownable synchronizers:
- None
"Common-Cleaner" #10 daemon prio=8 os_prio=0 cpu=0.73ms elapsed=384.06s allocated=0B defined_classes=0 tid=0x00007fcf440be800 nid=0x18 in Object.wait() [0x00007fcf2aff1000]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(java.base@14.0.2/Native Method)
- waiting on <0x000000009ff15a18> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(java.base@14.0.2/ReferenceQueue.java:155)
- locked <0x000000009ff15a18> (a java.lang.ref.ReferenceQueue$Lock)
at jdk.internal.ref.CleanerImpl.run(java.base@14.0.2/CleanerImpl.java:148)
at java.lang.Thread.run(java.base@14.0.2/Thread.java:832)
at jdk.internal.misc.InnocuousThread.run(java.base@14.0.2/InnocuousThread.java:134)
Locked ownable synchronizers:
- None
"HttpClient-1-SelectorManager" #11 daemon prio=5 os_prio=0 cpu=16.06ms elapsed=381.44s allocated=19248B defined_classes=3 tid=0x00007fcf442b4000 nid=0x19 runnable [0x00007fcf2a66e000]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPoll.wait(java.base@14.0.2/Native Method)
at sun.nio.ch.EPollSelectorImpl.doSelect(java.base@14.0.2/EPollSelectorImpl.java:120)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(java.base@14.0.2/SelectorImpl.java:129)
- locked <0x00000000a09eb2a0> (a sun.nio.ch.Util$2)
- locked <0x00000000a09eb020> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(java.base@14.0.2/SelectorImpl.java:141)
at jdk.internal.net.http.HttpClientImpl$SelectorManager.run(java.net.http@14.0.2/HttpClientImpl.java:878)
Locked ownable synchronizers:
- None
"locust4j-stats#0#stats-timer" #12 prio=5 os_prio=0 cpu=16.04ms elapsed=381.43s allocated=17720B defined_classes=1 tid=0x00007fcf442bb800 nid=0x1a waiting on condition [0x00007fcf2a56d000]
java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(java.base@14.0.2/Native Method)
at com.github.myzhan.locust4j.stats.Stats$StatsTimer.run(Stats.java:256)
at java.util.concurrent.Executors$RunnableAdapter.call(java.base@14.0.2/Executors.java:515)
at java.util.concurrent.FutureTask.run(java.base@14.0.2/FutureTask.java:264)
at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@14.0.2/ThreadPoolExecutor.java:1130)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@14.0.2/ThreadPoolExecutor.java:630)
at java.lang.Thread.run(java.base@14.0.2/Thread.java:832)
Locked ownable synchronizers:
- <0x00000000a0a0ad38> (a java.util.concurrent.ThreadPoolExecutor$Worker)
"locust4j-stats#1#stats" #13 prio=5 os_prio=0 cpu=15.67ms elapsed=381.42s allocated=168K defined_classes=1 tid=0x00007fcf442bd000 nid=0x1b in Object.wait() [0x00007fcf2a46c000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(java.base@14.0.2/Native Method)
- waiting on <0x00000000a09fa918> (a java.lang.Object)
at java.lang.Object.wait(java.base@14.0.2/Object.java:321)
at com.github.myzhan.locust4j.stats.Stats.sleep(Stats.java:110)
- locked <0x00000000a09fa918> (a java.lang.Object)
at com.github.myzhan.locust4j.stats.Stats.run(Stats.java:159)
at java.util.concurrent.Executors$RunnableAdapter.call(java.base@14.0.2/Executors.java:515)
at java.util.concurrent.FutureTask.run(java.base@14.0.2/FutureTask.java:264)
at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@14.0.2/ThreadPoolExecutor.java:1130)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@14.0.2/ThreadPoolExecutor.java:630)
at java.lang.Thread.run(java.base@14.0.2/Thread.java:832)
Locked ownable synchronizers:
- <0x00000000a0a0e2a8> (a java.util.concurrent.ThreadPoolExecutor$Worker)
"reaper-1" #14 daemon prio=5 os_prio=0 cpu=3.63ms elapsed=381.33s allocated=9552B defined_classes=14 tid=0x00007fcf442de000 nid=0x1c runnable [0x00007fcf2a36b000]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPoll.wait(java.base@14.0.2/Native Method)
at sun.nio.ch.EPollSelectorImpl.doSelect(java.base@14.0.2/EPollSelectorImpl.java:120)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(java.base@14.0.2/SelectorImpl.java:129)
- locked <0x00000000a0c66538> (a sun.nio.ch.Util$2)
- locked <0x00000000a0c663c0> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(java.base@14.0.2/SelectorImpl.java:141)
at zmq.poll.Poller.run(Poller.java:234)
at java.lang.Thread.run(java.base@14.0.2/Thread.java:832)
Locked ownable synchronizers:
- None
"iothread-2" #15 daemon prio=5 os_prio=0 cpu=32.22ms elapsed=381.32s allocated=397K defined_classes=65 tid=0x00007fcf442ed800 nid=0x1d runnable [0x00007fcf2a26a000]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPoll.wait(java.base@14.0.2/Native Method)
at sun.nio.ch.EPollSelectorImpl.doSelect(java.base@14.0.2/EPollSelectorImpl.java:120)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(java.base@14.0.2/SelectorImpl.java:129)
- locked <0x00000000a0c6d448> (a sun.nio.ch.Util$2)
- locked <0x00000000a0c6d2d0> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(java.base@14.0.2/SelectorImpl.java:141)
at zmq.poll.Poller.run(Poller.java:234)
at java.lang.Thread.run(java.base@14.0.2/Thread.java:832)
Locked ownable synchronizers:
- None
"Thread-2receive-from-client" #16 prio=5 os_prio=0 cpu=22.51ms elapsed=381.17s allocated=355K defined_classes=24 tid=0x00007fcf44309800 nid=0x1e runnable [0x00007fcf29f66000]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPoll.wait(java.base@14.0.2/Native Method)
at sun.nio.ch.EPollSelectorImpl.doSelect(java.base@14.0.2/EPollSelectorImpl.java:120)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(java.base@14.0.2/SelectorImpl.java:129)
- locked <0x00000000a0e07828> (a sun.nio.ch.Util$2)
- locked <0x00000000a0e076b0> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(java.base@14.0.2/SelectorImpl.java:141)
at zmq.Signaler.waitEvent(Signaler.java:130)
at zmq.Mailbox.recv(Mailbox.java:90)
at zmq.SocketBase.processCommands(SocketBase.java:910)
at zmq.SocketBase.recv(SocketBase.java:831)
at org.zeromq.ZMQ$Socket.recv(ZMQ.java:2646)
at org.zeromq.ZMQ$Socket.recv(ZMQ.java:2634)
at com.github.myzhan.locust4j.rpc.ZeromqClient.recv(ZeromqClient.java:41)
at com.github.myzhan.locust4j.runtime.Runner$Receiver.run(Runner.java:362)
at java.util.concurrent.Executors$RunnableAdapter.call(java.base@14.0.2/Executors.java:515)
at java.util.concurrent.FutureTask.run(java.base@14.0.2/FutureTask.java:264)
at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@14.0.2/ThreadPoolExecutor.java:1130)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@14.0.2/ThreadPoolExecutor.java:630)
at java.lang.Thread.run(java.base@14.0.2/Thread.java:832)
Locked ownable synchronizers:
- <0x00000000a0e6bf68> (a java.util.concurrent.ThreadPoolExecutor$Worker)
"Attach Listener" #17 daemon prio=9 os_prio=0 cpu=0.39ms elapsed=0.11s allocated=0B defined_classes=0 tid=0x00007fcf14001000 nid=0x4e waiting on condition [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
Locked ownable synchronizers:
- None
"VM Thread" os_prio=0 cpu=119.42ms elapsed=384.19s tid=0x00007fcf4405f000 nid=0xe runnable
"VM Periodic Task Thread" os_prio=0 cpu=272.66ms elapsed=384.07s tid=0x00007fcf440b7800 nid=0x17 waiting on condition
JNI global refs: 14, weak refs: 0
Attaching a capture from an old session: workers.zip
I see. Which version of jeromq are you using? It looks like a jeromq issue, and I have never met it myself. Maybe we can write a workaround in locust4j.
I use the default jeromq 0.4.3; the issue is also reproducible with the latest jeromq. I tested this workaround: start the Receiver thread only after the client_ready message is sent (sketched below). It works for me every time, though I did not have a chance to test it fully.
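For reference, here is roughly what that reordering looks like. This is an illustrative sketch with names modeled on the thread names and stack traces above (RpcClient, getReady, Receiver, Heartbeat), not the exact locust4j internals:

// The point is purely ordering: the main thread completes its send()
// while it is still the socket's only user, and the Receiver/Heartbeat
// threads touch the socket only after the handshake has gone out.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class RunnerSketch {
    private final ExecutorService executor = Executors.newCachedThreadPool();
    private final RpcClient rpcClient; // hypothetical wrapper around the DEALER socket
    private final String nodeID;

    RunnerSketch(RpcClient rpcClient, String nodeID) {
        this.rpcClient = rpcClient;
        this.nodeID = nodeID;
    }

    void getReady() {
        // 1. Main thread is still the only user of the socket here.
        rpcClient.send("client_ready", nodeID);

        // 2. Start the Receiver and Heartbeat loops only after
        //    client_ready went out, so the handshake is never processed
        //    by two threads at once.
        executor.submit(() -> { /* receiver loop: rpcClient.recv() ... */ });
        executor.submit(() -> { /* heartbeat loop: rpcClient.send(...) ... */ });
    }

    interface RpcClient {
        void send(String type, String nodeID);
        byte[] recv();
    }
}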
Also, I checked the zmq docs and found this:
Don’t share ZeroMQ sockets between threads. ZeroMQ sockets are not threadsafe. Technically it’s possible to migrate a socket from one thread to another but it demands skill. The only place where it’s remotely sane to share sockets between threads are in language bindings that need to do magic like garbage collection on sockets.
@moovoo I can't reproduce this issue, but I have committed the workaround, please give it a try.
@myzhan I ran the latest version with both the locust UI and headless modes a couple of dozen times: it is no longer reproducible; all workers register and send heartbeats every time.
@moovoo OK, I will release it later. Thanks for your time.
1.0.14 has been released
locust4j: 1.0.12
master: 1.2 or 1.4 (looks like it doesn't really matter)
scenario:
expected results: the number of available workers equals the number of deployed workers
actual results: sometimes a few workers are missing on the page, while they actually run and consume 100% CPU
It happens from time to time; it looks like it depends on the k8s node performance (too slow or too few cores available), which causes a race condition during startup.
Stack trace of the main thread stuck in that cycle forever (other threads are waiting):
Basically the worker can't send its ready status to the master, even though it actually established a connection.
Speculation: I'm not entirely sure, but my guess is that the race condition happens here:
Probably we need to start the receiver thread after the client_ready cmd, or use some sort of lock on the client (sketched below).
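For the second option, here is a hedged sketch of what a lock on the client could look like (an illustrative class, not the workaround that was actually committed): every socket call goes through one ReentrantLock, and recv uses a timeout so the receiver thread releases the lock periodically instead of holding it inside a blocking recv forever.

import java.util.concurrent.locks.ReentrantLock;
import org.zeromq.ZMQ;

public class LockedZeromqClient {
    private final ZMQ.Socket socket;
    private final ReentrantLock lock = new ReentrantLock();

    public LockedZeromqClient(ZMQ.Socket socket) {
        this.socket = socket;
        // Without a timeout, recv(0) would block while holding the lock
        // and send() could never run.
        this.socket.setReceiveTimeOut(100); // milliseconds
    }

    public void send(byte[] payload) {
        lock.lock();
        try {
            socket.send(payload, 0);
        } finally {
            lock.unlock();
        }
    }

    public byte[] recv() {
        while (!Thread.currentThread().isInterrupted()) {
            lock.lock();
            try {
                byte[] data = socket.recv(0); // returns null on timeout
                if (data != null) {
                    return data;
                }
            } finally {
                lock.unlock();
            }
        }
        return null;
    }
}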