Open olukas opened 3 years ago
Looks like the snapshot is stuck at phase 1
Snaphot 0 is started at 19:47:14,023
, it prints the log from the test too Will save 934 to snapshot
2021-04-08 19:47:14,023 [DEBUG] [hz.zealous_fermat.cached.thread-55] [c.h.j.i.MasterSnapshotContext]: [127.0.0.1]:5701 [jet] [4.5-SNAPSHOT] Starting snapshot 0 for job '0601-006a-1b80-0001', execution 0601-006a-1b81-0001, flags: terminal=no,export=no, writing to: null
2021-04-08 19:47:14,024 [DEBUG] [hz.zealous_fermat.cached.thread-55] [c.h.j.i.e.ExecutionContext]: [127.0.0.1]:5701 [jet] [4.5-SNAPSHOT] Starting snapshot 0 phase 1 for job '0601-006a-1b80-0001', execution 0601-006a-1b81-0001 on member
2021-04-08 19:47:14,024 [DEBUG] [hz.magical_fermat.generic-operation.thread-21] [c.h.j.i.e.ExecutionContext]: [127.0.0.1]:5702 [jet] [4.5-SNAPSHOT] Starting snapshot 0 phase 1 for job '0601-006a-1b80-0001', execution 0601-006a-1b81-0001 on member
Will save 934 to snapshot
At 19:47:14,538
second member declares that it finished phase 1
2021-04-08 19:47:14,538 [DEBUG] [hz.magical_fermat.jet.cooperative.thread-6] [c.h.j.i.o.SnapshotPhase1Operation]: [127.0.0.1]:5702 [jet] [4.5-SNAPSHOT] Snapshot 0 phase 1 for job '0601-006a-1b80-0001', execution 0601-006a-1b81-0001 finished successfully on member
after 10 seconds, assertion times out and test fails. only then we see this log
2021-04-08 19:47:24,688 [ WARN] [hz.zealous_fermat.async.thread-1] [c.h.j.i.o.SnapshotPhase1Operation]: [127.0.0.1]:5701 [jet] [4.5-SNAPSHOT] Snapshot 0 phase 1 for job '0601-006a-1b80-0001', execution 0601-006a-1b81-0001 finished with an error on member: java.util.concurrent.CancellationException: execution cancelled
We discussed this issue with Viliam and our findings are:
Member at 5701 got stuck in Snapshot 0 phase 1. It saved the state of src
and sliding-window-prepare
and then somehow it got stuck. It should save state for sliding-window
and listSink
, but it didn't.
Looking at the snapshot process in 5702, everything looks normal there. Since the source stage is not distributed, it runs only on single member, apparently 5701 in this test. So, snapshot phase 1 for src
and sliding-window-prepare
completed immediately on 5702. Member 5702 saved state only for sliding-window
(the second stage of sliding window) and listSink
and completed the snapshot phase 1 successfully.
Considering this log below, sliding window (second stage of sliding window) at member 5701 run on hz.zealous_fermat.jet.cooperative.thread-4
.
2021-04-08 19:47:14,093 [ INFO] [hz.zealous_fermat.jet.cooperative.thread-4] [c.h.j.i.p.SlidingWindowP]: [127.0.0.1]:5701 [jet] [4.5-SNAPSHOT] [0601-006a-1b80-0001/sliding-window#0] Output to ordinal 0: Watermark{ts=00:00:00.900}
When we checked the thread dump for this thread, we did not encounter any problems. Actually, this thread appears to be idle in the thread dump.
master (commit 4eb9c6bb3aac6b2c827e07c346237f6bc8f5db1b)
Failed on Oracle JDK 11: http://jenkins.hazelcast.com/job/jet-oss-master-oracle-jdk11/370/testReport/junit/com.hazelcast.jet.pipeline/SourceBuilder_TopologyChangeTest/test_restartJob_nodeTerminated/
Stacktrace:
Standard output: