Snapshot Phase 2 may get stuck

mtopolnik commented 4 years ago

While solving a test problem in #2454, we realized the underlying cause was a snapshot stuck in phase 2. System load seems to have been light immediately before and after the getting stuck event.

viliam-durina commented 4 years ago

some logs:

2020-08-24 01:57:48,264 [DEBUG] [stressTest_withRestart_graceful] [c.h.j.Job]: [127.0.0.1]:5701 [jet] [4.3-SNAPSHOT] Sending RESTART_GRACEFUL request for job 04db-caac-b042-0001 (name ??)
2020-08-24 01:57:48,264 [DEBUG] [hz.stoic_hamilton.cached.thread-3] [c.h.j.i.MasterSnapshotContext]: [127.0.0.1]:5701 [jet] [4.3-SNAPSHOT] Not beginning snapshot since one is already in progress job '04db-caac-b042-0001', execution 04db-caac-b047-0001
2020-08-24 01:57:48,265 [DEBUG] [hz.stoic_hamilton.jet.cooperative.thread-47] [c.h.j.i.o.SnapshotPhase1Operation]: [127.0.0.1]:5701 [jet] [4.3-SNAPSHOT] Snapshot 8 phase 1 for job '04db-caac-b042-0001', execution 04db-caac-b047-0001 finished successfully on member
2020-08-24 01:57:48,277 [DEBUG] [hz.stoic_hamilton.cached.thread-8] [c.h.j.i.MasterSnapshotContext]: [127.0.0.1]:5701 [jet] [4.3-SNAPSHOT] Snapshot 8 phase 1 for job '04db-caac-b042-0001', execution 04db-caac-b047-0001 completed with status SUCCESS in 116ms, 12,612 bytes, 271 keys in 168 chunks, stored in '__jet.snapshot.04db-caac-b042-0001.0', proceeding to phase 2
2020-08-24 01:57:48,292 [DEBUG] [hz.stoic_hamilton.cached.thread-8] [c.h.j.i.JobRepository]: [127.0.0.1]:5701 [jet] [4.3-SNAPSHOT] Cleared snapshot data map __jet.snapshot.04db-caac-b042-0001.1
2020-08-24 01:57:48,293 [DEBUG] [hz.stoic_hamilton.cached.thread-8] [c.h.j.i.e.ExecutionContext]: [127.0.0.1]:5701 [jet] [4.3-SNAPSHOT] Starting snapshot 8 phase 2 for job '04db-caac-b042-0001', execution 04db-caac-b047-0001 on member
2020-08-24 01:57:51,778 [DEBUG] [hz.stoic_hamilton.cached.thread-6] [c.h.j.i.JobRepository]: [127.0.0.1]:5701 [jet] [4.3-SNAPSHOT] Job cleanup took 5ms
2020-08-24 01:57:53,928 [DEBUG] [Time-limited test] [c.h.j.Job]: [127.0.0.1]:5701 [jet] [4.3-SNAPSHOT] Sending CANCEL_FORCEFUL request for job 04db-caac-b042-0001 (name ??)
2020-08-24 01:57:53,929 [ WARN] [hz.stoic_hamilton.cached.thread-3] [c.h.j.i.JobCoordinationService]: [127.0.0.1]:5701 [jet] [4.3-SNAPSHOT] null
java.lang.IllegalStateException: Cannot CANCEL_FORCEFUL: Job is already terminating in mode: RESTART_GRACEFUL

The CANCEL_FORCEFUL was initiated in SimpleTestInClusterSupport.supportAfter() that calls ditchJob. The test actually passed even though the snapshot phase 2 was stuck (at least it seems so from the logs). It attempted to cancel the job 10 times with 500ms delay, the job was still already terminating in mode: RESTART_GRACEFUL

viliam-durina commented 4 years ago

Occurred again in #2473. In both cases the logs are as follows:

- Sending RESTART_GRACEFUL request
- Not beginning snapshot since one is already in progress
- Snapshot N phase 1 finished successfully on member
- Starting snapshot 8 phase 2 on member
(here it gets stuck)

viliam-durina commented 3 years ago

Possibly resolved by #2824. Closing for now, we can reopen if it ever occurs again.

hazelcast / hazelcast-jet

Snapshot Phase 2 may get stuck #2456