hanfei1991 / microcosm

a mini bench experiment for a task runtime scheduler

jobmanager(engine): clean tombstone worker before failover #375

Closed amyangfei closed 2 years ago

amyangfei commented 2 years ago

Fix the second bug in https://github.com/hanfei1991/microcosm/issues/357

When a server master leader restarts, it loads all workers into the worker manager. Once the master is ready, the job manager should clean up tombstone workers before re-creating them. Otherwise, duplicate entries end up in the worker manager and it panics.
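The ordering described above can be illustrated with a minimal, self-contained sketch. It is not the code from this PR: `workerManager`, `loadFromMeta`, `cleanTombstone`, and `failover` are hypothetical names, and the duplicate check returns an error instead of panicking so the example stays runnable. It assumes that entries loaded on leader restart are marked as tombstones until the worker is confirmed alive.

```go
package main

import "fmt"

// workerEntry is a hypothetical stand-in for an entry tracked by the worker manager.
type workerEntry struct {
	id        string
	tombstone bool // assumed: set for entries loaded from metadata after a restart
}

// workerManager is a simplified, hypothetical model of the real WorkerManager.
type workerManager struct {
	entries map[string]*workerEntry
}

func newWorkerManager() *workerManager {
	return &workerManager{entries: make(map[string]*workerEntry)}
}

// loadFromMeta models the leader restart: every persisted worker is loaded
// into the manager as a tombstone entry.
func (m *workerManager) loadFromMeta(ids ...string) {
	for _, id := range ids {
		m.entries[id] = &workerEntry{id: id, tombstone: true}
	}
}

// beforeStartingWorker models the duplicate check behind the
// "worker already exists" message; it returns an error rather than panicking.
func (m *workerManager) beforeStartingWorker(id string) error {
	if _, ok := m.entries[id]; ok {
		return fmt.Errorf("worker already exists: %s", id)
	}
	m.entries[id] = &workerEntry{id: id}
	return nil
}

// cleanTombstone removes the entry only if it is a tombstone and reports
// whether anything was removed. Live workers are left untouched.
func (m *workerManager) cleanTombstone(id string) bool {
	e, ok := m.entries[id]
	if !ok || !e.tombstone {
		return false
	}
	delete(m.entries, id)
	return true
}

// failover re-creates a worker, cleaning its tombstone entry first.
func (m *workerManager) failover(id string) error {
	m.cleanTombstone(id)
	return m.beforeStartingWorker(id)
}

func main() {
	m := newWorkerManager()
	m.loadFromMeta("job-1")

	// Re-creating without cleanup hits the duplicate check.
	fmt.Println("without cleanup:", m.beforeStartingWorker("job-1"))

	// Cleaning the tombstone first lets the re-creation succeed.
	fmt.Println("with cleanup:", m.failover("job-1"))
}
```

Removing only entries flagged as tombstones keeps the duplicate check meaningful for workers that are still alive, which is why the sketch does not simply overwrite the existing entry.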

This bug also shows up in the existing e2e test:

2022-05-10T04:20:12.5745042Z server-master-1_1 | 2022-05-10T04:19:53.284644541Z [2022/05/10 04:19:53.284 +00:00] [INFO] [job_fsm.go:208] ["tombstone job master doesn't receive heartbeat in time, recreate it"] [job="{\"WorkerHandle\":null,\"seq-id\":2,\"created-at\":\"2022-05-10T04:18:37.994Z\",\"updated-at\":\"2022-05-10T04:18:38.017Z\",\"project-id\":\"\",\"id\":\"d571bccf-e8c8-4880-92af-289fa44e12ce\",\"type\":3,\"status\":2,\"node-id\":\"485937b6-69d4-4328-ba4b-e43971f08602\",\"addr\":\"server-executor-0:10241\",\"epoch\":3,\"config\":\"eyJqb2ItbmFtZSI6InRlc3Qtbm9kZS1mYWlsdXJlIiwid29ya2VyLWNvdW50Ijo0LCJ0YXJnZXQtdGljayI6MTAwMDAwMDAsImV0Y2Qtd2F0Y2gtZW5hYmxlIjp0cnVlLCJldGNkLWVuZHBvaW50cyI6WyJ1c2VyLWV0Y2Qtc3RhbmRhbG9uZToyMzc5Il0sImV0Y2Qtd2F0Y2gtcHJlZml4IjoiL2Zha2Utam9iL3Rlc3QvIn0=\"}"]
2022-05-10T04:20:12.5749143Z server-master-1_1 | 2022-05-10T04:19:53.285235942Z [2022/05/10 04:19:53.285 +00:00] [INFO] [client_manager.go:61] ["client manager adds executor"] [id=aae85123-77eb-48a5-876b-7dcf96801583] [addr=server-executor-1:10241]
2022-05-10T04:20:12.5751229Z server-master-1_1 | 2022-05-10T04:19:53.289124854Z [2022/05/10 04:19:53.288 +00:00] [PANIC] [worker_manager.go:320] ["worker already exists"] [worker-id=d571bccf-e8c8-4880-92af-289fa44e12ce] [stack="github.com/hanfei1991/microcosm/lib/master.(*WorkerManager).BeforeStartingWorker\n\t/dataflow-engine/lib/master/worker_manager.go:320\ngithub.com/hanfei1991/microcosm/lib.(*DefaultBaseMaster).CreateWorker.func1.2\n\t/dataflow-engine/lib/master.go:559\ngithub.com/hanfei1991/microcosm/client.(*TaskDispatcher).DispatchTask\n\t/dataflow-engine/client/task_dispatcher.go:77\ngithub.com/hanfei1991/microcosm/lib.(*DefaultBaseMaster).CreateWorker.func1\n\t/dataflow-engine/lib/master.go:558"]
2022-05-10T04:20:12.5752439Z server-master-1_1 | 2022-05-10T04:19:53.291603061Z panic: worker already exists
2022-05-10T04:20:12.5752872Z server-master-1_1 | 2022-05-10T04:19:53.291621061Z
2022-05-10T04:20:12.5753955Z server-master-1_1 | 2022-05-10T04:19:53.291624961Z goroutine 1510 [running]:
2022-05-10T04:20:12.5808276Z server-master-1_1 | 2022-05-10T04:19:53.291628161Z go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc001134240, {0xc00261ec00, 0x1, 0x1})
2022-05-10T04:20:12.5823722Z server-master-1_1 | 2022-05-10T04:19:53.291647461Z  /go/pkg/mod/go.uber.org/zap@v1.21.0/zapcore/entry.go:232 +0x44c
2022-05-10T04:20:12.5824665Z server-master-1_1 | 2022-05-10T04:19:53.291651961Z go.uber.org/zap.(*Logger).Panic(0x1f66a20?, {0x23362cb?, 0xc001ffbc80?}, {0xc00261ec00, 0x1, 0x1})
2022-05-10T04:20:12.5825298Z server-master-1_1 | 2022-05-10T04:19:53.291752161Z  /go/pkg/mod/go.uber.org/zap@v1.21.0/logger.go:230 +0x59
2022-05-10T04:20:12.5826069Z server-master-1_1 | 2022-05-10T04:19:53.291763361Z github.com/hanfei1991/microcosm/lib/master.(*WorkerManager).BeforeStartingWorker(0xc00111b040, {0xc001ffbc80, 0x24}, {0xc001ffbe00, 0x24})
2022-05-10T04:20:12.5826749Z server-master-1_1 | 2022-05-10T04:19:53.291767061Z  /dataflow-engine/lib/master/worker_manager.go:320 +0x1fa
2022-05-10T04:20:12.5827391Z server-master-1_1 | 2022-05-10T04:19:53.291770461Z github.com/hanfei1991/microcosm/lib.(*DefaultBaseMaster).CreateWorker.func1.2()
2022-05-10T04:20:12.5827996Z server-master-1_1 | 2022-05-10T04:19:53.291773761Z  /dataflow-engine/lib/master.go:559 +0x34
2022-05-10T04:20:12.5828764Z server-master-1_1 | 2022-05-10T04:19:53.291777061Z github.com/hanfei1991/microcosm/client.(*TaskDispatcher).DispatchTask(0xc001116cc0?, {0x284afa0, 0xc002615e60}, 0xc0024d1840, 0xc002687ec0, 0xc001055a60)
2022-05-10T04:20:12.5829656Z server-master-1_1 | 2022-05-10T04:19:53.291780661Z  /dataflow-engine/client/task_dispatcher.go:77 +0x12d
2022-05-10T04:20:12.5830283Z server-master-1_1 | 2022-05-10T04:19:53.291783861Z github.com/hanfei1991/microcosm/lib.(*DefaultBaseMaster).CreateWorker.func1()
2022-05-10T04:20:12.5831441Z server-master-1_1 | 2022-05-10T04:19:53.291787061Z  /dataflow-engine/lib/master.go:558 +0x74c
2022-05-10T04:20:12.6519201Z server-master-1_1 | 2022-05-10T04:19:53.291790761Z created by github.com/hanfei1991/microcosm/lib.(*DefaultBaseMaster).CreateWorker
2022-05-10T04:20:12.6520005Z server-master-1_1 | 2022-05-10T04:19:53.291794362Z  /dataflow-engine/lib/master.go:519 +0x758

codecov-commenter commented 2 years ago

Codecov Report

Merging #375 (f6b67cd) into master (487aaab) will increase coverage by 0.0495%. The diff coverage is 0.0000%.

@@               Coverage Diff                @@
##             master       #375        +/-   ##
================================================
+ Coverage   55.0827%   55.1323%   +0.0495%     
================================================
  Files           135        135                
  Lines         10575      10580         +5     
================================================
+ Hits           5825       5833         +8     
- Misses         4301       4303         +2     
+ Partials        449        444         -5