raokrutarth / manufacturing-os

A POC distributed operating system for geo-seperated automotive manufacturing
Creative Commons Zero v1.0 Universal
1 stars 0 forks source link

Crash and Restart: handling all threads stop and restart #105

Closed chuanqichen closed 4 years ago

chuanqichen commented 4 years ago

WARNING 22:57:33 MainThread ops_generator.py:64::send_update_dep | Node 1 to update dependency WARNING 22:57:35 test ops runner-1 processes.py:161::on_recover | Recovering node 1 DEBUG 22:57:35 test ops runner-1 processes.py:193::_attempt_log_recovery | Node 1 succeeds in recovery heartbeat from log DEBUG 22:57:35 test ops runner-1 sc_stage.py:106::_attempt_log_recovery | Node 1 succeeds in recovery from sc stage log WARNING 22:57:36 MainThread ops_generator.py:64::send_update_dep | Node 2 to update dependency WARNING 22:57:48 test ops runner-4 processes.py:161::on_recover | Recovering node 4 DEBUG 22:57:49 test ops runner-4 processes.py:193::_attempt_log_recovery | Node 4 succeeds in recovery heartbeat from log DEBUG 22:57:49 test ops runner-4 sc_stage.py:106::_attempt_log_recovery | Node 4 succeeds in recovery from sc stage log WARNING 22:57:49 test ops runner-2 processes.py:161::on_recover | Recovering node 2 DEBUG 22:57:50 test ops runner-2 processes.py:193::_attempt_log_recovery | Node 2 succeeds in recovery heartbeat from log DEBUG 22:57:50 test ops runner-2 sc_stage.py:106::_attempt_log_recovery | Node 2 succeeds in recovery from sc stage log WARNING 22:57:53 MainThread ops_generator.py:64::send_update_dep | Node 9 to update dependency WARNING 22:58:04 test ops runner-0 processes.py:161::on_recover | Recovering node 0 DEBUG 22:58:04 test ops runner-0 processes.py:193::_attempt_log_recovery | Node 0 succeeds in recovery heartbeat from log DEBUG 22:58:04 test ops runner-0 sc_stage.py:106::_attempt_log_recovery | Node 0 succeeds in recovery from sc stage log WARNING 22:58:07 test ops runner-2 processes.py:161::on_recover | Recovering node 2 DEBUG 22:58:07 test ops runner-2 processes.py:193::_attempt_log_recovery | Node 2 succeeds in recovery heartbeat from log DEBUG 22:58:07 test ops runner-2 sc_stage.py:106::_attempt_log_recovery | Node 2 succeeds in recovery from sc stage log

chuanqichen commented 4 years ago

EBUG 00:30:49 test ops runner-9 processes.py:193::_attempt_log_recovery | Node 9 succeeds in recovery heartbeat from log INFO 00:30:49 test ops runner-9 sc_stage.py:110::_attempt_inbound_recovery | Node 9 attempting post-crash recovery to verify status of batches in inbound WAL of size 0 INFO 00:30:49 test ops runner-9 sc_stage.py:128::_attempt_outbound_recovery | Node 9 attempting post-crash recovery to verify status of batches in outbound WAL of size 48 DEBUG 00:30:50 test ops runner-5 processes.py:193::_attempt_log_recovery | Node 5 succeeds in recovery heartbeat from log INFO 00:30:50 test ops runner-5 sc_stage.py:110::_attempt_inbound_recovery | Node 5 attempting post-crash recovery to verify status of batches in inbound WAL of size 0 INFO 00:30:50 test ops runner-5 sc_stage.py:128::_attempt_outbound_recovery | Node 5 attempting post-crash recovery to verify status of batches in outbound WAL of size 17 DEBUG 00:30:53 test ops runner-8 processes.py:193::_attempt_log_recovery | Node 8 succeeds in recovery heartbeat from log INFO 00:30:53 test ops runner-8 sc_stage.py:110::_attempt_inbound_recovery | Node 8 attempting post-crash recovery to verify status of batches in inbound WAL of size 0 INFO 00:30:53 test ops runner-8 sc_stage.py:128::_attempt_outbound_recovery | Node 8 attempting post-crash recovery to verify status of batches in outbound WAL of size 37 WARNING 00:30:54 test ops runner-9 processes.py:192::_attempt_log_recovery | ### Node 9 has issues to recovery from log WARNING 00:30:54 test ops runner-9 processes.py:192::_attempt_log_recovery | ### Node 9 has issues to recovery from log WARNING 00:30:54 test ops runner-9 processes.py:192::_attempt_log_recovery | ### Node 9 has issues to recovery from log WARNING 00:30:54 test ops runner-9 processes.py:192::_attempt_log_recovery | ### Node 9 has issues to recovery from log WARNING 00:30:54 test ops runner-9 processes.py:192::_attempt_log_recovery | ### Node 9 has issues to recovery from log WARNING 00:30:54 test ops runner-9 processes.py:192::_attempt_log_recovery | ### Node 9 has issues to recovery from log WARNING 00:30:54 test ops runner-9 processes.py:192::_attempt_log_recovery | ### Node 9 has issues to recovery from log WARNING 00:30:54 test ops runner-9 processes.py:192::_attempt_log_recovery | ### Node 9 has issues to recovery from log WARNING 00:30:54 test ops runner-9 processes.py:192::_attempt_log_recovery | ### Node 9 has issues to recovery from log WARNING 00:30:54 test ops runner-9 processes.py:192::_attempt_log_recovery | ### Node 9 has issues to recovery from log WARNING 00:30:54 test ops runner-9 processes.py:192::_attempt_log_recovery | ### Node 9 has issues to recovery from log DEBUG 00:30:54 test ops runner-9 processes.py:193::_attempt_log_recovery | Node 9 succeeds in recovery heartbeat from log INFO 00:30:54 test ops runner-9 sc_stage.py:110::_attempt_inbound_recovery | Node 9 attempting post-crash recovery to verify status of batches in inbound WAL of size 0 INFO 00:30:54 test ops runner-9 sc_stage.py:128::_attempt_outbound_recovery | Node 9 attempting post-crash recovery to verify status of batches in outbound WAL of size 48 DEBUG 00:31:00 test ops runner-7 processes.py:193::_attempt_log_recovery | Node 7 succeeds in recovery heartbeat from log INFO 00:31:00 test ops runner-7 sc_stage.py:110::_attempt_inbound_recovery | Node 7 attempting post-crash recovery to verify status of batches in inbound WAL of size 7 INFO 00:31:00 test ops runner-7 sc_stage.py:128::_attempt_outbound_recovery | Node 7 attempting post-crash recovery to verify status of batches in outbound WAL of size 4 DEBUG 00:31:06 test ops runner-6 processes.py:193::_attempt_log_recovery | Node 6 succeeds in recovery heartbeat from log INFO 00:31:06 test ops runner-6 sc_stage.py:110::_attempt_inbound_recovery | Node 6 attempting post-crash recovery to verify status of batches in inbound WAL of size 0 INFO 00:31:06 test ops runner-6 sc_stage.py:128::_attempt_outbound_recovery | Node 6 attempting post-crash recovery to verify status of batches in outbound WAL of size 5 DEBUG 00:31:11 test ops runner-8 processes.py:193::_attempt_log_recovery | Node 8 succeeds in recovery heartbeat from log INFO 00:31:11 test ops runner-8 sc_stage.py:110::_attempt_inbound_recovery | Node 8 attempting post-crash recovery to verify status of batches in inbound WAL of size 0 INFO 00:31:11 test ops runner-8 sc_stage.py:128::_attempt_outbound_recovery | Node 8 attempting post-crash recovery to verify status of batches in outbound WAL of size 37 DEBUG 00:31:19 test ops runner-10 processes.py:193::_attempt_log_recovery | Node 10 succeeds in recovery heartbeat from log INFO 00:31:19 test ops runner-10 sc_stage.py:110::_attempt_inbound_recovery | Node 10 attempting post-crash recovery to verify status of batches in inbound WAL of size 6 INFO 00:31:19 test ops runner-10 sc_stage.py:128::_attempt_outbound_recovery | Node 10 attempting post-crash recovery to verify status of batches in outbound WAL of size 2 DEBUG 00:31:21 test ops runner-4 processes.py:193::_attempt_log_recovery | Node 4 succeeds in recovery heartbeat from log INFO 00:31:21 test ops runner-4 sc_stage.py:110::_attempt_inbound_recovery | Node 4 attempting post-crash recovery to verify status of batches in inbound WAL of size 4 INFO 00:31:21 test ops runner-4 sc_stage.py:128::_attempt_outbound_recovery | Node 4 attempting post-crash recovery to verify status of batches in outbound WAL of size 2 DEBUG 00:31:28 test ops runner-8 processes.py:193::_attempt_log_recovery | Node 8 succeeds in recovery heartbeat from log INFO 00:31:28 test ops runner-8 sc_stage.py:110::_attempt_inbound_recovery | Node 8 attempting post-crash recovery to verify status of batches in inbound WAL of size 0 INFO 00:31:28 test ops runner-8 sc_stage.py:128::_attempt_outbound_recovery | Node 8 attempting post-crash recovery to verify status of batches in outbound WAL of size 37

chuanqichen commented 4 years ago

Overnight run logging looks good: WARNING 00:36:23 MainThread ops_generator.py:44::kill_node | Node 2 to be killed WARNING 00:36:23 test ops runner-2 processes.py:152::on_kill | Killing node 2 CRITICAL 00:36:23 test ops runner-2 sc_stage.py:170::stop | Node 2's stage is being stopped ... WARNING 00:36:27 subscriber-2 processes.py:117::onMessage | Node 2 got message Message(from:5, to:-1, type:MsgType.Request, action:Action.Heartbeat) but node is marked inactive so ignoring message ... WARNING 00:36:29 MainThread ops_generator.py:54::recover_node | Node 2 to be recovered ... WARNING 00:36:29 test ops runner-2 processes.py:157::on_recover | Recovering node 2

DEBUG 00:36:29 test ops runner-2 processes.py:193::_attempt_log_recovery | Node 2 succeeds in recovery heartbeat from log DEBUG 00:36:29 subscriber-4 messages.py:465::on_heartbeat_resp | Received Heartbeat Response from 8: on 4 INFO 00:36:29 test ops runner-2 sc_stage.py:94::_attempt_log_recovery | Node 2's batch status WALs are empty. Skipping post-crash item and material status verification DEBUG 00:36:29 test ops runner-2 cluster.py:223::bootstrap_flow | Bootstrapping flow has started. INFO 00:36:29 heartbeat-2 threads.py:151::run | Node 2 sending heartbeat Message(from:2, to:-1, type:MsgType.Request, action:Action.Heartbeat) INFO 00:36:29 test ops runner-2 cluster.py:228::bootstrap_flow | Cluster flow with all possible paths created: ClusterWideFlow(...)

INFO 00:36:29 heartbeat-2 messages.py:359::sendMessage | sending message Message(from:2, to:-1, type:MsgType.Request, action:Action.Heartbeat) from node 2 INFO 00:36:29 test ops runner-2 cluster.py:244::bootstrap_flow | Cluster flow with one specific path created: ClusterWideFlow( incoming edges:{0: [], 1: [(7, ItemReq(Item(type:4), quantity:1)), (5, ItemReq(Item(type:3), quantity:1))], 2: [(1, ItemReq(Item(type:1), quantity:1)), (7, ItemReq(Item(type:4), quantity:1))], 3: [], 4: [], 5: [(0, ItemReq(Item(type:0), quantity:1))], 6: [], 7: [(0, ItemReq(Item(type:0), quantity:1))], 8: [(2, ItemReq(Item(type:2), quantity:1))]} outgoing edges:{0: [(7, ItemReq(Item(type:0), quantity:1)), (5, ItemReq(Item(type:0), quantity:1))], 1: [(2, ItemReq(Item(type:1), quantity:1))], 2: [(8, ItemReq(Item(type:2), quantity:1))], 3: [], 4: [], 5: [(1, ItemReq(Item(type:3), quantity:1))], 6: [], 7: [(1, ItemReq(Item(type:4), quantity:1)), (2, ItemReq(Item(type:4), quantity:1))], 8: []} node IDs: [0, 1, 2, 3, 4, 5, 6, 7, 8] ) INFO 00:36:29 test ops runner-2 processes.py:169::update_flow | Node 2 updated flow due to node 2

raokrutarth commented 4 years ago

please use code bloacks like below:

code/logs here are more readadble since they are not line wrapped