raiden-network / scenario-player

MIT License
1 stars 16 forks source link

Don't report nodes killed by the scenario player as failures #474

Closed hackaugusto closed 4 years ago

hackaugusto commented 4 years ago

Do this after https://github.com/raiden-network/scenario-player/issues/468.

Under some circumstances a node will die and the scenario player will stop the run, if the stop takes longer than expected a SIGKILL is sent. When this happens the scenario player should not report the death of the node as a failure, since that is masquerading the real problem.

Dominik1999 commented 4 years ago

So we want the scenario player to continue the run even after is has sent a SIGKILL to a node. So the SP should log that killing, but continue to run the scenario.

Dominik1999 commented 4 years ago

@ulope

hackaugusto commented 4 years ago

So we want the scenario player to continue the run even after is has sent a SIGKILL to a node

This is the case, the problem is that the scenario player wrongly reports the kill it has done as an error.

Dominik1999 commented 4 years ago

We need to adjust that code https://github.com/raiden-network/scenario-player/blob/a8e8c16bb0b010289f2267cb667b44cb3149f926/scenario_player/node_support.py#L277

such that the scenario player should stop the execution, but atm the wrong error is reported.

A potential approach is:

Dominik1999 commented 4 years ago
Dominik1999 commented 4 years ago

Are you working on this issue atm, @konradkonrad ?

Dominik1999 commented 4 years ago

It looks like atm the Scenario Player does not report a node death as a failure

here http://scenario-player.ci.raiden.network/scenarios/bf1_basic_functionality/scenario-player-run_bf1_basic_functionality_2020-02-25T00%3A29%3A30.log.gz

we see Node error of Node 1 at 2020-02-25 01:40:24

whereas the scenario continues working 5 minutes more until 2020-02-25 01:40:24.930022

Dominik1999 commented 4 years ago

This issue seems not to be solved. Tonight all scenarios failed because of the same reason, the pfs was not online. But some of the nodes stopped by 1 and other by 9. So the Scenario Player does seem to still report a node being killed (that indicates the -9) as a failure.

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Scenario scenario-player-run_pfs5_too_low_capacity_2020-03-04T21:21:29.log.gz failed
"Failed to start Raiden node 2: The process invoked by the <scenario_player.utils.legacy.HTTPExecutor: \"/usr/local/bin/raiden --accept-disclaimer --datadir /data/scenarios/pfs5_too_low_capacity/node_148_002 --keystore-path /data/scenarios/pfs5_too_low_capacity/node_148_002/keys --address 0x570d8775D0BFe1E43AC93468F61835BC8E762C86 --password-file /data/scenarios/pfs5_too_low_capacity/node_148_002/password.txt --network-id 5 --sync-check --gas-price fast --eth-rpc-endpoint http://parity.goerli.ethnodes.brainbot.com:8545 --log-config :info,raiden:debug,raiden_contracts:debug,raiden.api.rest.pywsgi:warning --log-json --log-file /data/scenarios/pfs5_too_low_capacity/node_148_002/run-148.log --disable-debug-logfile --matrix-server auto --api-address 127.0.0.1:33081 --no-web-ui --pathfinding-service-address https://pfs-goerli-with-fee.services-dev.raiden.network --gas-price fast --proportional-fee 0x59105441977ecD9d805A4f5b060E34676F50F806 0 --environment-type development --pathfinding-max-fee 100 --routing-mode pfs --pathfinding-max-paths 1 --proportional-imbalance-fee 0x59105441977ecD9d805A4f5b060E34676F50F806 0\" 0x7f38a9ac8278> executor has exited with a non-zero code: -9."
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
ReturnCode.FATAL
ReturnCode.FATAL
ReturnCode.FATAL
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Scenario scenario-player-run_pfs7_multiple_payments_2020-03-04T21:22:44.log.gz failed
"Failed to start Raiden node 2: The process invoked by the <scenario_player.utils.legacy.HTTPExecutor: \"/usr/local/bin/raiden --accept-disclaimer --datadir /data/scenarios/pfs7_multiple_payments/node_36_002 --keystore-path /data/scenarios/pfs7_multiple_payments/node_36_002/keys --address 0x80540e19ed101a9B9Ea481d070f3e3Ea74e09E06 --password-file /data/scenarios/pfs7_multiple_payments/node_36_002/password.txt --network-id 5 --sync-check --gas-price fast --eth-rpc-endpoint http://parity.goerli.ethnodes.brainbot.com:8545 --log-config :info,raiden:debug,raiden_contracts:debug,raiden.api.rest.pywsgi:warning --log-json --log-file /data/scenarios/pfs7_multiple_payments/node_36_002/run-036.log --disable-debug-logfile --matrix-server auto --api-address 127.0.0.1:42513 --no-web-ui --pathfinding-service-address https://pfs-goerli-with-fee.services-dev.raiden.network --routing-mode pfs --gas-price fast --pathfinding-max-fee 100 --proportional-imbalance-fee 0x59105441977ecD9d805A4f5b060E34676F50F806 0 --pathfinding-max-paths 5 --environment-type development --proportional-fee 0x59105441977ecD9d805A4f5b060E34676F50F806 0\" 0x7f7996cd1438> executor has exited with a non-zero code: -9."
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
ReturnCode.FATAL
ReturnCode.FATAL
ReturnCode.FATAL
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Scenario scenario-player-run_ms2_simple_monitoring_2020-03-04T21:17:58.log.gz failed
"Failed to start Raiden node 0: The process invoked by the <scenario_player.utils.legacy.HTTPExecutor: \"/usr/local/bin/raiden --accept-disclaimer --datadir /data/scenarios/ms2_simple_monitoring/node_68_000 --keystore-path /data/scenarios/ms2_simple_monitoring/node_68_000/keys --address 0x4F6b94c616bC26A2149fba304a11aFba1Eb0ac06 --password-file /data/scenarios/ms2_simple_monitoring/node_68_000/password.txt --network-id 5 --sync-check --gas-price fast --eth-rpc-endpoint http://parity.goerli.ethnodes.brainbot.com:8545 --log-config :info,raiden:debug,raiden_contracts:debug,raiden.api.rest.pywsgi:warning --log-json --log-file /data/scenarios/ms2_simple_monitoring/node_68_000/run-068.log --disable-debug-logfile --matrix-server auto --api-address 127.0.0.1:43227 --no-web-ui --pathfinding-service-address https://pfs-goerli.services-dev.raiden.network --enable-monitoring --proportional-imbalance-fee 0x59105441977ecD9d805A4f5b060E34676F50F806 0 --default-settle-timeout 40 --default-reveal-timeout 20 --proportional-fee 0x59105441977ecD9d805A4f5b060E34676F50F806 0 --environment-type development --gas-price fast\" 0x7fd234027080> executor has exited with a non-zero code: 1."
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
ReturnCode.FATAL
ReturnCode.FATAL
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Scenario scenario-player-run_pfs6_simple_path_rewards_2020-03-04T21:22:44.log.gz failed
"Failed to start Raiden node 2: The process invoked by the <scenario_player.utils.legacy.HTTPExecutor: \"/usr/local/bin/raiden --accept-disclaimer --datadir /data/scenarios/pfs6_simple_path_rewards/node_44_002 --keystore-path /data/scenarios/pfs6_simple_path_rewards/node_44_002/keys --address 0x031365F3aDD6dfC8a304f7C27Ed617E204AcE7E1 --password-file /data/scenarios/pfs6_simple_path_rewards/node_44_002/password.txt --network-id 5 --sync-check --gas-price fast --eth-rpc-endpoint http://parity.goerli.ethnodes.brainbot.com:8545 --log-config :info,raiden:debug,raiden_contracts:debug,raiden.api.rest.pywsgi:warning --log-json --log-file /data/scenarios/pfs6_simple_path_rewards/node_44_002/run-044.log --disable-debug-logfile --matrix-server auto --api-address 127.0.0.1:33001 --no-web-ui --pathfinding-service-address https://pfs-goerli-with-fee.services-dev.raiden.network --pathfinding-max-paths 5 --pathfinding-max-fee 100 --gas-price fast --environment-type development --proportional-fee 0x59105441977ecD9d805A4f5b060E34676F50F806 0 --proportional-imbalance-fee 0x59105441977ecD9d805A4f5b060E34676F50F806 0 --routing-mode pfs\" 0x7f24ab1356d8> executor has exited with a non-zero code: 1."
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
ReturnCode.FATAL
ReturnCode.FATAL
ReturnCode.FATAL
ReturnCode.FATAL
karlb commented 4 years ago

Apparently https://github.com/raiden-network/scenario-player/pull/498 does not always show the relevant exception when one node fails and the other are killed. In scenario-player-run_pfs7_multiple_payments_2020-03-04T21:22:44.log.gz, one node could not access the PFS, but the exception from a killed node is shown.

Talk to @konradkonrad about this.

karlb commented 4 years ago

The few tests I did with the new janitor looked good. But the nightly SP runs are the more informative test, so I'll have a look at those next.

hackaugusto commented 4 years ago

This is not fixed, nodes are reported with Process did not exit cleanly, example from scenario-player-run_bf7_long_path_2020-05-10T22:52:32.log.gz

{"exit_code": 1, "communicate": [null, null], "event": "Process did not exit cleanly", "logger": "raiden.utils.nursery", "level": "warning", "timestamp": "2020-05-10 22:53:00.054334"}
{"exit_code": 1, "communicate": [null, null], "event": "Process did not exit cleanly", "logger": "raiden.utils.nursery", "level": "warning", "timestamp": "2020-05-10 22:53:00.055202"}
{"exit_code": 1, "communicate": [null, null], "event": "Process did not exit cleanly", "logger": "raiden.utils.nursery", "level": "warning", "timestamp": "2020-05-10 22:53:00.055805"}
{"exit_code": 1, "communicate": [null, null], "event": "Process did not exit cleanly", "logger": "raiden.utils.nursery", "level": "warning", "timestamp": "2020-05-10 22:53:00.056237"}
{"exit_code": 1, "communicate": [null, null], "event": "Process did not exit cleanly", "logger": "raiden.utils.nursery", "level": "warning", "timestamp": "2020-05-10 22:53:00.056772"}
{"exit_code": 1, "communicate": [null, null], "event": "Process did not exit cleanly", "logger": "raiden.utils.nursery", "level": "warning", "timestamp": "2020-05-10 22:53:00.057258"}
{"exit_code": 1, "communicate": [null, null], "event": "Process did not exit cleanly", "logger": "raiden.utils.nursery", "level": "warning", "timestamp": "2020-05-10 22:53:00.057739"}
{"exit_code": 1, "communicate": [null, null], "event": "Process did not exit cleanly", "logger": "raiden.utils.nursery", "level": "warning", "timestamp": "2020-05-10 22:53:00.058130"}
{"exit_code": 1, "communicate": [null, null], "event": "Process did not exit cleanly", "logger": "raiden.utils.nursery", "level": "warning", "timestamp": "2020-05-10 22:53:00.058549"}
{"exit_code": 1, "communicate": [null, null], "event": "Process did not exit cleanly", "logger": "raiden.utils.nursery", "level": "warning", "timestamp": "2020-05-10 22:53:00.059117"}
{"exit_code": 1, "communicate": [null, null], "event": "Process did not exit cleanly", "logger": "raiden.utils.nursery", "level": "warning", "timestamp": "2020-05-10 22:53:00.059753"}
{"exit_code": 1, "communicate": [null, null], "event": "Process did not exit cleanly", "logger": "raiden.utils.nursery", "level": "warning", "timestamp": "2020-05-10 22:53:00.104362"}
{"exit_code": 1, "communicate": [null, null], "event": "Process did not exit cleanly", "logger": "raiden.utils.nursery", "level": "warning", "timestamp": "2020-05-10 22:53:00.105203"}
{"exit_code": 1, "communicate": [null, null], "event": "Process did not exit cleanly", "logger": "raiden.utils.nursery", "level": "warning", "timestamp": "2020-05-10 22:53:00.105660"}
karlb commented 4 years ago

This is not fixed, nodes are reported with Process did not exit cleanly, example from scenario-player-run_bf7_long_path_2020-05-10T22:52:32.log.gz

We expect the raiden nodes to shut down cleanly after receiving a SIGINT. This did not happen (exit code 1), so this warning is printed. This seems reasonable SP behavior to me and I would blame the raiden nodes for not shutting down cleanly. Do you disagree?

But I do consider https://github.com/raiden-network/raiden/issues/6128 an example of cases where we have error messages for killed raiden nodes we do not want to see, so I'll investigate that.