The following investigations took place:
- I checked stderr/stdout of the nodes for errors, but found none.
- I checked that the transport servers were available during the time of the node shutdown; their unavailability during a run has caused sudden node kills in the past. The transport servers were up.
- @paul and I checked the raiden service logs at the time of failure but found no issue.
- I checked the syslog, but found no irregularities at the time of execution.
- I checked the metrics of the node: neither CPU nor memory was anywhere near its maximum capacity.
MS4 from today also fails when performing the close-channel REST-API call:
Logs:
OK, the new logging worked:
Raiden node 8 died with non-zero exit status: -9

`-9` is SIGKILL. This means something actively killed the node. The only thing I can think of right now that would do this automatically is the OOM killer (out-of-memory killer).
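For context, a quick illustrative check (not taken from the original logs): a negative exit status is the number of the terminating signal, and signal 9 is SIGKILL, which a process cannot catch or handle.

```sh
# Illustrative only: signal 9 is SIGKILL; a process killed by it is reported
# with exit status -9 (or 137 = 128 + 9 in shell conventions).
kill -l 9         # prints: KILL
echo $((128 + 9)) # prints: 137
```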
Nothing is reported in the syslog, which normally should have an entry if something was killed by the OOM killer (but since the SP run happens inside a Docker container, I don't know what the expected behaviour is).
However, the monitoring shows we never exceeded ~30% RAM usage on the system. The monitoring only samples data every minute, though, so a very sharp memory usage spike that led to the node being killed could be invisible on the monitoring.
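One way to confirm or rule out the OOM killer even when syslog shows nothing (a suggested check, not something from the original thread; the container name below is a placeholder):

```sh
# The kernel ring buffer keeps OOM-killer messages even if they never reach syslog.
dmesg -T | grep -iE 'out of memory|oom-killer|killed process'

# journald records kernel messages as well.
journalctl -k | grep -i oom

# Docker records whether a container was OOM-killed; "scenario-player" is a
# placeholder for the actual container name or ID.
docker inspect --format '{{.State.OOMKilled}}' scenario-player
```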
If the OOM killer killed a container, that should show up in the syslog. However, we can always limit the memory used for the containers we run. Since we do not constrain the memory at the moment, the OOM killer could possibly have killed the container. But as I said, it should still show up in the syslog.
Related docs: https://docs.docker.com/config/containers/resource_constraints/
Specifically, the documentation for the `--oom-kill-disable` option states:
By default, if an out-of-memory (OOM) error occurs, the kernel kills processes in a container. To change this behavior, use the --oom-kill-disable option. Only disable the OOM killer on containers where you have also set the -m/--memory option. If the -m flag is not set, the host can run out of memory and the kernel may need to kill the host system’s processes to free memory.
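A minimal sketch of what constraining the container could look like (the image name and the 4 GiB limit are placeholders; the flags are the standard Docker resource-constraint options from the page above):

```sh
# Cap the container at 4 GiB; with --memory-swap equal to --memory, no extra
# swap is allowed, so a runaway process gets killed inside the container (and
# the kill is attributable to that container) instead of pressuring the host.
docker run --memory=4g --memory-swap=4g scenario-player-image
```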
Looks like something like this also happened on the 15th of December for BF1, where it fails with:
"message": "HTTP status code \"500\" while fetching http://127.0.0.1:36201/api/v1/channels/0x62083c80353Df771426D209eF578619EE68D5C7A/0xae8037d15CE130D298611a937c1142f9c3A22189. Expected 200: Internal Server Error
Logs:
Happened in MS4 on the 15th of December with `Error performing REST-API call: close_channel`
Logs:
wow that happens a lot ...
Planning - we thought about the following steps to approach this bug:
SIGKILL
SIGKILL
(@nlsdfnbch will change the configuration so the docker container is not deleted automatically)

Thanks for uploading the logs! Actually, the scenario failed due to #5498 - you can see that in
FATAL: Processing Matrix response took 28.34271264076233s, poll timeout is 20s.
in the stdout of `node0`.
Problem Definition
Error performing REST-API call: transfer
Fails at line 82:
- transfer: {from: 0, to: 3, amount: 1_000_000_000_000_000, expected_http_status: 200}
/api/v1/payments/0x62083c80353Df771426D209eF578619EE68D5C7A/0x94E968d8c6De67288755A9F9a9b901B2a4b8cd01
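For reference, the failing step corresponds to a payment request against that endpoint; a rough manual equivalent would look like the following (node 0's REST API port is not in this excerpt, so `<api-port>` is a placeholder):

```sh
# Rough manual equivalent of the failing transfer step; <api-port> stands in
# for node 0's REST API port, which is not shown above.
curl -i -X POST \
  -H "Content-Type: application/json" \
  -d '{"amount": 1000000000000000}' \
  "http://127.0.0.1:<api-port>/api/v1/payments/0x62083c80353Df771426D209eF578619EE68D5C7A/0x94E968d8c6De67288755A9F9a9b901B2a4b8cd01"
```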
pfs8_mediator_goes_offline_2019_12_10.tar.gz