raiden-network / raiden


Error performing REST-API call / Sudden Node Death #5472

Closed: agatsoh closed this issue 4 years ago

agatsoh commented 4 years ago

Problem Definition

{
  "task": "<TransferTask: {'from': 0, 'to': 3, 'amount': 1000000000000000, 'expected_http_status': 200}>",
  "event": "Task errored",
  "logger": "scenario_player.tasks.base",
  "level": "error",
  "timestamp": "2019-12-10 09:27:10.193048",
  "exception": "Traceback (most recent call last):\n  File \"/usr/local/lib/python3.7/site-packages/urllib3/connection.py\", line 160, in _new_conn\n    (self._dns_host, self.port), self.timeout, **extra_kw)\n  File \"/usr/local/lib/python3.7/site-packages/urllib3/util/connection.py\", line 80, in create_connection\n    raise err\n  File \"/usr/local/lib/python3.7/site-packages/urllib3/util/connection.py\", line 70, in create_connection\n    sock.connect(sa)\n  File \"/usr/local/lib/python3.7/site-packages/gevent/_socket3.py\", line 335, in connect\n    raise error(err, strerror(err))\nConnectionRefusedError: [Errno 111] Connection refused\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n  File \"/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py\", line 603, in urlopen\n    chunked=chunked)\n  File \"/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py\", line 355, in _make_request\n    conn.request(method, url, **httplib_request_kw)\n  File \"/usr/local/lib/python3.7/http/client.py\", line 1229, in request\n    self._send_request(method, url, body, headers, encode_chunked)\n  File \"/usr/local/lib/python3.7/http/client.py\", line 1275, in _send_request\n    self.endheaders(body, encode_chunked=encode_chunked)\n  File \"/usr/local/lib/python3.7/http/client.py\", line 1224, in endheaders\n    self._send_output(message_body, encode_chunked=encode_chunked)\n  File \"/usr/local/lib/python3.7/http/client.py\", line 1016, in _send_output\n    self.send(msg)\n  File \"/usr/local/lib/python3.7/http/client.py\", line 956, in send\n    self.connect()\n  File \"/usr/local/lib/python3.7/site-packages/urllib3/connection.py\", line 183, in connect\n    conn = self._new_conn()\n  File \"/usr/local/lib/python3.7/site-packages/urllib3/connection.py\", line 169, in _new_conn\n    self, \"Failed to establish a new connection: %s\" % e)\nurllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7fd87ce34518>: Failed to establish a new connection: [Errno 111] Connection refused\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n  File \"/usr/local/lib/python3.7/site-packages/requests/adapters.py\", line 449, in send\n    timeout=timeout\n  File \"/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py\", line 641, in urlopen\n    _stacktrace=sys.exc_info()[2])\n  File \"/usr/local/lib/python3.7/site-packages/urllib3/util/retry.py\", line 399, in increment\n    raise MaxRetryError(_pool, url, error or ResponseError(cause))\nurllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='127.0.0.1', port=34477): Max retries exceeded with url: /api/v1/payments/0x62083c80353Df771426D209eF578619EE68D5C7A/0x94E968d8c6De67288755A9F9a9b901B2a4b8cd01 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fd87ce34518>: Failed to establish a new connection: [Errno 111] Connection refused'))\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n  File \"/usr/local/lib/python3.7/site-packages/scenario_player/tasks/api_base.py\", line 51, in _run\n    method=self._method, url=url, json=self._request_params, timeout=self._timeout\n  File \"/usr/local/lib/python3.7/site-packages/requests/sessions.py\", line 533, in request\n    resp = self.send(prep, **send_kwargs)\n  File \"/usr/local/lib/python3.7/site-packages/requests/sessions.py\", line 646, in send\n    r = 
adapter.send(request, **kwargs)\n  File \"/usr/local/lib/python3.7/site-packages/requests/adapters.py\", line 516, in send\n    raise ConnectionError(e, request=request)\nrequests.exceptions.ConnectionError: HTTPConnectionPool(host='127.0.0.1', port=34477): Max retries exceeded with url: /api/v1/payments/0x62083c80353Df771426D209eF578619EE68D5C7A/0x94E968d8c6De67288755A9F9a9b901B2a4b8cd01 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fd87ce34518>: Failed to establish a new connection: [Errno 111] Connection refused'))\n\nThe above exception was the direct cause of the following exception:\n\nTraceback (most recent call last):\n  File \"/usr/local/lib/python3.7/site-packages/scenario_player/tasks/base.py\", line 74, in __call__\n    return self._run(*args, **kwargs)\n  File \"/usr/local/lib/python3.7/site-packages/scenario_player/tasks/api_base.py\", line 56, in _run\n    raise RESTAPIError(f\"Error performing REST-API call: {self._name}\") from ex\nscenario_player.exceptions.legacy.RESTAPIError: Error performing REST-API call: transfer"
}
1. Error performing REST-API call: transfer

2. Fails at line 82 of the scenario:

   - transfer: {from: 0, to: 3, amount: 1_000_000_000_000_000, expected_http_status: 200}

3. while calling the API endpoint /api/v1/payments/0x62083c80353Df771426D209eF578619EE68D5C7A/0x94E968d8c6De67288755A9F9a9b901B2a4b8cd01 (see the request sketch below)
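Roughly, the failing transfer task boils down to a POST against the node's payments endpoint, as in the sketch below (port and addresses are copied from the traceback above; the scenario player's actual timeout and retry handling differ):

```python
import requests

# Values taken from the traceback above; the node's API port (34477) is
# assigned by the scenario player at runtime.
token = "0x62083c80353Df771426D209eF578619EE68D5C7A"
target = "0x94E968d8c6De67288755A9F9a9b901B2a4b8cd01"
url = f"http://127.0.0.1:34477/api/v1/payments/{token}/{target}"

# "Connection refused" here means nothing is listening on the API port any
# more, i.e. the node process itself was already gone when the call was made.
resp = requests.post(url, json={"amount": 1_000_000_000_000_000}, timeout=60)
assert resp.status_code == 200, resp.text
```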

pfs8_mediator_goes_offline_2019_12_10.tar.gz

deepbrook commented 4 years ago

The following investigations took place:

- I checked stderr/stdout of the nodes for errors, but found none.

- I checked that the transport servers were available at the time of the node shutdown; their being unavailable at the time of execution has caused sudden node kills in the past. The transport servers were up.

- @paul and I checked the Raiden service logs at the time of the failure but found no issue.

- I checked the syslog, but found no irregularities at the time of execution.

- I checked the metrics of the node: neither CPU nor memory were anywhere near their respective maximum capacities.

Dominik1999 commented 4 years ago

It also happened tonight in pfs8:

RESTAPIError(f\"Error performing REST-API call: {self._name}\") from ex\nscenario_player.exceptions.legacy.RESTAPIError: Error performing REST-API call: transfer"}

czepluch commented 4 years ago

MS3 also seems to fail like this from last night.

Logs:

czepluch commented 4 years ago

Same goes for MS2:

Logs:

czepluch commented 4 years ago

MS4 from today also fails when performing close channel REST-API call:

Logs:

czepluch commented 4 years ago

BF6 also fails like this, but the updated logging provides more info than on the previous days.

Logs:

ulope commented 4 years ago

Ok the new logging worked:

Raiden node 8 died with non-zero exit status: -9

-9 is SIGKILL. This means something actively killed the node. The only thing I can think of right now that would do this automatically is the OOM killer (out-of-memory killer).
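For context: in Python's subprocess handling a negative exit status means "terminated by signal N", which is how the SP can tell the node was killed from outside rather than exiting on its own. A minimal illustration (not the SP's actual code):

```python
import signal
import subprocess

# A child process that is killed from outside (e.g. by the OOM killer) shows
# up in Python's subprocess API as a negative returncode: -<signal number>.
proc = subprocess.Popen(["sleep", "60"])
proc.send_signal(signal.SIGKILL)   # stand-in for an external `kill -9` (POSIX only)
proc.wait()

print(proc.returncode)             # -9
assert proc.returncode == -signal.SIGKILL
```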

Nothing is reported in the syslog, which should normally have an entry if something was killed by the OOM killer (but since the SP run happens inside a Docker container I don't know what the expected behaviour is).

However, the monitoring shows we never exceeded ~30% RAM usage on the system. The monitoring only samples data every minute, though, so there could have been a very sharp memory usage spike that led to the node being killed and would be invisible in the monitoring.

deepbrook commented 4 years ago

If the OOM killer killed a container, that should show up in the syslog. However, we can always limit the memory used by the container we run. Since we do not constrain the memory at the moment, the OOM killer could possibly have killed the container. But as I said, it should still show up in the syslog.

Related docs: https://docs.docker.com/config/containers/resource_constraints/ Specifically, the --oom-kill-disable option states:

By default, if an out-of-memory (OOM) error occurs, the kernel kills processes in a container. To change this behavior, use the --oom-kill-disable option. Only disable the OOM killer on containers where you have also set the -m/--memory option. If the -m flag is not set, the host can run out of memory and the kernel may need to kill the host system’s processes to free memory.
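If we want to apply such a limit from Python, a sketch along these lines with the docker SDK should do it (the image name is a placeholder; mem_limit/memswap_limit correspond to the -m/--memory flags mentioned in the docs above; this is an illustration, not how the SP containers are currently started):

```python
import docker

client = docker.from_env()

# Hypothetical: start the container with a hard memory cap so that an OOM
# kill is confined to this container and attributed to it in the logs.
container = client.containers.run(
    "raiden-scenario-player:latest",  # placeholder image name
    detach=True,
    mem_limit="4g",                   # equivalent of `docker run -m 4g`
    memswap_limit="4g",               # memory + swap cap
)
print(container.id)
```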
czepluch commented 4 years ago

Happened again on the 14th of Dec for BF6 with RESTAPIError: Error performing REST-API call: assert.

Logs:

czepluch commented 4 years ago

Looks like something similar also happened on the 15th of December for BF1, where it fails with:

"message": "HTTP status code \"500\" while fetching http://127.0.0.1:36201/api/v1/channels/0x62083c80353Df771426D209eF578619EE68D5C7A/0xae8037d15CE130D298611a937c1142f9c3A22189. Expected 200: Internal Server Error

Logs:

czepluch commented 4 years ago

Also happened to BF6 on 15th of December with RESTAPIError: Error performing REST-API call: assert

Logs:

palango commented 4 years ago

https://github.com/raiden-network/raiden/issues/5472#issuecomment-566002792 is issue https://github.com/raiden-network/raiden/issues/5498

czepluch commented 4 years ago

Happened in MS4 on the 15th of December with Error performing REST-API call: close_channel

Logs:

czepluch commented 4 years ago

Happened in BF6 on the 16th of December with RESTAPIError: Error performing REST-API call: assert

Logs:

Dominik1999 commented 4 years ago

wow that happens a lot ...

Dominik1999 commented 4 years ago

Planning - We thought about the following steps to approach that bug:

czepluch commented 4 years ago

BF7 failed like this on 17th of Dec too

Logs:

czepluch commented 4 years ago

MS2 failed like this on Dec 17th.

Logs:

czepluch commented 4 years ago

BF6 failed like this on Dec 17th with RESTAPIError: Error performing REST-API call: assert

Logs:

Dominik1999 commented 4 years ago
czepluch commented 4 years ago

BF6 also failed on Dec 18th.

Logs:

Dominik1999 commented 4 years ago

Thanks for uploading the logs! Actually, the scenario failed due to #5498; you can see that in

FATAL: Processing Matrix response took 28.34271264076233s, poll timeout is 20s.

in the stdout of node0.
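For reference, that FATAL line amounts to a check of this shape: if handling a single Matrix /sync response takes longer than the long-poll timeout itself, the node cannot keep up and this is treated as fatal. A paraphrased sketch only (the real check lives in the Matrix transport, see #5498):

```python
import time

POLL_TIMEOUT = 20.0  # seconds, the /sync long-poll timeout from the log line


def handle_sync_response(response, process):
    """Paraphrased sketch of the check behind the FATAL message above."""
    start = time.monotonic()
    process(response)
    elapsed = time.monotonic() - start
    if elapsed > POLL_TIMEOUT:
        raise RuntimeError(
            f"FATAL: Processing Matrix response took {elapsed}s, "
            f"poll timeout is {POLL_TIMEOUT}s."
        )


# Fast processing stays below the timeout and passes silently.
handle_sync_response({"rooms": {}}, lambda r: time.sleep(0.1))
```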