Burst retrying websocket connection causes gateway down

esevan commented 5 years ago

Since #42 does not limit the retry count, I found EG was OOM killed by too many retrying request to connect to unknown kernel_id.

I couldn't find why it requested for unknown kernel_id, but I think this can be prevented by adding exponential backoff algorithm and limiting the retry count.

I lost the error log of nb2kg, so attach the following EG log instead. As I remember, nb2kg was writing the logs like Attempting to re-establish... over and over.

[D 2019-07-24 23:56:26.628 EnterpriseGatewayApp] Initializing websocket connection /api/kernels/a91a8444-2847-41bc-9e1d-dcffd3884337/channels
[W 2019-07-24 23:56:26.629 EnterpriseGatewayApp] No session ID specified
[W 190724 23:56:26 web:1667] 404 GET /api/kernels/a91a8444-2847-41bc-9e1d-dcffd3884337/channels (10.233.101.97): Kernel does not exist: a91a8444-2847-41bc-9e1d-dcffd3884337
[D 2019-07-24 23:56:26.629 EnterpriseGatewayApp] Using contents: services/contents
[E 190724 23:56:26 web:1114] Uncaught exception in write_error
    Traceback (most recent call last):
      File "/opt/conda/lib/python3.7/site-packages/tornado/web.py", line 1592, in _execute
        result = yield result
      File "/opt/conda/lib/python3.7/site-packages/tornado/gen.py", line 1133, in run
        value = future.result()
      File "/opt/conda/lib/python3.7/site-packages/tornado/gen.py", line 1141, in run
        yielded = self.gen.throw(*exc_info)
      File "/opt/conda/lib/python3.7/site-packages/notebook/services/kernels/handlers.py", line 241, in get
        yield super(ZMQChannelsHandler, self).get(kernel_id=kernel_id)
      File "/opt/conda/lib/python3.7/site-packages/tornado/gen.py", line 1133, in run
        value = future.result()
      File "/opt/conda/lib/python3.7/site-packages/tornado/gen.py", line 1141, in run
        yielded = self.gen.throw(*exc_info)
      File "/opt/conda/lib/python3.7/site-packages/notebook/base/zmqhandlers.py", line 283, in get
        yield gen.maybe_future(res)
      File "/opt/conda/lib/python3.7/site-packages/tornado/gen.py", line 1133, in run
        value = future.result()
      File "/opt/conda/lib/python3.7/site-packages/tornado/gen.py", line 1147, in run
        yielded = self.gen.send(value)
      File "/opt/conda/lib/python3.7/site-packages/notebook/services/kernels/handlers.py", line 223, in pre_get
        kernel = self.kernel_manager.get_kernel(self.kernel_id)
      File "/opt/conda/lib/python3.7/site-packages/jupyter_client/multikernelmanager.py", line 237, in get_kernel
        self._check_kernel_id(kernel_id)
      File "/opt/conda/lib/python3.7/site-packages/notebook/services/kernels/kernelmanager.py", line 387, in _check_kernel_id
        raise web.HTTPError(404, u'Kernel does not exist: %s' % kernel_id)
    tornado.web.HTTPError: HTTP 404: Not Found (Kernel does not exist: a91a8444-2847-41bc-9e1d-dcffd3884337)

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/opt/conda/lib/python3.7/site-packages/tornado/web.py", line 1112, in send_error
        self.write_error(status_code, **kwargs)
      File "/opt/conda/lib/python3.7/site-packages/notebook/base/handlers.py", line 531, in write_error
        html = self.render_template('%s.html' % status_code, **ns)
      File "/opt/conda/lib/python3.7/site-packages/notebook/base/handlers.py", line 461, in render_template
        template = self.get_template(name)
      File "/opt/conda/lib/python3.7/site-packages/notebook/base/handlers.py", line 457, in get_template
        return self.settings['jinja2_env'].get_template(name)
    KeyError: 'jinja2_env'
[W 190724 23:56:26 web:2162] 404 GET /api/kernels/a91a8444-2847-41bc-9e1d-dcffd3884337/channels (10.233.101.97) 1.81ms
[D 2019-07-24 23:56:26.634 EnterpriseGatewayApp] Initializing websocket connection /api/kernels/a91a8444-2847-41bc-9e1d-dcffd3884337/channels
[W 2019-07-24 23:56:26.634 EnterpriseGatewayApp] No session ID specified
[W 190724 23:56:26 web:1667] 404 GET /api/kernels/a91a8444-2847-41bc-9e1d-dcffd3884337/channels (10.233.101.97): Kernel does not exist: a91a8444-2847-41bc-9e1d-dcffd3884337
[D 2019-07-24 23:56:26.635 EnterpriseGatewayApp] Using contents: services/contents
[E 190724 23:56:26 web:1114] Uncaught exception in write_error
    Traceback (most recent call last):
      File "/opt/conda/lib/python3.7/site-packages/tornado/web.py", line 1592, in _execute
        result = yield result
      File "/opt/conda/lib/python3.7/site-packages/tornado/gen.py", line 1133, in run
        value = future.result()
      File "/opt/conda/lib/python3.7/site-packages/tornado/gen.py", line 1141, in run
        yielded = self.gen.throw(*exc_info)
      File "/opt/conda/lib/python3.7/site-packages/notebook/services/kernels/handlers.py", line 241, in get
        yield super(ZMQChannelsHandler, self).get(kernel_id=kernel_id)
      File "/opt/conda/lib/python3.7/site-packages/tornado/gen.py", line 1133, in run
        value = future.result()
      File "/opt/conda/lib/python3.7/site-packages/tornado/gen.py", line 1141, in run
        yielded = self.gen.throw(*exc_info)
      File "/opt/conda/lib/python3.7/site-packages/notebook/base/zmqhandlers.py", line 283, in get
        yield gen.maybe_future(res)
      File "/opt/conda/lib/python3.7/site-packages/tornado/gen.py", line 1133, in run
        value = future.result()
      File "/opt/conda/lib/python3.7/site-packages/tornado/gen.py", line 1147, in run
        yielded = self.gen.send(value)
      File "/opt/conda/lib/python3.7/site-packages/notebook/services/kernels/handlers.py", line 223, in pre_get
        kernel = self.kernel_manager.get_kernel(self.kernel_id)
      File "/opt/conda/lib/python3.7/site-packages/jupyter_client/multikernelmanager.py", line 237, in get_kernel
        self._check_kernel_id(kernel_id)
      File "/opt/conda/lib/python3.7/site-packages/notebook/services/kernels/kernelmanager.py", line 387, in _check_kernel_id
        raise web.HTTPError(404, u'Kernel does not exist: %s' % kernel_id)
    tornado.web.HTTPError: HTTP 404: Not Found (Kernel does not exist: a91a8444-2847-41bc-9e1d-dcffd3884337)

kevin-bates commented 5 years ago

Hi Evan, might the missing kernel-id stem from it being culled? I suspect the client "has record" of the kernel, but EG has since cleaned it up.

This side affect from the previous change is a little concerning. Does the retry happen every 5-6 milliseconds if not backed off? I.e., was your EG log filled with these stack traces in mere seconds?

So if we back-off, is it only decreasing the frequency of this log dumping?

If nb2kg gets the 404, I wonder if we should use a couple failed retries to trigger cleanup of the orphaned id on the client?

esevan commented 5 years ago

@kevin-bates Hi! I suspected the culling event too, but it was not the case. After culling kernel, client accepted it and abandoned the kernel.

I think it happend in either kernel restarting logic or kernel session recovery logic because I observed EG was restarted before it happened and there were a few orphan kernels.

As for my PR, I set the back-off to decrease load to EG and also set the retry limit to cleanup after a couple of retries; As you mentioned, EG was fully filled with those stack trace in a second.

If the root causes were found, I could have posted another PR to handle 404 response or about exact exception handling. This was only option to fix quickly the problem - fault tolerance.

kevin-bates commented 5 years ago

@esevan - thank you for the additional information. It sure would be nice to understand when this occurs. That said, I think we should probably move forward with handling the bursting and I've submitted a review for the PR.

When this happened, what kinds of process proxies were in use - k8s, yarn, distributed?

esevan commented 5 years ago

@kevin-bates Thanks, I'll dig into this problem continually. I used k8s process proxy with session persistence configuration. FWIW, I modified EG for integration with other services in our env, it was an another suspect I guess.

kevin-bates commented 5 years ago

Thanks Evan - I'll keep on the lookout as well.

jupyter / nb2kg

Burst retrying websocket connection causes gateway down #44