jupyter-server / jupyter_server

The backend—i.e. core services, APIs, and REST endpoints—to Jupyter web applications.
https://jupyter-server.readthedocs.io
BSD 3-Clause "New" or "Revised" License
465 stars 279 forks source link

Improve the busy/idle execution state tracking for kernels. #1429

Open ojarjur opened 3 weeks ago

ojarjur commented 3 weeks ago

This PR makes two adjustments/enhancements for the way kernel execution state is tracked:

  1. Distinguish between "busy"/"idle" status messages with different parent IDs.

    A "busy" or "idle" status message is not sent in isolation, but rather in response to a specific parent message, and the kernel might still be busy with one parent message even after reporting that it is done (e.g. has an "idle" status) with a different parent message.

    Accordingly, the overall kernel execution state is only switched to "idle" once all previously seen "busy" status messages have had a corresponding "idle" message sent with the same parent message ID.

  2. Distinguish between different types of parent message when determining whether or not the overall kernel is "busy".

    Not every message type corresponds to a user action, so not all of them should be included in the logic for determining that a kernel is "busy". This is especially true for the case of kernel culling where the distinction between "idle" and "busy" is used to decide whether or not to cull the kernels.

    Since there might be different setups where the server admin might want to include/exclude different types of messages from this calculation, the set of message types used for status tracking is configurable.

Fixes #1360

ojarjur commented 3 weeks ago

It looks like the newly added test is flaky and I was able to reproduce that on my local machine; 6 out of 100 runs failed with the same error.

I'll try to eliminate the flakyness and then update this PR

ojarjur commented 3 weeks ago

It looks like the newly added test is flaky and I was able to reproduce that on my local machine; 6 out of 100 runs failed with the same error.

I'll try to eliminate the flakyness and then update this PR

Fixed now; I ran the test 100 times locally and it passed every time.

ojarjur commented 3 weeks ago

It looks like the newly added test is flaky and I was able to reproduce that on my local machine; 6 out of 100 runs failed with the same error. I'll try to eliminate the flakyness and then update this PR

Fixed now; I ran the test 100 times locally and it passed every time.

I think there's a second race condition contributing to the test flakiness; the second one isn't an issue in the test but rather a preexisting race condition in the code that the test uncovered.

Specifically, this code can wind up being run after this code if the kernel starts up quickly enough...

That, in turn, can result in a correctly set "busy" and/or "idle" kernel execution state being overwritten with an erroneous state of "starting".

I'm not sure if the kernel execution state should be set at all in the _async_start_kernel method (it seems to only matter if the kernel manager's ready method raises an exception), but I'm quite certain that if it is set then it must be set before the _finish_kernel_start method is invoked.