Discussion: run data streaming outside notebook environment

Tom-Willemsen commented 2 years ago

We are running into an increasing number of issues/complications with running the data streaming code within a notebook environment, for example:

- Needing to use asyncio to avoid blocking the notebook entirely
- Leaking threads/asyncio tasks when re-running Jupyter cells
- Re-running the cell containing the plot can raise various exceptions if it happens while one of the data streaming tasks is updating the plot (race condition)
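
To illustrate the task-leak point, here is a minimal sketch of one possible mitigation: keep a registry outside the cell and cancel the previous task before starting a new one. The names `_running`, `fake_stream` and `start_stream` are hypothetical and not taken from the project's code; `demo` only simulates a cell being run twice.

```python
import asyncio

# Hypothetical registry that outlives a cell re-run (e.g. kept in a module).
_running = {}


async def fake_stream():
    # Stand-in for the real stream listener: loops until cancelled.
    while True:
        await asyncio.sleep(0.01)


def start_stream(name):
    """Start a listener task, cancelling any earlier task with the same name."""
    old = _running.get(name)
    if old is not None and not old.done():
        # Without this, every re-run of the cell leaves another task alive.
        old.cancel()
    task = asyncio.create_task(fake_stream())
    _running[name] = task
    return task


async def demo():
    first = start_stream("plot")
    await asyncio.sleep(0.03)
    second = start_stream("plot")  # simulates re-running the cell
    await asyncio.sleep(0.03)
    states = (first.cancelled(), second.done())
    second.cancel()  # tidy up before the loop closes
    return states
```

This only addresses the leak, not the race with a plot that is being updated, and it relies on the user always going through `start_stream`.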

There are also potential complications with having data streaming in a notebook from a data analysis perspective, where we would want to re-stream reduced data for consumption by analysis programs. We feel that having this in a notebook is error-prone if a user changes the notebook mid-stream.

While these issues may all be fixable, it feels like we are not using notebooks "as intended" here and therefore are exposing ourselves to more bugs/complications than necessary.

I think we should discuss alternative approaches for running the data streaming infrastructure outside a notebook environment, for example running the stream listener as a standalone background Python task and showing the data streaming plot as a matplotlib plot outside of a notebook.
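
As a rough sketch of what a standalone entry point could look like (an assumption, not the project's implementation): the process owns its event loop, so blocking at the top level is fine and everything ends when the process ends. `listen`, `consume` and `main` are hypothetical names; the listener is a stand-in that emits integers, and the consumer would be where a plot gets updated.

```python
import asyncio


async def listen(queue, n_messages):
    # Stand-in for the stream listener: pushes a few fake messages.
    for i in range(n_messages):
        await queue.put(i)
        await asyncio.sleep(0)
    await queue.put(None)  # sentinel: stream finished


async def consume(queue, handle):
    # Stand-in for the plot updater: handles messages until the sentinel.
    while True:
        item = await queue.get()
        if item is None:
            break
        handle(item)


async def main(n_messages=5):
    queue = asyncio.Queue()
    seen = []
    await asyncio.gather(listen(queue, n_messages), consume(queue, seen.append))
    return seen


if __name__ == "__main__":
    # Blocking here is harmless: there is no notebook to keep responsive.
    print(asyncio.run(main()))
```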

SimonHeybrock commented 2 years ago

Is the key difficulty here that we cannot use a context manager in a notebook in a convenient manner?

SimonHeybrock commented 2 years ago

> I think we should discuss alternative approaches for running the data streaming infrastructure outside a notebook environment, for example running the stream listener as a standalone background python task, and having the data streaming plot be a matplotlib plot outside of a notebook environment.

Is there anything in the current implementation that prevents this? That is, is it either-or?

Tom-Willemsen commented 2 years ago

> Is the key difficulty here that we cannot use a context manager in a notebook in a convenient manner?

I think a context manager would only help if the code inside the context manager was blocking? If it's an asyncio call then it wouldn't help as we still wouldn't have a way to close the old threads/asyncio tasks when the cell gets re-run.

I feel that running blocking code in the notebook is not the right approach. Even if the issues with plot interactivity could be fixed, there are other issues when trying to re-run cells containing blocking code (you need to explicitly interrupt the interpreter, wait for it to time out, then re-run the cell).

> Is there anything in the current implementation that prevents this? That is, is it either-or?

I think there's probably not anything specific in the implementation that prevents this, beyond the need to produce, test and document an "alternative" approach (if that's what we decide we want).

I'm not currently convinced that the overhead of maintaining both solutions would be worth it, but happy to have my opinion changed on this - what do you see as the advantage of the current solution which we couldn't reproduce in some alternative (e.g. standalone) solution? I guess maybe scientist familiarity with the notebook environment?

SimonHeybrock commented 2 years ago

> If it's an asyncio call then it wouldn't help as we still wouldn't have a way to close the old threads/asyncio tasks when the cell gets re-run.

Wouldn't the context manager's __exit__ take care of that?

> what do you see as the advantage of the current solution

I do not know enough about the current state... is there an implemented solution, apart from just something that shows how this is possible in a notebook?

> which we couldn't reproduce in some alternative (e.g. standalone) solution?

Not having to write a custom application. But I do not have enough information to tell whether this is really simpler with Jupyter plus, e.g., Voila.

nvaytet commented 2 years ago

One thing that currently only works in a notebook is the instrument view, so if live streaming into an instrument view is a must-have (I don't know if it is, maybe it's not the most useful visualization to have), then we still have to do things in a notebook, or at least Voila.

Tom-Willemsen commented 2 years ago

> Wouldn't the context manager's __exit__ take care of that?

I'm not sure how this helps. If the task is blocking, then the context manager's __exit__ will never be called, as I believe the task gets forcibly terminated by Jupyter if it's still running after a timeout. If the task is non-blocking, then the __exit__ would be called immediately?
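
A small sketch of the non-blocking case described above, assuming a hypothetical `StreamSession` context manager that starts one listener task. Because the body of the `with` block returns straight away, `__exit__` runs, and cancels the task, before any streaming has happened.

```python
import asyncio


class StreamSession:
    """Hypothetical context manager that owns one listener task."""

    def __init__(self, log):
        self.log = log
        self.task = None

    async def _listen(self):
        # Stand-in for the real listener: loops until cancelled.
        while True:
            await asyncio.sleep(0.01)

    def __enter__(self):
        self.task = asyncio.create_task(self._listen())
        self.log.append("enter")
        return self

    def __exit__(self, *exc_info):
        self.task.cancel()
        self.log.append("exit")
        return False


async def cell():
    # Simulates a notebook cell that starts streaming without blocking.
    log = []
    with StreamSession(log):
        log.append("body returns immediately")
    log.append("after")
    return log
```

The recorded order is enter, body, exit, after: the session is already closed by the time the cell finishes, so the context manager does not keep the stream alive between cell runs.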

There may be some hook in Jupyter/IPython where we can listen for a "stop" event, but I haven't found one yet...
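
One candidate that might be worth evaluating is IPython's event callback system, which lets code register a function for the `pre_run_cell` event via `get_ipython().events.register(...)`. The sketch below is an assumption about how that could be used, not something tested against this project; `register_cleanup` and `cleanup` are hypothetical names. Note that it is not a true "stop" event: the callback fires before every cell execution, so it would need extra logic to act only on re-runs of the streaming cell.

```python
def register_cleanup(ip, cleanup):
    """Ask the IPython shell `ip` to call cleanup() before each cell runs.

    `ip` is expected to be the object returned by get_ipython() in a notebook.
    """

    def _pre_run_cell(*args):
        # Recent IPython versions pass an execution-info object; it is
        # ignored here because the sketch cleans up unconditionally.
        cleanup()

    ip.events.register("pre_run_cell", _pre_run_cell)
    return _pre_run_cell
```

In a notebook this might be called once as `register_cleanup(get_ipython(), cancel_all_streams)`, where `cancel_all_streams` is a placeholder for whatever stops the running tasks.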

> I do not know enough about the current state... is there an implemented solution, apart from just something that shows how this is possible in a notebook?

There is some code that displays data-streaming specific widgets in a notebook environment, for example. Parts or all of that might need to be rewritten if we decided to use a different solution. The plotting code should in principle be runnable outside a notebook, but is likely to need tweaking as it's only ever been tested in notebooks. But other than that, I'd say most of the underlying code is independent of running in a notebook or not.

SimonHeybrock commented 2 years ago

Partially related to this discussion, the consensus is that the current requirements for data streaming are too fuzzy and maybe too ambitious. This appears to be stopping us from making actual progress. Therefore:

- Aim for a minimal working but useful solution.
  - Must get away from the "we have a working prototype" situation as soon as possible.
- Should be trivial to launch, e.g., based on config file, without complicated setup.
  - Integration tests to ensure this keeps working over the coming years.
- Minimal features:
  - Small number of pre-configured live-updating plots (instrument view, normal plots).

scipp/beamlime, issue #42