tenzir / public-roadmap

The public roadmap of Tenzir
https://docs.tenzir.com/roadmap

Forked Pipelines #131

Open dominiklohmann opened 1 year ago

dominiklohmann commented 1 year ago

Pipelines currently always run in the same process as the node when run through the API. This is a reliability problem, because a pipeline running out of memory also causes the node to go down, and with it all other pipelines.

This item is about containing that risk by running each pipeline in a child process instead, so that a crashing pipeline no longer takes the node down with it.
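To make the intended isolation concrete, here is a minimal sketch in Python (not Tenzir's implementation). It assumes a POSIX system and uses a child Python interpreter as a stand-in for a pipeline; `run_isolated` and `describe` are hypothetical names chosen for illustration.

```python
import signal
import subprocess
import sys


def run_isolated(pipeline_code: str) -> int:
    """Run a stand-in pipeline in a child interpreter and return its exit status."""
    # The child has its own address space, so a crash or OOM there
    # cannot take the parent (the "node" in this sketch) down with it.
    result = subprocess.run([sys.executable, "-c", pipeline_code])
    return result.returncode


def describe(returncode: int) -> str:
    # On POSIX, subprocess reports "killed by signal N" as the negative value -N.
    if returncode < 0:
        return f"killed by {signal.Signals(-returncode).name}"
    return f"exited with status {returncode}"


if __name__ == "__main__":
    # A child that aborts, similar to a crash inside a third-party library.
    print(describe(run_isolated("import os; os.abort()")))
    # A child that finishes normally.
    print(describe(run_isolated("pass")))
    print("parent is still running")
```

The parent only ever sees an exit status, which is the property this issue is after: the failure is reported rather than shared.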

### Stories
- [ ] https://github.com/tenzir/issues/issues/1547
- [ ] https://github.com/tenzir/issues/issues/745

lava commented 1 year ago

I think one big thing to keep in mind here is that this turns the tenzir process essentially into a process manager. So we have to be careful about keeping track of all these forked processes we spawn, and how to clean them up again on shutdown.
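As a rough illustration of the bookkeeping this implies, the following Python sketch keeps a registry of spawned children, reaps the ones that have exited, and cleans up the rest on shutdown. It assumes a POSIX system; `ChildRegistry` and its methods are hypothetical names, and the escalation from SIGTERM to SIGKILL is one possible policy, not something decided in this thread.

```python
import subprocess
import sys


class ChildRegistry:
    """Tracks every child we spawn so that none is forgotten on shutdown."""

    def __init__(self):
        self._children = {}  # pid -> subprocess.Popen

    def spawn(self, argv):
        proc = subprocess.Popen(argv)
        self._children[proc.pid] = proc
        return proc.pid

    def reap(self):
        """Collect children that already exited; returns {pid: returncode}."""
        finished = {}
        for pid, proc in list(self._children.items()):
            code = proc.poll()  # non-blocking check, also reaps the zombie
            if code is not None:
                finished[pid] = code
                del self._children[pid]
        return finished

    def shutdown(self, grace_seconds=5.0):
        """Ask all children to stop, then force-kill any that do not comply."""
        for proc in self._children.values():
            proc.terminate()  # sends SIGTERM
        codes = {}
        for pid, proc in self._children.items():
            try:
                # Each child gets up to grace_seconds to exit on its own.
                codes[pid] = proc.wait(timeout=grace_seconds)
            except subprocess.TimeoutExpired:
                proc.kill()  # sends SIGKILL
                codes[pid] = proc.wait()
        self._children.clear()
        return codes


if __name__ == "__main__":
    registry = ChildRegistry()
    sleeper = [sys.executable, "-c", "import time; time.sleep(60)"]
    registry.spawn(sleeper)
    registry.spawn(sleeper)
    print(registry.shutdown(grace_seconds=5.0))
```

This sketch does not cover the case where the parent itself dies unexpectedly; the children would then be orphaned, which is one of the gaps a supervisor design has to close.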

mavam commented 1 year ago

Our friends at Zeek put a lot of energy into getting process supervision right. It may make sense to study the supervisor framework at a very high level to avoid re-experiencing all the weird POSIX gotchas.

lava commented 1 year ago

We should also take a close look at a deeper cgroup integration of the node. After all, this kind of scenario (i.e., bounding the memory/CPU usage of a group of processes and not losing track of them) is what cgroups were invented for.
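For reference, a minimal sketch of what that could look like with the cgroup v2 filesystem interface, written in Python. It assumes a unified cgroup v2 hierarchy (usually mounted at `/sys/fs/cgroup`), the memory controller enabled for the parent group, and permission to write there. The function name `confine` and the group name `tenzir-pipelines` are hypothetical; `memory.max` and `cgroup.procs` are the standard cgroup v2 interface files.

```python
import tempfile
from pathlib import Path


def confine(cgroup_root: Path, group: str, pid: int, memory_max_bytes: int) -> Path:
    """Create a cgroup, cap its memory, and move a process into it."""
    group_dir = cgroup_root / group
    # On a real cgroup filesystem, creating the directory creates the cgroup.
    group_dir.mkdir(parents=True, exist_ok=True)
    # Hard memory limit for all processes in the group combined.
    (group_dir / "memory.max").write_text(f"{memory_max_bytes}\n")
    # Writing a PID here moves that process into the group; its future
    # children start in the same group, so they are not lost track of.
    (group_dir / "cgroup.procs").write_text(f"{pid}\n")
    return group_dir


if __name__ == "__main__":
    # Dry run against a temporary directory instead of /sys/fs/cgroup,
    # so the sketch can be tried without privileges.
    with tempfile.TemporaryDirectory() as root:
        path = confine(Path(root), "tenzir-pipelines", pid=12345,
                       memory_max_bytes=512 * 1024 * 1024)
        print((path / "memory.max").read_text().strip())
```

Pointing `cgroup_root` at the real hierarchy would turn the dry run into an actual limit, provided the assumptions above hold on the host.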

dominiklohmann commented 1 year ago

This came up again today when preparing for a demo. It's a real bummer to have a node crash because of a bug in a third-party library used in a connector. That sometimes is just out of our control.

lava commented 1 year ago

On Wednesday we had a discussion round together with @dominiklohmann and @jachris.

We agreed on the high-level outlines of this feature:

Related documents:
- https://docs.google.com/document/d/1b-zpDp796fRr1FPpObCkia2Dyuh8IEF-XaBmv5lvszs/edit#heading=h.um2utrvlnup8
- https://app.excalidraw.com/s/6dBWEFf9h1l/8J1RozwXFXV

tobim commented 11 months ago

I attempted to write a proof of concept for this last night / this morning and got to a point where I can run a pipeline in a forked tenzir-node process as a whole. The core behavior change of reducing the blast radius of a crashing pipeline would be fulfilled by that. But the majority of the work is yet to be done. Notably: