magelisk opened this issue 1 day ago
Hey @magelisk, we have a MonoVertex type, which is a single processing entity having a source, transformer, and sink: https://numaflow.numaproj.io/core-concepts/monovertex/ Does that help you?
In my mind, this is almost the inverse of MonoVertex. MV is its own, self-contained thing replacing a pipeline, and it only responds when a message is sent into the system. Here, I'm looking for a standard map node in existing pipelines EXCEPT that the data goes in and "some time later" data comes out - even if no more data is coming in (perhaps ESPECIALLY if no data is coming in and it needs to 'close out' some processing, like ending a GPS route).
Summary
I propose a new type of vertex that is BOTH a udsink AND a source. This would allow data to go into a node and, some arbitrary time later, be emitted back out. I typically call this a vertex with "asynchronous map results", but I think that causes confusion with the sync vs async map gRPC implementation.
Using what exists today, this behavior would be implemented as a user-defined sink that hands data off to some external queue or store, paired with a separate user-defined source that reads the results back into the pipeline.
This is functional, but a bit awkward and inefficient as it introduces more buffer layers. If we had a single vertex that could implement both interfaces, then data could be emitted as it's available even if the original input message has "completed". Alternatively, if implemented as a map, a message could optionally be emitted synchronously to the normal flow.
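To make that concrete, the workaround today looks roughly like the pair of handlers below, one living in a udsink vertex and one in a udsource vertex, glued together by whatever external system you choose (the queue interface here is just an invented stand-in, not a real API):

```go
package workaround

import "context"

// externalQueue stands in for whatever system bridges the two vertices
// today (Kafka, Redis, NATS, a database table, ...). It is the extra
// buffer layer the proposal would remove.
type externalQueue interface {
	Publish(ctx context.Context, payload []byte) error
	Poll(ctx context.Context, max int) ([][]byte, error)
}

// sinkSide runs in a udsink vertex at the end of one pipeline segment:
// it absorbs messages and parks the (possibly much later) results on the queue.
func sinkSide(ctx context.Context, q externalQueue, in <-chan []byte) error {
	for payload := range in {
		// Application logic may hold state here and only publish once a
		// result is actually ready.
		if err := q.Publish(ctx, payload); err != nil {
			return err
		}
	}
	return nil
}

// sourceSide runs in a separate udsource vertex and feeds results back
// into the pipeline.
func sourceSide(ctx context.Context, q externalQueue, out chan<- []byte) error {
	msgs, err := q.Poll(ctx, 100)
	if err != nil {
		return err
	}
	for _, m := range msgs {
		out <- m
	}
	return nil
}
```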
The simplest implementation would literally be two sidecars for sink and source handling, plus a single main container that must implement both interfaces, each side talking to the appropriate sidecar. Done this way, I think the changes would be almost trivial at the CRD validation and controller level. SDK updates would help facilitate seamless integration, as I think some ports and server-info files might have to be deconflicted?
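As a rough sketch of that option from the user-container side (using invented stand-in types, not the real SDK interfaces), it would just be one implementation satisfying both contracts:

```go
package main

import (
	"context"
	"sync"
)

// Hypothetical stand-ins for the sink and source contracts.
type Datum struct{ Value []byte }
type Message struct{ Value []byte }

type Sinker interface {
	Sink(ctx context.Context, in <-chan Datum) error
}

type Sourcer interface {
	Read(ctx context.Context, out chan<- Message)
}

// asyncVertex implements BOTH: data accepted via Sink is held until the
// application decides it is "done", then emitted via Read.
type asyncVertex struct {
	mu      sync.Mutex
	pending []Message
}

func (v *asyncVertex) Sink(ctx context.Context, in <-chan Datum) error {
	for d := range in {
		// Application-specific state keeping would happen here
		// (fusion window, smoothing buffer, ...).
		v.mu.Lock()
		v.pending = append(v.pending, Message{Value: d.Value})
		v.mu.Unlock()
	}
	return nil
}

func (v *asyncVertex) Read(ctx context.Context, out chan<- Message) {
	v.mu.Lock()
	defer v.mu.Unlock()
	for _, m := range v.pending {
		out <- m
	}
	v.pending = nil
}

func main() {
	v := &asyncVertex{}
	// Compile-time check that one implementation satisfies both contracts.
	var _ Sinker = v
	var _ Sourcer = v
	// In this option, v would be registered with two separate gRPC servers
	// (the sink sidecar and the source sidecar), listening on deconflicted
	// ports / unix sockets.
}
```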
A slightly more streamlined solution would be a new gRPC service that explicitly handles both interfaces, with a single numa container receiving data similar to how a source does. I suspect this is more complicated, since it would duplicate various sink and source behaviors in the core logic?
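If that route were taken, the user-facing contract might collapse into a single combined interface, something like the hypothetical sketch below (all names invented for illustration; this is not an existing Numaflow API):

```go
package sourcesink

import "context"

// Datum is an incoming message delivered to the vertex (sink side).
// Message is an outgoing message emitted by the vertex (source side).
type Datum struct{ Value []byte }
type Message struct{ Value []byte }

// SourceSinker is a hypothetical combined contract: one service that both
// consumes from the upstream buffer and produces into the downstream one,
// with no coupling between the two calls.
type SourceSinker interface {
	// Sink receives data from upstream; it may ack immediately even if
	// nothing is emitted yet.
	Sink(ctx context.Context, in <-chan Datum) error
	// Read is polled to drain whatever the application has decided to
	// emit, possibly long after the triggering input was acked.
	Read(ctx context.Context, out chan<- Message) error
	// Ack confirms delivery of previously Read messages.
	Ack(ctx context.Context, offsets []string) error
	// Pending reports backlog for autoscaling, as sources do today.
	Pending(ctx context.Context) (int64, error)
}
```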
In both situations, I'm indifferent to whether a source transformer would be supported, but I suspect doing so would be good for consistency across concepts?
Use Cases
One example
Data triangulation. If multiple different sensors detect some event, each can feed its data in, and a fusion algorithm can be configured to wait X seconds for as many sources to come in as possible. Once it gets enough reports or times out, it outputs a message with information about the event, which is then made available back to the pipeline via the source interface. This makes for a much cleaner pipeline and avoids the extra buffering layers of the sink-plus-source workaround.
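Purely as an illustration (no Numaflow types, everything here is made up), the user logic hosted by such a vertex might be a per-event buffer that flushes on a count threshold or a timeout:

```go
package fusion

import (
	"sync"
	"time"
)

type Report struct {
	EventID string
	Payload []byte
}

// Fuser collects reports per event and flushes either when enough sensors
// have reported or when the wait window expires - even if no new input
// arrives, which is exactly what a plain map vertex cannot do today.
type Fuser struct {
	mu      sync.Mutex
	pending map[string][]Report
	need    int           // flush once this many reports arrive...
	wait    time.Duration // ...or once this much time has passed
	emit    func(eventID string, reports []Report)
}

func NewFuser(need int, wait time.Duration, emit func(string, []Report)) *Fuser {
	return &Fuser{pending: map[string][]Report{}, need: need, wait: wait, emit: emit}
}

func (f *Fuser) Add(r Report) {
	f.mu.Lock()
	defer f.mu.Unlock()
	if _, ok := f.pending[r.EventID]; !ok {
		// First report for this event: start the timeout clock.
		time.AfterFunc(f.wait, func() { f.flush(r.EventID) })
	}
	f.pending[r.EventID] = append(f.pending[r.EventID], r)
	if len(f.pending[r.EventID]) >= f.need {
		go f.flush(r.EventID)
	}
}

func (f *Fuser) flush(eventID string) {
	f.mu.Lock()
	reports := f.pending[eventID]
	delete(f.pending, eventID)
	f.mu.Unlock()
	if len(reports) > 0 {
		f.emit(eventID, reports)
	}
}
```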
Another example is data smoothing. Sets of points come in from a sensor like GPS, which can bounce a bit due to obstructions and general imprecision, and the algorithm outputs smoothed curves of the paths. To do this, it helps to have a few historical and a few 'future' points for a given measurement, so the algorithm would keep a small history and only output a point after the next 5 have been received. This introduces an expected delay, but the less jumpy data is generally more beneficial than quicker-but-wrong results.
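Again with invented types only, the smoothing case boils down to a small delay buffer that holds each point back until enough 'future' points have arrived (a real implementation would also keep some past history):

```go
package smoothing

// Point is one raw GPS fix.
type Point struct{ Lat, Lon float64 }

// DelaySmoother holds back each point until `lookahead` later points have
// arrived, then emits a smoothed value computed over the buffered window.
// Output therefore lags input by a few points - delayed, but less jumpy.
type DelaySmoother struct {
	lookahead int
	window    []Point
}

func NewDelaySmoother(lookahead int) *DelaySmoother {
	return &DelaySmoother{lookahead: lookahead}
}

// Add ingests one point and returns any point that is now "ready"
// (i.e. has `lookahead` points after it), smoothed by a simple moving average.
func (s *DelaySmoother) Add(p Point) (Point, bool) {
	s.window = append(s.window, p)
	if len(s.window) <= s.lookahead {
		return Point{}, false // still waiting for future points
	}
	// Average everything currently buffered around the oldest point.
	var lat, lon float64
	for _, q := range s.window {
		lat += q.Lat
		lon += q.Lon
	}
	n := float64(len(s.window))
	out := Point{Lat: lat / n, Lon: lon / n}
	s.window = s.window[1:] // oldest point has now been emitted
	return out, true
}
```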
Quirks
This has some inherent awkwardness in terms of tracing a message through the system, as this vertex would break the normal chain of custody. In my opinion this is an issue for users to resolve, as I don't think Numaflow explicitly cares.
Similarly, this all implies that the application is probably maintaining some level of state to do this. That's typically a bad pattern, but in this case it's a necessary risk, and again I think the onus is on the implementer to deal with it.
Message from the maintainers:
If you wish to see this enhancement implemented please add a 👍 reaction to this issue! We often sort issues this way to know what to prioritize.