magelisk opened this issue 1 day ago
Hey @magelisk, we have a MonoVertex type, which is a single processing entity having a source, transformer, and sink: https://numaflow.numaproj.io/core-concepts/monovertex/ Does that help you?
In my mind, this is almost the inverse of MonoVertex. MV is its own, self-contained thing replacing a pipeline, and it only responds when a message is sent into the system. Here, I'm looking for a standard map node in existing pipelines EXCEPT that the data goes in and "some time later" data comes out - even if no more data is coming in (perhaps ESPECIALLY if no data is coming in and it needs to 'close out' some processing, like ending a GPS route).
Summary
I propose a new type of vertex that is BOTH a udsink AND a source. This would allow data to go into a node and, some arbitrary time later, be emitted back out. I typically call this a vertex with "asynchronous map results", but I think that causes confusion with the sync vs async map gRPC implementation.
Using what exists today, this behavior would be implemented as a user-defined sink that hands data off to some external queue or store, paired with a separate user-defined source that reads the results back into the pipeline.
This is functional, but a bit awkward and inefficient as it introduces more buffer layers. If we had a single vertex that could implement both interfaces, then data could be emitted as it's available even if the original input message has "completed". Alternatively, if implemented as a map, a message could optionally be emitted synchronously to the normal flow.
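To make that concrete, the workaround today looks roughly like the pair of handlers below, one living in a udsink vertex and one in a udsource vertex, glued together by whatever external system you choose (the queue interface here is just an invented stand-in, not a real API):

```go
package workaround

import "context"

// externalQueue stands in for whatever system bridges the two vertices
// today (Kafka, Redis, NATS, a database table, ...). It is the extra
// buffer layer the proposal would remove.
type externalQueue interface {
	Publish(ctx context.Context, payload []byte) error
	Poll(ctx context.Context, max int) ([][]byte, error)
}

// sinkSide runs in a udsink vertex at the end of one pipeline segment:
// it absorbs messages and parks the (possibly much later) results on the queue.
func sinkSide(ctx context.Context, q externalQueue, in <-chan []byte) error {
	for payload := range in {
		// Application logic may hold state here and only publish once a
		// result is actually ready.
		if err := q.Publish(ctx, payload); err != nil {
			return err
		}
	}
	return nil
}

// sourceSide runs in a separate udsource vertex and feeds results back
// into the pipeline.
func sourceSide(ctx context.Context, q externalQueue, out chan<- []byte) error {
	msgs, err := q.Poll(ctx, 100)
	if err != nil {
		return err
	}
	for _, m := range msgs {
		out <- m
	}
	return nil
}
```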
The simplest implementation would literally be two sidecars for sink and source handling, plus a single main container that must implement both interfaces, each side talking to the appropriate sidecar. Done this way, I think the changes would be almost trivial at the CRD validation and controller level. SDK updates would help facilitate seamless integration, as I think some ports and server-info files might have to be deconflicted?
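As a rough sketch of that option from the user-container side (using invented stand-in types, not the real SDK interfaces), it would just be one implementation satisfying both contracts:

```go
package main

import (
	"context"
	"sync"
)

// Hypothetical stand-ins for the sink and source contracts.
type Datum struct{ Value []byte }
type Message struct{ Value []byte }

type Sinker interface {
	Sink(ctx context.Context, in <-chan Datum) error
}

type Sourcer interface {
	Read(ctx context.Context, out chan<- Message)
}

// asyncVertex implements BOTH: data accepted via Sink is held until the
// application decides it is "done", then emitted via Read.
type asyncVertex struct {
	mu      sync.Mutex
	pending []Message
}

func (v *asyncVertex) Sink(ctx context.Context, in <-chan Datum) error {
	for d := range in {
		// Application-specific state keeping would happen here
		// (fusion window, smoothing buffer, ...).
		v.mu.Lock()
		v.pending = append(v.pending, Message{Value: d.Value})
		v.mu.Unlock()
	}
	return nil
}

func (v *asyncVertex) Read(ctx context.Context, out chan<- Message) {
	v.mu.Lock()
	defer v.mu.Unlock()
	for _, m := range v.pending {
		out <- m
	}
	v.pending = nil
}

func main() {
	v := &asyncVertex{}
	// Compile-time check that one implementation satisfies both contracts.
	var _ Sinker = v
	var _ Sourcer = v
	// In this option, v would be registered with two separate gRPC servers
	// (the sink sidecar and the source sidecar), listening on deconflicted
	// ports / unix sockets.
}
```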
A slightly more streamlined solution would be a new gRPC service that explicitly handles both interfaces, with a single numa container receiving data similar to how a source does. I suspect this is more complicated, since it would duplicate various sink and source behaviors in the core logic?
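If that route were taken, the user-facing contract might collapse into a single combined interface, something like the hypothetical sketch below (all names invented for illustration; this is not an existing Numaflow API):

```go
package sourcesink

import "context"

// Datum is an incoming message delivered to the vertex (sink side).
// Message is an outgoing message emitted by the vertex (source side).
type Datum struct{ Value []byte }
type Message struct{ Value []byte }

// SourceSinker is a hypothetical combined contract: one service that both
// consumes from the upstream buffer and produces into the downstream one,
// with no coupling between the two calls.
type SourceSinker interface {
	// Sink receives data from upstream; it may ack immediately even if
	// nothing is emitted yet.
	Sink(ctx context.Context, in <-chan Datum) error
	// Read is polled to drain whatever the application has decided to
	// emit, possibly long after the triggering input was acked.
	Read(ctx context.Context, out chan<- Message) error
	// Ack confirms delivery of previously Read messages.
	Ack(ctx context.Context, offsets []string) error
	// Pending reports backlog for autoscaling, as sources do today.
	Pending(ctx context.Context) (int64, error)
}
```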
In both situations, I'm indifferent to whether a source transformer would be supported, but I suspect doing so would be good for consistency across concepts?
Use Cases
One example
Data triangulation. If multiple different sensors detect some event, each can feed its data in, and a fusion algorithm can be configured to wait X seconds for as many sources to come in as possible. Once it gets enough reports or times out, it outputs a message with information about the event, which is then made available back to the pipeline via the source interface. This makes for a much cleaner pipeline and avoids the extra buffering layers of the sink-plus-source workaround.
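Purely as an illustration (no Numaflow types, everything here is made up), the user logic hosted by such a vertex might be a per-event buffer that flushes on a count threshold or a timeout:

```go
package fusion

import (
	"sync"
	"time"
)

type Report struct {
	EventID string
	Payload []byte
}

// Fuser collects reports per event and flushes either when enough sensors
// have reported or when the wait window expires - even if no new input
// arrives, which is exactly what a plain map vertex cannot do today.
type Fuser struct {
	mu      sync.Mutex
	pending map[string][]Report
	need    int           // flush once this many reports arrive...
	wait    time.Duration // ...or once this much time has passed
	emit    func(eventID string, reports []Report)
}

func NewFuser(need int, wait time.Duration, emit func(string, []Report)) *Fuser {
	return &Fuser{pending: map[string][]Report{}, need: need, wait: wait, emit: emit}
}

func (f *Fuser) Add(r Report) {
	f.mu.Lock()
	defer f.mu.Unlock()
	if _, ok := f.pending[r.EventID]; !ok {
		// First report for this event: start the timeout clock.
		time.AfterFunc(f.wait, func() { f.flush(r.EventID) })
	}
	f.pending[r.EventID] = append(f.pending[r.EventID], r)
	if len(f.pending[r.EventID]) >= f.need {
		go f.flush(r.EventID)
	}
}

func (f *Fuser) flush(eventID string) {
	f.mu.Lock()
	reports := f.pending[eventID]
	delete(f.pending, eventID)
	f.mu.Unlock()
	if len(reports) > 0 {
		f.emit(eventID, reports)
	}
}
```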
Another example is data smoothing. Sets of points come in from a sensor like GPS, which can bounce a bit due to obstructions and general imprecision, and the algorithm outputs smoothed curves of the paths. To do this, it helps to have a few historical and a few 'future' points for a given measurement, so the algorithm would keep a small history and only output a point after the next 5 have been received. This introduces an expected delay, but the less jumpy data is generally more beneficial than quicker-but-wrong results.
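Again with invented types only, the smoothing case boils down to a small delay buffer that holds each point back until enough 'future' points have arrived (a real implementation would also keep some past history):

```go
package smoothing

// Point is one raw GPS fix.
type Point struct{ Lat, Lon float64 }

// DelaySmoother holds back each point until `lookahead` later points have
// arrived, then emits a smoothed value computed over the buffered window.
// Output therefore lags input by a few points - delayed, but less jumpy.
type DelaySmoother struct {
	lookahead int
	window    []Point
}

func NewDelaySmoother(lookahead int) *DelaySmoother {
	return &DelaySmoother{lookahead: lookahead}
}

// Add ingests one point and returns any point that is now "ready"
// (i.e. has `lookahead` points after it), smoothed by a simple moving average.
func (s *DelaySmoother) Add(p Point) (Point, bool) {
	s.window = append(s.window, p)
	if len(s.window) <= s.lookahead {
		return Point{}, false // still waiting for future points
	}
	// Average everything currently buffered around the oldest point.
	var lat, lon float64
	for _, q := range s.window {
		lat += q.Lat
		lon += q.Lon
	}
	n := float64(len(s.window))
	out := Point{Lat: lat / n, Lon: lon / n}
	s.window = s.window[1:] // oldest point has now been emitted
	return out, true
}
```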
Quirks
This has some inherent awkwardness in terms of tracing a message through the system, as this vertex would break the normal chain of custody. In my opinion this is an issue for users to resolve, as I don't think Numaflow explicitly cares.
Similarly, this all implies that the application is probably maintaining some level of state to do this. That's typically a bad pattern, but in this case it's a necessary risk, and again I think the onus is on the implementer to deal with it.
Message from the maintainers:
If you wish to see this enhancement implemented please add a 👍 reaction to this issue! We often sort issues this way to know what to prioritize.