numaproj / numaflow

Kubernetes-native platform to run massively parallel data/streaming jobs
https://numaflow.numaproj.io
Apache License 2.0
1.29k stars, 116 forks

Vertex for Delayed Map Results #2235

Open magelisk opened 1 day ago

magelisk commented 1 day ago

Summary

I propose a new type of vertex that is BOTH a udsink AND a source. This would allow data to go into a node and, some arbitrary time later, be emitted back out. I typically call this a vertex with "asynchronous map results", but I think that causes confusion with the sync vs. async map gRPC implementation.

Using what exists today, this behavior would be implemented as

This is functional, but a bit awkward and inefficient because it introduces more buffer layers. If a single vertex could implement both interfaces, data could be emitted as soon as it's available, even if the original input message has already "completed". Alternatively, if implemented as a map, a message could optionally be emitted synchronously into the normal flow.

In the simplest implementation, this would literally be two sidecars, one for sink handling and one for source handling, plus a single main container that must implement both interfaces, with each interface talking to the appropriate sidecar. Done this way, I think the implementation would be almost trivial at the CRD validation and controller level. SDK updates would help with seamless integration, as I think some ports and server-info might have to be deconflicted?
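As a rough illustration of the "single main container implementing both interfaces" idea, here is a minimal Python sketch. It does not use the Numaflow SDK: `DelayedMapVertex`, `Message`, and the methods `sink`, `emit`, `read`, `ack`, and `pending` are hypothetical names that only mirror the general shape of a sink handler and a source handler sharing state in one process.

```python
import threading
from collections import deque
from dataclasses import dataclass


@dataclass
class Message:
    key: str
    value: bytes


class DelayedMapVertex:
    """Hypothetical vertex: sink-like on the way in, source-like on the way out."""

    def __init__(self):
        # Both sidecar-facing servers would call into this object concurrently.
        self._lock = threading.Lock()
        self._received = []      # inputs accepted by the sink side
        self._ready = deque()    # (offset, message) pairs waiting to be read
        self._inflight = {}      # offset -> message that was read but not yet acked
        self._next_offset = 0

    # Sink-facing side.
    def sink(self, messages):
        """Accept a batch; from the pipeline's view the inputs are done once this returns."""
        with self._lock:
            self._received.extend(messages)
        return [True] * len(messages)

    # Application logic calls this whenever a result becomes available.
    def emit(self, message):
        with self._lock:
            self._ready.append((self._next_offset, message))
            self._next_offset += 1

    # Source-facing side.
    def read(self, max_count):
        """Hand out up to max_count ready results and track them until acked."""
        out = []
        with self._lock:
            while self._ready and len(out) < max_count:
                offset, message = self._ready.popleft()
                self._inflight[offset] = message
                out.append((offset, message))
        return out

    def ack(self, offsets):
        with self._lock:
            for offset in offsets:
                self._inflight.pop(offset, None)

    def pending(self):
        with self._lock:
            return len(self._ready)
```

The point of the sketch is only that the sink side can return immediately while output appears later via `emit`, independent of any particular input message.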

A slightly more streamlined solution would be a new gRPC service that explicitly handles both interfaces, plus a single numa container that handles receiving data similar to how a source does. I suspect this is more complicated, with various duplications of sink and source behavior in the core logic?

In both situations, I'm indifferent to whether a source transformer would be supported, but I suspect doing so would be good for consistency across concepts?

Use Cases

One example

Quirks

This has some inherent awkwardness in terms of tracing a message through the system, as this vertex would break the normal chain of custody. In my opinion this is an issue for users to resolve, as I don't think numaflow explicitly cares.

Similarly, this all implies that the application is probably maintaining some level of state to do this. That is typically a bad pattern, but in this case it's a necessary risk, and again I think the onus is on the implementer to deal with it.
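To make the two quirks above concrete, here is a small sketch of how an implementer might keep that state while still being able to tie a delayed output back to its inputs. Everything here is an assumption for illustration: `TraceableState`, `Pending`, and the `origin_ids` field are hypothetical and not part of numaflow.

```python
from dataclasses import dataclass, field


@dataclass
class Pending:
    origin_ids: list = field(default_factory=list)  # IDs of the inputs absorbed so far
    values: list = field(default_factory=list)      # payloads held until output is due


class TraceableState:
    """Hypothetical in-memory state that remembers which inputs fed each output."""

    def __init__(self):
        self._by_key = {}

    def add(self, key, message_id, value):
        entry = self._by_key.setdefault(key, Pending())
        entry.origin_ids.append(message_id)
        entry.values.append(value)

    def flush(self, key):
        """Build the delayed output and drop the state held for this key."""
        entry = self._by_key.pop(key)
        return {
            "key": key,
            "origin_ids": entry.origin_ids,  # lets users rebuild the chain of custody
            "count": len(entry.values),
        }
```

Since this state lives in memory, it would be lost on a pod restart; making it durable is exactly the kind of risk that would fall on the implementer.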


Message from the maintainers:

If you wish to see this enhancement implemented please add a 👍 reaction to this issue! We often sort issues this way to know what to prioritize.

kohlisid commented 1 day ago

Hey @magelisk, we have a MonoVertex type, which is a single processing entity with a source, transformer, and sink: https://numaflow.numaproj.io/core-concepts/monovertex/ Does that help you?

magelisk commented 1 day ago

In my mind, this is almost the inverse of MonoVertex. MV is its own self-contained thing that replaces a pipeline, and it only responds when a message is sent into the system. Here, I'm looking for a standard map node in existing pipelines, EXCEPT that the data goes in and "some time later" data comes out, even if no more data is coming in (perhaps ESPECIALLY if no data is coming in and it needs to 'close' some data processing, like ending a GPS route).
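The GPS route case could be sketched roughly like this. It is plain Python, not Numaflow SDK code, and `RouteCloser`, `on_point`, `poll`, and the `route_closed` event are made-up names for illustration. The clock is passed in explicitly so the behavior is easy to follow.

```python
class RouteCloser:
    """Hypothetical logic: emit a 'route closed' result after a period of silence."""

    def __init__(self, idle_seconds):
        self.idle_seconds = idle_seconds
        self._last_seen = {}  # route_id -> timestamp of the most recent GPS point

    def on_point(self, route_id, now):
        """Input side: record that a point for this route arrived at time `now`."""
        self._last_seen[route_id] = now

    def poll(self, now):
        """Output side: called periodically, even when no input is arriving."""
        closed = [
            route_id
            for route_id, seen in self._last_seen.items()
            if now - seen >= self.idle_seconds
        ]
        for route_id in closed:
            del self._last_seen[route_id]
        return [{"route_id": r, "event": "route_closed"} for r in closed]
```

With `idle_seconds=30`, a route whose last point arrived at t=0 produces nothing when polled at t=10 and a single `route_closed` result when polled at t=35, with no new input needed to trigger it. A plain map cannot do this today because it only runs when a message arrives.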