numaproj / numaflow

Kubernetes-native platform to run massively parallel data/streaming jobs
https://numaflow.numaproj.io
Apache License 2.0
1.01k stars 98 forks source link

Document "Dropping Messages" concept #1740

Open th0ger opened 1 month ago

th0ger commented 1 month ago

In SDK implementations and event time filter examples there exists MessageToDrop(...) (go) and to_drop(...) (python).

It is not obvious why dropping messages must be done by adding dropped message with timestamps. (I have my guesses though.)

Please clarify in core numaflow docs the overall concept of:

I'll be happy to review a doc PR to check if it makes sense.


Message from the maintainers:

If you wish to see this enhancement implemented please add a 👍 reaction to this issue! We often sort issues this way to know what to prioritize.

KeranYang commented 1 month ago

Hey @th0ger, thanks for raising the issue. Yes, we don't have dedicated documentation for MessageToDrop, we will look into improving our documentation.

To quickly answer your questions.

why dropping messages must be done by adding dropped message with timestamps.

At the backend, on the SDK side, dropping a message is done by giving the message a DROP tag. On the platform side, when a message is received from UDF with a DROP tag, the platform does NOT write the message to the out edge. The main use cases of dropping are data filtering and conditional forwarding.

A timestamp is required ONLY for the source transformer because, at the source, the timestamp is required for watermark calculation. For other types of user-defined functions, the timestamp is not required.