Closed by metcox 5 years ago
@metcox thanks for reporting this.
In the case of filtering, this is the expected behaviour from the Processor API point of view, although I understand it is not what users would expect.
The Kafka docs state that transform:
Marks the stream for data re-partitioning: Applying a grouping or a join after transform will result in re-partitioning of the records. If possible use transformValues instead, which will not cause data re-partitioning.
Unfortunately, if we change the implementation from transform to transformValues, it will break the filtering functionality by emitting kv pairs with value=null.
In this case, I'd recommend using the native filtering (without tracing) if grouping or joining happens after transform. I will add a warning to the API docs.
I've found that peek and mark have similar issues, but those can be refactored to use transformValues instead. I will create a patch to fix this.
Unfortunately, if we change the implementation from transform to transformValues it will break the filtering functionality by emitting kv pairs with value=null
As a workaround, one could put .filter((key, value) -> value != null) just after?
If the filtering operation is potentially expensive it's worth being able to trace it without risking the repartition performance hit, WDYT @jeqo ?
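To make the null-value problem and the proposed workaround concrete, here is a minimal plain-Java model. It does not use the Kafka Streams API: the class NullFilterSketch and the methods markFilter, dropNulls and sample are hypothetical names, and the list of entries only stands in for a stream. The one property it borrows from the discussion above is that a value-only transform emits exactly one record per input, so a rejected record can only show up as value=null.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.BiPredicate;

// Plain-Java model only; none of these names exist in Kafka Streams or Brave.
class NullFilterSketch {

    // Models a filter built on a value-only transform: the key is untouched and
    // one record is emitted per input, so rejected records carry a null value.
    static <K, V> List<Map.Entry<K, V>> markFilter(
            List<Map.Entry<K, V>> input, BiPredicate<K, V> predicate) {
        List<Map.Entry<K, V>> output = new ArrayList<>();
        for (Map.Entry<K, V> record : input) {
            boolean keep = predicate.test(record.getKey(), record.getValue());
            output.add(new SimpleEntry<>(record.getKey(), keep ? record.getValue() : null));
        }
        return output;
    }

    // Models the follow-up .filter((key, value) -> value != null) step.
    static <K, V> List<Map.Entry<K, V>> dropNulls(List<Map.Entry<K, V>> input) {
        List<Map.Entry<K, V>> output = new ArrayList<>();
        for (Map.Entry<K, V> record : input) {
            if (record.getValue() != null) {
                output.add(record);
            }
        }
        return output;
    }

    // Three sample records: a=1, b=2, c=3.
    static List<Map.Entry<String, Integer>> sample() {
        List<Map.Entry<String, Integer>> records = new ArrayList<>();
        records.add(new SimpleEntry<>("a", 1));
        records.add(new SimpleEntry<>("b", 2));
        records.add(new SimpleEntry<>("c", 3));
        return records;
    }

    public static void main(String[] args) {
        // Keep even values only; odd values come out as null-valued records.
        List<Map.Entry<String, Integer>> marked = markFilter(sample(), (k, v) -> v % 2 == 0);
        System.out.println("after markFilter: " + marked);
        System.out.println("after dropNulls:  " + dropNulls(marked));
    }
}
```

In this model the predicate (the part worth tracing) runs once per record inside markFilter, and the second pass only does a cheap null check.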
That's one of the 2 options I see:
1. Keep it as transform and warn about partitioning.
2. Filter with transformValues and require an additional filter right after to remove nulls.
I'm considering option 2, but with a different name, like markFilter, to avoid confusion.
And we could keep option 1 with a warning. As transformer is used, I think it is enough for users to be aware of repartitioning.
Unfortunately, if we change the implementation from transform to transformValues it will break the filtering functionality by emitting kv pairs with value=null
Right, thank you for pointing that out.
That's one of the 2 options I see: 1. Keep it as transform and warn about partitioning. 2. Filter with transformValues and require an additional filter right after to remove nulls. I'm considering option 2, but with a different name, like markFilter, to avoid confusion. And we could keep option 1 with a warning. As transformer is used, I think it is enough for users to be aware of repartitioning.
Having option 2 and keeping option 1 is okay for me.
Describe the Bug
The instrumentation of some Kafka Streams operations with KafkaStreamsTracing results in an unnecessary call to KStream.transform(). This can trigger unwanted and expensive repartitioning. The operations involved are: filter(), filterNot(), peek(), and mark(). These operations return a TransformerSupplier where a ValueTransformerWithKeySupplier would be more appropriate. The associated call to KStream.transformValues() would not mark the stream for repartitioning.
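As a rough illustration of why the choice between the two calls matters, here is a toy plain-Java model of the "marked for repartitioning" behaviour described above. It is not Kafka Streams code: the class RepartitionSketch, its fields and its methods are hypothetical, and it assumes only what the docs say, namely that transform marks the stream, transformValues does not, and a grouping applied to a marked stream forces a repartition (counted here as one extra sub-topology).

```java
// Toy model only; this is not how Kafka Streams is implemented.
class RepartitionSketch {
    // Set by operations that may change the key.
    private boolean repartitionRequired = false;
    // Every stream starts out as a single sub-topology.
    private int subTopologies = 1;

    // transform may change the key, so the stream gets marked.
    RepartitionSketch transform() {
        repartitionRequired = true;
        return this;
    }

    // transformValues cannot change the key, so the mark is left untouched.
    RepartitionSketch transformValues() {
        return this;
    }

    // Grouping a marked stream goes through a repartition step,
    // which splits the processing into one more sub-topology.
    RepartitionSketch groupByKey() {
        if (repartitionRequired) {
            subTopologies++;
            repartitionRequired = false;
        }
        return this;
    }

    int subTopologies() {
        return subTopologies;
    }

    public static void main(String[] args) {
        System.out.println("transform + groupByKey: "
                + new RepartitionSketch().transform().groupByKey().subTopologies());
        System.out.println("transformValues + groupByKey: "
                + new RepartitionSketch().transformValues().groupByKey().subTopologies());
    }
}
```

In this model the transform-based chain ends with two sub-topologies and the transformValues-based chain with one, which mirrors the difference reported in this issue.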
Steps to Reproduce
The following stream with Brave instrumentation generates 2 sub-topologies:
Produces the topology:
Expected Behaviour
A single sub-topology is expected, as with the uninstrumented stream.
Produces the topology:
A sample project is available at https://github.com/metcox/brave-kafka-streams-topology.git