JHK commented 5 years ago

To allow deeper introspection into what happens when receiving or publishing messages, it would be useful to have an instrumentation interface compatible with Active Support Instrumentation. The default could be a NullInstrumenter that simply discards the information. For ideas on what might actually be useful to instrument, ruby-kafka is a good source of inspiration:

message producing
message delivery
message polling
join/leave consumer group
(re-)assign partitions within consumer group
offset changes
consumer heartbeat
connection updates
probably more...
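A minimal sketch of what such an interface could look like. Neither class below exists in rdkafka-ruby; the names NullInstrumenter and RecordingInstrumenter and the event name are assumptions for illustration. The only thing borrowed from Active Support is the call shape instrument(name, payload) { ... }:

```ruby
# Hypothetical sketch, not part of rdkafka-ruby.
# An instrumenter is any object responding to
# instrument(name, payload = {}) { |payload| ... }, which mirrors the
# call shape of ActiveSupport::Notifications.instrument.

class NullInstrumenter
  # Discards the event and only runs the block, returning its result.
  def instrument(_name, payload = {})
    yield payload if block_given?
  end
end

class RecordingInstrumenter
  Event = Struct.new(:name, :payload, :duration)

  attr_reader :events

  def initialize
    @events = []
  end

  # Runs the block, records how long it took, and returns the block's result.
  def instrument(name, payload = {})
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    result = yield payload if block_given?
    result
  ensure
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    @events << Event.new(name, payload, elapsed)
  end
end

# Usage: the caller does not care which instrumenter it was given.
instrumenter = RecordingInstrumenter.new
instrumenter.instrument("produce_message.producer", topic: "events") do
  :enqueued # stand-in for the real produce call
end
```

Because both classes share one method, a library could default to NullInstrumenter and accept ActiveSupport::Notifications, dry-monitor adapters, or anything else with the same signature.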

thijsc commented 5 years ago

Thanks for asking. And also for using this gem in racecar! :-)

I have considered integrating AS instrumentation, but given the nature of the underlying C lib I don't see a way in which that approach works well. Did you see the statistics callback we added? https://github.com/appsignal/rdkafka-ruby/pull/40

I just noticed the docs on rubydocs are not properly regenerated for some reason, so you might have missed that.
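As a rough illustration of the statistics callback mentioned above: the handler receives librdkafka's statistics as a Hash. The StatsSummary module below is a hypothetical helper, not part of the gem; the key names it reads (name, txmsgs, rxmsgs, msg_cnt, brokers, rtt.avg) follow librdkafka's statistics documentation. Registration is shown only as a comment, under the assumption that the rdkafka gem is loaded and statistics.interval.ms is set to a non-zero value in the client config:

```ruby
# Hypothetical helper that reduces one statistics payload to a few numbers.
module StatsSummary
  def self.call(stats)
    brokers = stats.fetch("brokers", {})
    {
      client: stats["name"],
      messages_sent: stats["txmsgs"],
      messages_received: stats["rxmsgs"],
      queued_messages: stats["msg_cnt"],
      # Average broker round-trip time in microseconds, per broker name.
      broker_rtt_avg_us: brokers.transform_values { |b| b.dig("rtt", "avg") }
    }
  end
end

# Registration (assumption: rdkafka gem loaded, statistics enabled):
#
#   Rdkafka::Config.statistics_callback = lambda do |stats|
#     puts StatsSummary.call(stats)
#   end
```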

thijsc commented 5 years ago

Also, some callbacks would definitely make sense to add, especially for partition assignment changes.

mensfeld commented 5 years ago

It would be really good if the instrumentation engine were not based on AS Notifications but rather compatible with them, so other engines can be plugged in (like dry-monitor, which we use in Karafka).

JHK commented 5 years ago

@mensfeld I updated the ticket description to make it clearer that this should not rely on ActiveSupport, but rather use the same interface for instrumentation.

JHK commented 5 years ago

The statistics endpoint goes in the right direction, but it is not what I meant with this issue. This is about being able to connect the instrumentation, e.g. to the Datadog agent, to introspect what happened on each and every request (that got recorded). There it is quite handy to know which branch the code took, how often, and how long it took.

thijsc commented 5 years ago

I've been thinking about this quite a bit, especially since I work on a monitoring product all day.

The thing is that I'm not sure there actually is something to measure. Librdkafka does a lot of buffering in the background. Actually consuming a message from Ruby pops something off an internal buffer, which is always super fast. I think what you're talking about mainly happens inside librdkafka. The stats for that are present in the statistics callback.

Can you give an example of where you'd like to see hooks? What would these hooks really allow you to measure?

JHK commented 5 years ago

Looking at the instrumentation of ruby-kafka it provides a notification one can subscribe to whenever a message produce gets called. It provides some meta information (code).

This can then be used, for example, in the datadog-agent or (as in my case) in time_bandits to determine the call frequency per request or similar metrics.

thijsc commented 5 years ago

Right, I think I understand the use case better. You're not so much interested in the performance of the produce call. But you do want to get hooks and see the volume?

mensfeld commented 5 years ago

@thijsc I am interested in the produce performance. Having instrumentation for it would also cover the volume, at least for DD, by incrementing a counter for the messages sent to a particular topic.

thijsc commented 5 years ago

I am interested in the produce performance.

What do you see yourself measuring exactly?

mensfeld commented 5 years ago

What do you see yourself measuring exactly?

How many messages I can send per second depending on the ack level, plus where they go (to which topic).
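A small sketch of the kind of measurement described here, assuming some hook calls record(topic) once per produced message. TopicThroughput is a hypothetical class, not part of rdkafka-ruby; to compare ack levels one would keep one instance per producer configuration:

```ruby
# Hypothetical per-topic throughput counter.
class TopicThroughput
  # The clock is injectable so the arithmetic can be checked deterministically.
  def initialize(clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @clock = clock
    @started = clock.call
    @counts = Hash.new(0)
    @lock = Mutex.new
  end

  # Call once for every message handed to the producer.
  def record(topic)
    @lock.synchronize { @counts[topic] += 1 }
  end

  # Messages per second for each topic since this counter was created.
  def rates
    elapsed = (@clock.call - @started).to_f
    return {} unless elapsed.positive?

    @lock.synchronize { @counts.transform_values { |n| n / elapsed } }
  end
end
```

Note that this counts messages handed to the producer, not messages acknowledged by the broker, since delivery happens asynchronously inside librdkafka.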

mensfeld commented 5 years ago

@thijsc any reason for the statistics_callback to be global? What if I would want to have different callback handling in various consumers/producers?

thijsc commented 5 years ago

@thijsc any reason for the statistics_callback to be global? What if I would want to have different callback handling in various consumers/producers?

#82 was opened for this question.

thijsc commented 5 years ago

I'm trying to get this done, but not making a lot of progress because I don't have a clear picture in my mind what this looks like. I can see how events for assignment changes and so forth can work.

I can also see how emitting an event for producing a message could work. I don't see how emitting an event for a delivered message would be useful. AS notifications assume that things happen in sync, and that's not going to be the case here. I think you're going to get a lot of out-of-order events.

I also don't see how we can do hooks for message delivery. The C lib pops them off a buffer, so the moment they arrive on the Ruby side says little about how the network is doing, for example. The stats in the statistics callback do tell us that. Maybe I'm missing a useful use case here?

I think we need to spend some time coming up with a spec of which events should be emitted and write up some use cases on how one would benefit from them. That'll make it a more manageable project to get this done.

@JHK and @mensfeld which events do you think should be emitted and could you write up a short description of when they would trigger and which information they would emit?

JHK commented 5 years ago

I cannot say what exactly needs to be in such a message, but rather have a look at what racecar already provides:

Producing a message: https://github.com/zendesk/ruby-kafka/blob/master/lib/kafka/producer.rb#L220-L228
Message delivery: https://github.com/zendesk/ruby-kafka/blob/master/lib/kafka/producer.rb#L246-L257
Error on a Topic: https://github.com/zendesk/ruby-kafka/blob/master/lib/kafka/producer.rb#L467-L470

Those are instrumentations built from the need to measure details within racecar. The statistics callback already provides a lot of that information, but not the hook itself. So I'd suggest including what makes sense to you in that hook. If one needs more, we can still extend it using individual PRs. But the general idea of hooks will be present by then, and the parameters can be discussed on a case-by-case basis.

dasch commented 5 years ago

We have a pretty clear need to measure the number of successful / failed message deliveries per producer process.

mensfeld commented 5 years ago

@dasch but you can do that yourself now: https://github.com/karafka/waterdrop/pull/106/files#diff-d179c7dee2064c1622d2d3da2b03c44dR32
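A minimal sketch of how such a count could be kept with a delivery callback. DeliveryCounter is a hypothetical class; it assumes the delivery report object responds to #error and returns nil on success, which may depend on the rdkafka-ruby version in use. Registration is shown only as a comment:

```ruby
# Hypothetical counter for successful and failed deliveries in one process.
class DeliveryCounter
  attr_reader :delivered, :failed

  def initialize
    @delivered = 0
    @failed = 0
    @lock = Mutex.new
  end

  # Assumption: report.error is nil when the message was delivered.
  def call(report)
    @lock.synchronize do
      if report.error.nil?
        @delivered += 1
      else
        @failed += 1
      end
    end
  end
end

# Registration (assumption: producer is an Rdkafka::Producer, which accepts
# any object responding to #call as its delivery callback):
#
#   counter = DeliveryCounter.new
#   producer.delivery_callback = counter
```

The mutex is there because the callback is not invoked on the thread that called produce.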

thijsc commented 5 years ago

Thanks all for the input! I'm going to work on it.

emersonpriceiv commented 3 years ago

Hello! I'm curious what became of this work. We're currently going through the process of updating Racecar and we've been leveraging the consumer heartbeat instrumentation for monitoring our consumer health. Are there any plans to implement something similar? If not we would love to see it!

mensfeld commented 3 years ago

@emersonpriceiv the current API allows you to do that. Please see the waterdrop PR above, where there is full instrumentation support.

thijsc commented 1 year ago

Closing this one. I think it's not clear how we can improve on rdkafka's internal capabilities.

karafka / rdkafka-ruby

Instrumentation support #54
