tulios / kafkajs

A modern Apache Kafka client for node.js
https://kafka.js.org
MIT License
3.75k stars 527 forks

Enhance the instrumentations events to support metrics collection #1706

Open ilia-beliaev-miro opened 3 months ago

ilia-beliaev-miro commented 3 months ago

Is your feature request related to a problem? Please describe.

We use different languages in different services to connect to Kafka and to consume and produce messages. For all of these clients we want a single observability dashboard that shows the same metrics for every service.

The Java Kafka client library exposes many metrics, and we were able to collect some of them thanks to the instrumentation events. However, in the current state of kafkajs, others are either very difficult to collect (requiring runtime overrides of many different functions or classes) or impossible, because the information they need is not exposed via instrumentation events. In the Java client, this metrics collection is part of the library itself.

Describe the solution you'd like

To collect these metrics, either the existing instrumentation events need to be enhanced or new events need to be added. Metrics that we can't collect at the moment:

Additional context

2m commented 1 month ago

We are also in a similar situation (microservices written for Java and Node) and we are working on having unified dashboards. Our dashboards only have 3 graphs for Kafka - message consume/produce rate and consumer lag.

Currently we show the batch (as opposed to message) consume/produce rate, which is a bit different from the message rate but is still fine.
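For the consume side, a message rate can probably be derived from the same batch-level events. The following is a minimal sketch, assuming payloads shaped like the kafkajs consumer `END_BATCH_PROCESS` instrumentation event, which carries a numeric `batchSize`. `MessageCounter` is a hypothetical helper, not part of kafkajs, and this does not cover the produce side.

```javascript
// Sketch: count messages as well as batches from batch-level event payloads.
// Assumption: each payload has a numeric `batchSize`, as in the kafkajs
// consumer END_BATCH_PROCESS event. MessageCounter is a hypothetical helper.
class MessageCounter {
  constructor() {
    this.batches = 0;
    this.messages = 0;
  }

  // Call once per processed batch.
  record({ batchSize }) {
    this.batches += 1;
    this.messages += batchSize;
  }

  // Average rates over a window of `elapsedSeconds`.
  rates(elapsedSeconds) {
    return {
      batchesPerSecond: this.batches / elapsedSeconds,
      messagesPerSecond: this.messages / elapsedSeconds,
    };
  }
}

// Wiring it to a kafkajs consumer would look roughly like:
//   consumer.on(consumer.events.END_BATCH_PROCESS, (e) => counter.record(e.payload));
```

In practice the totals would be exported as monotonically increasing counters and the rate computed by the metrics backend.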

The metric we miss the most is fetch_manager_records_lag, which shows whether there are any outstanding records after the last fetch. This is especially useful for services with high throughput. In such cases the broker always reports consumer lag, because there are always messages in flight. That makes the broker's metric cumbersome to use when making scaling decisions.

Tracking fetch_manager_records_lag on the client side is then needed in order to notice when the consumer is not able to keep up with the incoming messages.
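As a rough approximation of this on the client side, the per-partition lag can be kept from batch-level events. This is a minimal sketch, assuming payloads shaped like the kafkajs consumer `END_BATCH_PROCESS` event, whose `offsetLag` field is a numeric string giving the distance between the last offset in the batch and the partition's high watermark. `LagTracker` is a hypothetical helper. The value is only refreshed when a batch is processed, so it is not an exact equivalent of the Java client's metric.

```javascript
// Sketch: keep the most recent lag per topic-partition from batch events.
// Assumption: payloads look like { topic, partition, offsetLag }, with
// offsetLag given as a numeric string (as in kafkajs END_BATCH_PROCESS).
// LagTracker is a hypothetical helper, not part of kafkajs.
class LagTracker {
  constructor() {
    this.lagByPartition = new Map();
  }

  // Overwrite the stored lag for this partition with the latest value.
  record({ topic, partition, offsetLag }) {
    this.lagByPartition.set(`${topic}-${partition}`, BigInt(offsetLag));
  }

  // Largest lag across all partitions seen so far (0n if none).
  maxLag() {
    let max = 0n;
    for (const lag of this.lagByPartition.values()) {
      if (lag > max) max = lag;
    }
    return max;
  }
}

// Wiring it to a kafkajs consumer would look roughly like:
//   consumer.on(consumer.events.END_BATCH_PROCESS, (e) => tracker.record(e.payload));
```

A `maxLag()` that stays above zero across scrapes would suggest the consumer is not keeping up, which is the signal described above for scaling decisions.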