vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev

Increased kafka consumer lag with internal metrics enabled. #17057

Open timfehr opened 1 year ago

timfehr commented 1 year ago


Problem

Hi Vector team and community, I am opening this issue on behalf of UNIBERG GmbH.

In our environment we discovered some unusual behaviour after we enabled the internal metrics with the Prometheus exporter as a sink. As soon as the internal metrics were enabled, we saw a significant increase in the Kafka consumer lag for this Vector instance and topic. After we disabled the Vector internal metrics again, the lag decreased. What made us wonder is that we did not see any change in the number of messages read from the given topic, as monitored by the Strimzi/Kafka metrics.

In the following image you can see the consumer lag reported by Strimzi/Kafka. At about 09:48 am the metrics were enabled and the consumer lag increased significantly, although it then remained stable at that level. At about 10:04 am we restarted the service to make sure Vector was handling the data correctly and to reduce the lag in general. That worked, but again only down to a certain level, which was still higher than before the internal metrics were enabled. At 10:09 am we disabled the metrics again and the consumer lag decreased to its previous level.

[Screenshot: Bildschirmfoto 2023-03-23 um 10 45 21 (Kafka consumer lag reported by Strimzi)]

Further, you can see no significant change in the messages read per second from the topic (please look only at the yellow line). The different timestamps are due to a later repetition of the test to investigate the issue a bit further; this screenshot was taken during that run because we did not capture one in the first run, where the priority was getting our system back into a known stable state.

[Screenshot: Bildschirmfoto 2023-04-05 um 11 39 59 (messages read per second from the topic)]

Vector also reported these high lags per topic partition.

[Screenshot: Bildschirmfoto 2023-04-05 um 11 48 28 (lag per topic partition reported by Vector)]

What we already did was increase the resources the Vector service is allowed to use, which did not change the behaviour.

To us this seems like incorrect behaviour. Of course the internal metrics come with some resource overhead, but such a large increase seems wrong.

If more information is needed, please tell us; we are happy to provide it.

Configuration

[sources.kafka]
type = "kafka"
bootstrap_servers = "${KAFKA_IN_BOOTSTRAP_SERVERS:?Missing kafka in bootstrap servers}"
group_id = "${KAFKA_CONSUMER_GROUP_ID:?Missing kafka group id}"
auto_offset_reset = "${KAFKA_NEW_CONSUMER_OFFSET:-latest}"
decoding.codec = "${KAFKA_IN_CODEC:-bytes}"
topics = ["${KAFKA_IN_TOPIC:?Missing kafka input topic}"]
metrics.topic_lag_metric = true

[sources.internal_metrics]
type = "internal_metrics"
scrape_interval_secs = 30 

[sinks.prometheus_exporter]
type = "prometheus_exporter"
inputs = [ "*metrics" ]
address = "${VECTOR_PROMETHEUS_ENDPOINT:-0.0.0.0:8888}"

Version

vector 0.28.1 (x86_64-unknown-linux-gnu ff15924 2023-03-06)

Debug Output

No response

Example Data

No response

Additional Context

Vector is running in Kubernetes.

References

No response

zamazan4ik commented 1 year ago

@timfehr thank you for the report! Could you please share with us additional information about your case:

  1. Could you please perform the same test but with metrics.topic_lag_metric = false? My guess is that the internal implementation of lag reporting could be somewhat inefficient (I am the author of it - sorry :) ). You could check the actual lags with a kafka-exporter external to Vector.
  2. Could you please share information about the topic/partition count in your setup? If you have a large number of them, the current implementation could be slow.

The observed behavior is definitely not acceptable, so it would be nice if we were able to fix it.

timfehr commented 1 year ago

Hi @zamazan4ik,

  1. I enabled the metrics again with the mentioned setting metrics.topic_lag_metric set to true, just to double-check and reproduce the issue again (see the increase in lag on the right side of the screenshot). After about 10 minutes I changed the setting to false, as requested, and the lag decreased. [Screenshot: Bildschirmfoto 2023-04-17 um 12 45 01]

The lags were captured with the kafka-exporter from Strimzi.

  2. Vector reads from 1 topic with 3 partitions.
zamazan4ik commented 1 year ago

Thanks for the reproduction!

Vector reads from 1 topic with 3 partitions.

How many topics and partitions in total are created on your Kafka cluster? The current Vector implementation for these metrics probably iterates over all of them, not just the ones used by Vector.

timfehr commented 1 year ago

How many topics and partitions in total are created on your Kafka cluster? The current Vector implementation for these metrics probably iterates over all of them, not just the ones used by Vector.

26 topics with 3 partitions each.

zamazan4ik commented 1 year ago

Could you please check whether Vector exposes lag metrics only for the Vector-specific topics/partitions or for all of them?

timfehr commented 1 year ago

Could you please check whether Vector exposes lag metrics only for the Vector-specific topics/partitions or for all of them?

It does expose the lag metrics only for the configured topic, but with 4 partition_ids (-1, 0, 1, 2).

zamazan4ik commented 1 year ago

Hm... for just 4 partitions that is too big a lag (based on my expectations). It needs to be investigated more deeply. Right now I can only suggest disabling lag reporting.

timfehr commented 1 year ago

Could you maybe explain quickly, in a few words, why Vector reports the lag for 4 partitions instead of 3, or what partition_id: -1 means?

zamazan4ik commented 1 year ago

Could you maybe explain quickly, in a few words, why Vector reports the lag for 4 partitions instead of 3, or what partition_id: -1 means?

Honestly, I don't know what it means; we need to investigate that as well. Maybe checking the kafka-exporter sources could help here.

timfehr commented 1 year ago

Could you maybe explain quickly, in a few words, why Vector reports the lag for 4 partitions instead of 3, or what partition_id: -1 means?

Honestly, I don't know what it means; we need to investigate that as well. Maybe checking the kafka-exporter sources could help here.

I see this -1 partition only in the metrics reported by Vector. Strimzi reports the correct 3 partitions.

zamazan4ik commented 1 year ago

Does Strimzi use https://github.com/danielqsj/kafka_exporter or something else?

timfehr commented 1 year ago

Does Strimzi use https://github.com/danielqsj/kafka_exporter or something else?

Yes, they use it.

https://strimzi.io/docs/operators/latest/overview.html#metrics-overview_str

bruceg commented 1 year ago

To summarize what I'm reading in the above:

  1. Without internal_metrics enabled, the kafka source is reading from the cluster without lag.
  2. With internal_metrics enabled, Vector starts reading with significant lag.
  3. Setting metrics.topic_lag_metric = false with internal_metrics enabled eliminates the lag.
  4. This cluster has 26 topics with 3 partitions each, resulting in an additional 78 gauges updated each time statistics are updated, which is every 1.0 second by default.

Does this accurately reflect the conversation so far?

Given this, a few questions come to mind:

  1. Can we filter the lag metrics on only the topics that we are actually reading from?
  2. Should we default the lag metrics to "off" to avoid this problem on other larger clusters?
  3. Would it be possible to convert all of these metrics to registered metrics to avoid the re-registration overhead on each emit?

Note: The partition ID of -1 is an "internal UA/UnAssigned partition" according to the librdkafka docs on statistics. We should probably be skipping that partition too, and possibly any negative partition ID.
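For illustration, a filter along those lines could look like the sketch below. The types here are simplified stand-ins for the per-partition statistics entries, not Vector's actual structures or the rdkafka API.

use std::collections::HashMap;

// Simplified stand-in for a per-partition entry of the librdkafka statistics
// payload (illustrative only, not Vector's real type).
struct PartitionStats {
    partition: i32,    // librdkafka reports -1 for the internal UA/UnAssigned partition
    consumer_lag: i64,
}

// Collect (partition_id, lag) pairs for real partitions only, skipping the
// UA partition and any other negative partition ID before emitting gauges.
fn partition_lags(partitions: &HashMap<String, PartitionStats>) -> Vec<(i32, i64)> {
    partitions
        .values()
        .filter(|p| p.partition >= 0)
        .map(|p| (p.partition, p.consumer_lag))
        .collect()
}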

zamazan4ik commented 1 year ago

Does this accurately reflect the conversation so far?

Yep.

Can we filter the lag metrics on only the topics that we are actually reading from?

I think it's possible to do. I didn't check the code, but it will likely require passing more information to the statistics subsystem, such as the Vector-related topic and partition IDs.

Should we default the lag metrics to "off" to avoid this problem on other larger clusters?

It's already false according to the documentation: https://vector.dev/docs/reference/configuration/sources/kafka/#metrics.topic_lag_metric . Maybe we need to update the documentation with a note about possible performance degradation until we fix the current implementation.

Would it be possible to convert all of these metrics to registered metrics to avoid the re-registration overhead on each emit?

I think yes. My only concern is that the last time I worked with registered metrics I ran into a strange problem where a registered metric didn't have all the required labels (AFAIK, it was the AMQP source). But I hope that was just a mistake somewhere in my implementation. Another thing to think about: if the set of Vector-related partitions changes at runtime, we need to register a new metric. From my experience, such things are not so easy to implement (it always ends up as an ugly HashMap of pre-registered metrics). Maybe we could improve the situation somehow in this area.
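As a rough sketch of that "HashMap with pre-registered metrics" shape, assuming a placeholder Gauge type rather than Vector's real registered-metric handles, gauges could be registered lazily per (topic, partition) so nothing is re-registered on each statistics emit, while partitions that appear at runtime are still handled:

use std::collections::HashMap;
use std::sync::atomic::{AtomicI64, Ordering};
use std::sync::Arc;

// Placeholder for a registered gauge handle (illustrative only).
#[derive(Default)]
struct Gauge(Arc<AtomicI64>);

impl Gauge {
    fn set(&self, value: i64) {
        self.0.store(value, Ordering::Relaxed);
    }
}

// Lag gauges registered once per (topic, partition) and reused on every
// statistics callback, so there is no re-registration on each emit.
#[derive(Default)]
struct LagMetrics {
    gauges: HashMap<(String, i32), Gauge>,
}

impl LagMetrics {
    fn record(&mut self, topic: &str, partition: i32, lag: i64) {
        // Register lazily the first time a (topic, partition) pair is seen;
        // this also covers partitions that only appear at runtime.
        self.gauges
            .entry((topic.to_owned(), partition))
            .or_default()
            .set(lag);
    }
}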

We should probably be skipping that partition too, and possibly any negative partition ID.

Agreed. We definitely need this kind of filtering, and it should be easy to implement.

timfehr commented 1 year ago

Does this accurately reflect the conversation so far?

Yes.

Given this, a few questions come to mind:

  1. Can we filter the lag metrics on only the topics that we are actually reading from?

I only see the metric for the topic configured in the kafka source, so I cannot comment on this, but I see @zamazan4ik already did.

Note: The partition ID of -1 is an "internal UA/UnAssigned partition" according to the librdkafka docs on statistics. We should probably be skipping that partition too, and possibly any negative partition ID.

Also, actual assigned partitions can sometimes report negative lag values, which is probably caused by timing issues while fetching the offsets. The kafka-exporter already mentioned here has an open PR which seems to fix that issue. Maybe something similar is possible here, or a simple filter that clamps these values to a minimum of 0, so as not to confuse people with negative values.
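As a small illustration of that kind of filter (a sketch, not Vector's actual code), the reported value could simply be clamped at zero:

fn reported_lag(raw_consumer_lag: i64) -> u64 {
    // Negative lag can show up when the end offset and the committed offset
    // are fetched at slightly different times; report it as 0 instead.
    raw_consumer_lag.max(0) as u64
}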

bruceg commented 1 year ago

Are these enhancements something you would be interested in taking on, @zamazan4ik, as you have had a hand in the lag metrics implementation?

zamazan4ik commented 1 year ago

I do not think I have enough time for this right now :(