timfehr opened 1 year ago
@timfehr thank you for the report! Could you please share additional information about your case: does setting `metrics.topic_lag_metric = false` help? My guess is that the internal implementation of lag reporting could be somewhat inefficient (I am the author of it, sorry :) ). You could verify the actual lags with a Kafka exporter external to Vector. The observed behavior is definitely not acceptable, so it would be nice if we were able to fix it.
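For anyone hitting this in the meantime, the workaround is a single source-level option. A minimal sketch of a kafka source with lag reporting disabled, per the documented `metrics.topic_lag_metric` setting (the server, group, and topic names below are placeholders):

```toml
[sources.my_kafka]
type = "kafka"
bootstrap_servers = "kafka:9092"
group_id = "vector-consumer"
topics = ["my-topic"]

# Disable per-partition consumer lag gauges until the
# performance issue discussed in this thread is resolved.
metrics.topic_lag_metric = false
```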
Hi @zamazan4ik,

I set `metrics.topic_lag_metric` to `true` just to double-check and reproduce the issue again (see the increase of the lag in the screenshot on the right side). After about 10 minutes I changed the setting back to `false` as you requested, and the lag decreased.

The lags were captured with the `kafka-exporter` from Strimzi.
Thanks for the reproduction!
Vector reads from 1 topic with 3 partitions.
How many topics and partitions in total exist on your Kafka cluster? The current Vector implementation for metrics probably iterates over all of them, not just those used by Vector.
26 topics with 3 partitions each.
Could you please check: does Vector expose lag metrics only for the Vector-specific topics/partitions, or for all of them?
It does expose the lag metrics only for the configured topic, but with 4 `partition_id` values (-1, 0, 1, 2).
Hm... for just 4 partitions that lag is too big (from my expectations). This needs to be investigated more deeply. Right now I can only suggest disabling lag reporting.
> Could you maybe explain real quick, why Vector reports the lag for 4 partitions instead of 3, and what `partition_id: -1` means?

Honestly, I don't know what it means; we need to investigate it as well. Maybe checking the `kafka-exporter` sources could help here.
I see this `-1` partition only in the metrics reported by Vector. Strimzi reports the correct 3 partitions.
Does Strimzi use https://github.com/danielqsj/kafka_exporter or something else?
Yes, they use it.
https://strimzi.io/docs/operators/latest/overview.html#metrics-overview_str
To summarize what I'm reading in the above:

- Without `internal_metrics` enabled, the kafka source is reading from the cluster without lag.
- With `internal_metrics` enabled, Vector starts reading with significant lag.
- Setting `metrics.topic_lag_metric = false` with `internal_metrics` enabled eliminates the lag.

Does this accurately reflect the conversation so far?
Given this, a few questions come to mind:
Note: The partition ID of -1 is an "internal UA/UnAssigned partition" according to the librdkafka docs on statistics. We should probably be skipping that partition too, and possibly any negative partition ID.
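The skip suggested above can be sketched as a trivial filter (hypothetical code, not Vector's actual implementation; `reportable_partitions` is an invented name):

```rust
// Hypothetical sketch: when walking the partition map from librdkafka's
// statistics callback, skip the internal UA/UnAssigned partition (-1)
// and, defensively, any other negative partition id.
fn reportable_partitions(partition_ids: &[i32]) -> Vec<i32> {
    partition_ids
        .iter()
        .copied()
        .filter(|&id| id >= 0) // drop -1 (UA) and any negative id
        .collect()
}

fn main() {
    // The exact set reported in this issue: -1, 0, 1, 2.
    let ids = [-1, 0, 1, 2];
    let kept = reportable_partitions(&ids);
    assert_eq!(kept, vec![0, 1, 2]);
    println!("{:?}", kept);
}
```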
> Does this accurately reflect the conversation so far?

Yep.
> Can we filter the lag metrics to only the topics that we are actually reading from?

I think it's possible. I didn't check the code, but it will likely require passing more information to the statistics subsystem, such as the Vector-related topic and partition IDs.
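A sketch of such filtering, assuming the statistics subsystem is handed the set of configured topics (`TopicLag` and `filter_to_configured` are invented names, not Vector's actual types):

```rust
use std::collections::HashSet;

// Hypothetical sketch: keep lag entries only for topics the source is
// actually configured to consume. `TopicLag` stands in for whatever
// per-topic statistics structure the real code iterates over.
#[derive(Debug)]
struct TopicLag {
    topic: String,
    partition: i32,
    lag: i64,
}

fn filter_to_configured(stats: Vec<TopicLag>, configured: &HashSet<String>) -> Vec<TopicLag> {
    stats
        .into_iter()
        .filter(|s| configured.contains(&s.topic))
        .collect()
}

fn main() {
    // Vector consumes only "orders"; the cluster also has "audit".
    let configured: HashSet<String> = ["orders".to_string()].into_iter().collect();
    let stats = vec![
        TopicLag { topic: "orders".into(), partition: 0, lag: 10 },
        TopicLag { topic: "audit".into(), partition: 0, lag: 5 },
    ];
    let kept = filter_to_configured(stats, &configured);
    assert_eq!(kept.len(), 1);
    assert_eq!(kept[0].topic, "orders");
    println!("kept {} entry(ies)", kept.len());
}
```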
> Should we default the lag metrics to "off" to avoid this problem on other larger clusters?

It's already `false` by default according to the documentation: https://vector.dev/docs/reference/configuration/sources/kafka/#metrics.topic_lag_metric. Maybe we need to update the documentation with a note about the possible performance degradation until we fix the current implementation.
> Would it be possible to convert all of these metrics to registered metrics to avoid the re-registration overhead on each emit?

I think yes. My only concern is that the last time I worked with the new metrics I hit a strange problem where a registered metric didn't have all the required labels (AFAIK it was the AMQP source), but I hope that was just a mistake somewhere in my implementation. Another thing to think about: if the set of Vector-related partitions changes at runtime, we need to register a new metric. From my experience, such things are not so easy to implement (it always ends up looking like an ugly `HashMap` of pre-registered metrics). Maybe we could improve the situation somehow in this area.
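The pattern described above might look roughly like this (a generic sketch, not Vector's metrics API; `Gauge` and `LagGauges` are invented stand-ins for real registered-metric handles):

```rust
use std::collections::HashMap;

// Hypothetical sketch of the "HashMap of pre-registered metrics" pattern:
// cache one gauge handle per (topic, partition) and register lazily when
// a new partition shows up at runtime, instead of re-registering on each emit.
#[derive(Default)]
struct Gauge {
    value: i64,
}

#[derive(Default)]
struct LagGauges {
    gauges: HashMap<(String, i32), Gauge>,
}

impl LagGauges {
    fn record(&mut self, topic: &str, partition: i32, lag: i64) {
        // entry() registers the gauge on first use, so partitions that
        // appear later at runtime are handled transparently.
        let gauge = self
            .gauges
            .entry((topic.to_string(), partition))
            .or_default();
        gauge.value = lag;
    }
}

fn main() {
    let mut lags = LagGauges::default();
    lags.record("orders", 0, 42);
    lags.record("orders", 1, 7);
    lags.record("orders", 0, 40); // update only, no re-registration
    assert_eq!(lags.gauges.len(), 2);
    assert_eq!(lags.gauges[&("orders".to_string(), 0)].value, 40);
    println!("registered {} gauges", lags.gauges.len());
}
```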
> We should probably be skipping that partition too, and possibly any negative partition ID.

Agreed. We definitely need to implement this kind of filtering, and it should be easy to implement.
> Does this accurately reflect the conversation so far?

Yes.
> Given this, a few questions come to mind:
>
> - Can we filter the lag metrics to only the topics that we are actually reading from?

I only see the metric for the configured topic in the kafka source, so I cannot comment on this, but I see @zamazan4ik already did.
> Note: The partition ID of -1 is an "internal UA/UnAssigned partition" according to the librdkafka docs on statistics. We should probably be skipping that partition too, and possibly any negative partition ID.

Also, it is sometimes possible to get negative values for actually assigned partitions, which is probably caused by timing issues while fetching the offsets. The kafka-exporter already mentioned here has an open PR which seems to fix that issue. Maybe something like that is also possible here, or a simple filter that clamps these values to at least `0`, so as not to confuse people with negative values.
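The clamping suggested here is a one-liner (hypothetical sketch; `clamp_lag` is an invented name):

```rust
// Hypothetical sketch: clamp transiently negative lag values (caused by
// offset-fetch timing) to zero before reporting them.
fn clamp_lag(raw_lag: i64) -> i64 {
    raw_lag.max(0)
}

fn main() {
    assert_eq!(clamp_lag(-3), 0); // timing artifact, reported as 0
    assert_eq!(clamp_lag(17), 17); // real lag passes through
    println!("{} {}", clamp_lag(-3), clamp_lag(17));
}
```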
Are these enhancements something you would be interested in taking on, @zamazan4ik, as you have had a hand in the lag metrics implementation?

I do not think I have enough time for this right now :(
A note for the community
Problem
Hi Vector team and community, I am opening this issue on behalf of UNIBERG GmbH.
In our environment we discovered some unusual behaviour after we enabled the internal metrics with the Prometheus exporter as a sink. As soon as the internal metrics were enabled, we saw a significant increase in the Kafka consumer lag for this Vector instance and topic. After we disabled the internal metrics again, the lag decreased. What made us wonder is that we did not see any change in the number of messages read from the given topic, as monitored by the Strimzi/Kafka metrics.
In the following image you can see the consumer lag reported by Strimzi/Kafka. At about 09:48 am the metrics were enabled, and there is a significant increase in the consumer lag, though it remained stable at that level. At about 10:04 am we restarted the service to make sure Vector was handling the data correctly and to decrease the lag in general. That worked, but again only down to a certain level, which is higher than before the internal metrics were enabled. At 10:09 am we disabled the metrics again and the consumer lag decreased to the level it was at before.

Further, you can see no significant change in messages read per second from the topic (please note only the yellow line). The differing timestamps are because we repeated the test later to investigate a bit further, which is when this screenshot was created; we did not capture one in the first run, so that we could get our system back into a known stable state.
Also, Vector itself reported these high lags per topic partition.

What we already did was increase the resources the Vector service is allowed to use, which did not change the behaviour.

To us this seems like incorrect behaviour. Of course the internal metrics come with some resource overhead, but an increase this large seems wrong.

If more information is needed, please tell us; we are happy to provide it.
Configuration
Version
vector 0.28.1 (x86_64-unknown-linux-gnu ff15924 2023-03-06)
Debug Output
No response
Example Data
No response
Additional Context
Vector is running in Kubernetes.
References
No response