Closed michMartineau closed 7 months ago
Hi @michMartineau !
Apologies for the delay in reproducing this. I tried today but unfortunately wasn't able to reproduce this issue using a local Kafka instance and Logstash to send data. Do you run into it consistently? If so, could you try enabling Kafka debug logging to help troubleshoot? This can be done by doing both of the following:
Setting VECTOR_LOG="librdkafka=trace,rdkafka::client=debug"
Adding librdkafka_options.debug = "all" to the kafka sink
Hi @jszwedko, thanks for the tips, I will test that today. In our non-production environments, the issue appears randomly. In our production environments, where there is more activity, the issue was more constant, so we rolled back the deployment there. I haven't found a root cause.
Hi @jszwedko, I've reproduced the issue with the configuration you gave me. I've also added the env variable RUST_BACKTRACE: full. Here is the log file: vector-kafka-sink-issue-20240207.log
I wonder whether this is more an issue with internal tracing than a Kafka issue.
Just a remark: VECTOR_LOG="librdkafka=trace,rdkafka::client=debug" does not seem to be a valid log level configuration, so the log file above contains only the output from librdkafka_options.debug = "all".
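For readers following along, here is a minimal sketch of where that option sits in a kafka sink definition. The sink name, input name, broker address, and topic are hypothetical placeholders and were not part of the configuration used in this thread; only the librdkafka_options.debug line comes from the discussion above.

```toml
# Hypothetical sink; "kafka_out", "my_source", the broker address,
# and the topic are placeholders, not values from this issue.
[sinks.kafka_out]
type = "kafka"
inputs = ["my_source"]
bootstrap_servers = "localhost:9092"
topic = "logs"
encoding.codec = "json"

# Passed through to librdkafka; enables all of its debug contexts.
librdkafka_options.debug = "all"
```

This setting is very verbose, so it is probably best kept to a troubleshooting session rather than left on in production.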
Thanks for that @michMartineau ! That is interesting. To confirm, you are only seeing this issue on 0.35.0 and never saw it on 0.34.2?
Assuming that's true, two changes jump out to me that were in the 0.35.0 release:
The tracing changes seem like a more likely culprit based on the backtrace, but I'm not seeing any smoking guns.
Indeed, I've never seen this issue with 0.34.2.
Would you be willing to try out a custom image? I can create one that reverts the tracing changes to see if that fixes it for you.
Good idea, I can try.
@michMartineau I published a custom build that is v0.35.0 with the kafka tracing changes reverted. Could you try one of these images and let me know how it goes?
https://hub.docker.com/r/timberio/vector/tags?page=1&name=0.35.0.custom.059fb1b
Hi @jszwedko, I've tested it and I was wrong: it failed. My bad. Last time I tested 0.34.2 it was OK, but the issue seems related to activity load. So I've tested all versions from 0.33.0 to 0.35.0:
| Chart | Binary | Result |
|---|---|---|
| 0.30.0 | 0.35.0 | KO |
| 0.29.1 | 0.34.2 | KO |
| 0.29.0 | 0.34.1 | KO |
| 0.28.0 | 0.34.0 | KO |
| 0.27.0 | 0.33.1 | KO |
| 0.26.0 | 0.33.0 | OK |
So it seems to be due to a change in 0.33.1. Here are the logs with version 0.33.1: vector-0.33.1.log
I wonder if it could be related to this commit https://github.com/vectordotdev/vector/commit/fa09de37c735bec57a67d78641b9db13c17097d8
Thanks for doing that analysis @michMartineau ! Let me spin up a custom build that is v0.33.1 without that commit and we can see how that looks. I agree it seems like the most suspicious.
Pushed another custom build. Can you try one of https://hub.docker.com/r/timberio/vector/tags?page=1&name=0.33.1.custom.a1b6913 ?
Yes, sure. Thanks again. I can't today, but I will do it next Monday.
Hi, I've tested version 0.33.1 and the custom version 0.33.1 without this commit several times.
Thanks for testing! We'll take a closer look at that commit and see if anything obviously jumps out.
It looks like this has been showing up more frequently now. Thanks for pointing to the exact commit. Awaiting more findings, @jszwedko. Thanks @michMartineau!
Any luck with identifying and fixing this bug?
It seems to happen during rebalancing, or when some change on the Kafka side triggers a change in Vector. (This is just a guess based on closer debugging of the issue. I may be wrong, but this is my initial understanding.)
We think we may have found the issue, fixing it in https://github.com/vectordotdev/vector/pull/20001. I created another set of custom docker images to try: https://hub.docker.com/r/timberio/vector/tags?page=1&name=0.35.1.custom.34bc8a. @michMartineau (or others) do you mind giving those a try and letting us know if they fix the issue for you?
@jszwedko sure I will test that.
@jszwedko I did some tests yesterday and today, switching between 0.36.0 and the custom version. With 0.36.0-alpine, I saw some restarts; with the custom version, none. So it looks good to me.
Thanks for verifying!
Problem
We've upgraded our Vector pods from v0.30.0 (chart 0.26.0) to 0.35.0 (chart: 0.30.0). Vector fails and the pod restarts (crash loop backoff).
It seems related to the kafka sink (see related logs). I've tested 0.34.2 (chart: 0.29.3) successfully.
Configuration
Version
0.35.0
Debug Output