Open Rajin9601 opened 4 months ago
Hi @Rajin9601 ,
Thanks for this report. Is there any chance you might be able to upgrade one minor version at a time? I.e. 0.23.X -> 0.24.X? That might help narrow down the version where a behavior change occurred.
Hi @jszwedko Yes. I did the upgrade one version at a time, and found the specific version. The problem occured when i upgrade vector agent from 27.1 to 28.0.
Also, I found out that vector agent not only consumes the rotated logs, but also cosume the same file again. For example, when I upgrade the vector agent at 18:36, vector agent consumed 18:04's log again. (it was identical to the log that was consume at 18:04)
Looking at the changelog, maybe this PR https://github.com/vectordotdev/vector/pull/15899 is relevant..?
Hi @jszwedko Yes. I did the upgrade one version at a time, and found the specific version. The problem occured when i upgrade vector agent from 27.1 to 28.0.
Also, I found out that vector agent not only consumes the rotated logs, but also cosume the same file again. For example, when I upgrade the vector agent at 18:36, vector agent consumed 18:04's log again. (it was identical to the log that was consume at 18:04)
Looking at the changelog, maybe this PR #15899 is relevant..?
That does look suspicious, but the code is expected to handle checksums calculated by the previous version of Vector (in this case v0.27.0). There could be a bug though.
I'm not seeing any other suspicious commits in that diff so I'm guessing maybe there is an issue with the one you found.
Hi, @jszwedko
While looking at the code, I may have found the reason. The checksum calculation PR (https://github.com/vectordotdev/vector/pull/15899) handles previous version's checksum by migrating previous checksum to new checksum at the FileServer's start-up. https://github.com/vectordotdev/vector/blob/695bb476d6116ba65d57a77aa0ac655af76f063a/lib/file-source/src/file_server.rs#L126-L135 I found that kubernetes_log's PathProvider doesn't load paths at the start-up (because it takes some time). So for the kubernetes_log, migration (function maybe_upgrade) will not be executed.
The simplest solution I can think of is to ‘check migration whenever a new file appears.’ It uses some CPU, but I don’t think it’s too computationally intensive. (example commit : https://github.com/Rajin9601/vector/commit/1a7b8e55c7dd7e89b557485cf84cf44bdbbe9a6b)
I want to hear your opinion on the solution.
Aha! Good find @Rajin9601 . I think you are right. I agree, doing it on new file seems reasonable. I'd be happy to see your commit as a PR.
Good! I created PR https://github.com/vectordotdev/vector/pull/21231.
A note for the community
Problem
I encountered an issue when upgrading the Vector agent image from 0.23.0 to 0.36.1 using the Helm chart.
After
helm upgrade
, I observed a spike in log ingestion. (Normally, vector agent processes around 100 logs per minute, but right after the helm upgrade, it consumed around 10000 logs per minute for a few minutes)I discovered that the agent was consuming logs that were 1-2 days old. These logs were from rotated files, with filenames ending with
0.log.20240705-010753
example:/var/log/pods/argo-rollouts_argo-rollouts-8444487787-c2dsb_fe69e7cf-cf6f-4c26-9cf3-a015d631a16d/argo-rollouts/0.log.20240705-010753
I couldn't find any relevent changes in the upgrade guides.
Additional informations
Configuration
Version
0.23.0 -> 0.36.1
Debug Output
No response
Example Data
No response
Additional Context
No response
References
No response