vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0
18.26k stars 1.6k forks source link

kubernetes_log source consumes already consumed logs again when agent upgrade from 0.27.1 to newer version #20798

Open Rajin9601 opened 4 months ago

Rajin9601 commented 4 months ago

A note for the community

Problem

I encountered an issue when upgrading the Vector agent image from 0.23.0 to 0.36.1 using the Helm chart.

After helm upgrade, I observed a spike in log ingestion. (Normally, vector agent processes around 100 logs per minute, but right after the helm upgrade, it consumed around 10000 logs per minute for a few minutes)

I discovered that the agent was consuming logs that were 1-2 days old. These logs were from rotated files, with filenames ending with 0.log.20240705-010753 example: /var/log/pods/argo-rollouts_argo-rollouts-8444487787-c2dsb_fe69e7cf-cf6f-4c26-9cf3-a015d631a16d/argo-rollouts/0.log.20240705-010753

Screenshot 2024-07-05 at 6 01 46 PM

I couldn't find any relevent changes in the upgrade guides.

Additional informations

Configuration

role: Agent

image:
  tag: 0.36.1-distroless-libc

customConfig:
  data_dir: /vector-data-dir
  api:
    enabled: true
    address: 127.0.0.1:8686
    playground: false
  sources:
    kubernetes_logs:
      type: kubernetes_logs
... (transforms and sinks settings)

Version

0.23.0 -> 0.36.1

Debug Output

No response

Example Data

No response

Additional Context

No response

References

No response

jszwedko commented 4 months ago

Hi @Rajin9601 ,

Thanks for this report. Is there any chance you might be able to upgrade one minor version at a time? I.e. 0.23.X -> 0.24.X? That might help narrow down the version where a behavior change occurred.

Rajin9601 commented 4 months ago

Hi @jszwedko Yes. I did the upgrade one version at a time, and found the specific version. The problem occured when i upgrade vector agent from 27.1 to 28.0.

Also, I found out that vector agent not only consumes the rotated logs, but also cosume the same file again. For example, when I upgrade the vector agent at 18:36, vector agent consumed 18:04's log again. (it was identical to the log that was consume at 18:04)

Looking at the changelog, maybe this PR https://github.com/vectordotdev/vector/pull/15899 is relevant..?

jszwedko commented 4 months ago

Hi @jszwedko Yes. I did the upgrade one version at a time, and found the specific version. The problem occured when i upgrade vector agent from 27.1 to 28.0.

Also, I found out that vector agent not only consumes the rotated logs, but also cosume the same file again. For example, when I upgrade the vector agent at 18:36, vector agent consumed 18:04's log again. (it was identical to the log that was consume at 18:04)

Looking at the changelog, maybe this PR #15899 is relevant..?

That does look suspicious, but the code is expected to handle checksums calculated by the previous version of Vector (in this case v0.27.0). There could be a bug though.

jszwedko commented 4 months ago

I'm not seeing any other suspicious commits in that diff so I'm guessing maybe there is an issue with the one you found.

Rajin9601 commented 2 months ago

Hi, @jszwedko

While looking at the code, I may have found the reason. The checksum calculation PR (https://github.com/vectordotdev/vector/pull/15899) handles previous version's checksum by migrating previous checksum to new checksum at the FileServer's start-up. https://github.com/vectordotdev/vector/blob/695bb476d6116ba65d57a77aa0ac655af76f063a/lib/file-source/src/file_server.rs#L126-L135 I found that kubernetes_log's PathProvider doesn't load paths at the start-up (because it takes some time). So for the kubernetes_log, migration (function maybe_upgrade) will not be executed.

The simplest solution I can think of is to ‘check migration whenever a new file appears.’ It uses some CPU, but I don’t think it’s too computationally intensive. (example commit : https://github.com/Rajin9601/vector/commit/1a7b8e55c7dd7e89b557485cf84cf44bdbbe9a6b)

I want to hear your opinion on the solution.

jszwedko commented 2 months ago

Aha! Good find @Rajin9601 . I think you are right. I agree, doing it on new file seems reasonable. I'd be happy to see your commit as a PR.

Rajin9601 commented 2 months ago

Good! I created PR https://github.com/vectordotdev/vector/pull/21231.