vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

File reading sequence issue in Vector #18946

Open KrishnaJyothika opened 11 months ago

KrishnaJyothika commented 11 months ago


Problem

Hi,

Vector is running in our Kubernetes cluster; our use case is below.

Vector as producer

  1. Scraping metrics using prometheus_scrape source
  2. Pushing scraped metrics to Azure blob storage using the azure_blob sink (one file per minute); see the sketch after this list
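A minimal sketch of this producer pipeline, assuming Vector's YAML config format (the scrape endpoint, connection string, and container name are placeholders, not from the original report; a metric_to_log step is assumed here since the azure_blob sink consumes log events):

```yaml
# Producer sketch: scrape Prometheus metrics, write one blob roughly per minute.
# All endpoints and names are hypothetical.
sources:
  prom_in:
    type: prometheus_scrape
    endpoints:
      - http://app.example.svc:9090/metrics
    scrape_interval_secs: 15

transforms:
  as_logs:
    type: metric_to_log   # azure_blob consumes log events
    inputs:
      - prom_in

sinks:
  blob_out:
    type: azure_blob
    inputs:
      - as_logs
    connection_string: "${AZURE_STORAGE_CONNECTION_STRING}"
    container_name: scraped-metrics
    encoding:
      codec: json
    batch:
      timeout_secs: 60   # flush roughly once per minute, i.e. ~one blob per minute
```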

Vector as consumer

  1. Reading files from Azure blob storage using file_source
  2. Transforming log events from file_source to metric events using remap
  3. Pushing the metrics to Mimir; see the sketch after this list
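And a sketch of the consumer side. The paths, VRL body, metric names, and Mimir endpoint are placeholders; the report says remap is used, and this sketch also adds a log_to_metric step as an assumption, since the conversion from log events to metric events has to happen somewhere:

```yaml
data_dir: /vector-data-dir   # holds the file source checkpoints

sources:
  blob_files:
    type: file
    include:
      - /mnt/azure-blob/**/*.log
    oldest_first: true

transforms:
  shape:
    type: remap
    inputs:
      - blob_files
    source: |
      # Parse the serialized event back out of the blob line (format-dependent).
      . = parse_json!(.message)

  to_metric:
    type: log_to_metric
    inputs:
      - shape
    metrics:
      - type: gauge
        field: value          # hypothetical field name
        name: scraped_value   # hypothetical metric name

sinks:
  mimir:
    type: prometheus_remote_write
    inputs:
      - to_metric
    endpoint: https://mimir.example.com/api/v1/push
```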

The problem we're facing is with the file reading sequence; we're observing two different cases.

Case-1: Without volumeClaimTemplates for data_dir in the Vector StatefulSet. When we don't have any volumeClaimTemplates for data_dir (which contains the checkpoint file), Vector reads files in sequence (i.e., from oldest to newest). But when we stop the Vector consumer and bring it up again, Vector re-reads all the files it has already read.

Case-2: With volumeClaimTemplates for data_dir in the Vector StatefulSet. When we have volumeClaimTemplates for data_dir and we stop the Vector consumer and bring it up again, we can see it does not re-read the older files; it reads only new files. But now Vector reads files in reverse order (i.e., from newest to oldest), not from oldest to newest. This is the main problem we're facing.

Note: In both cases, oldest_first is set to true.

Please help us fix the issue. Basically, we need files to be read in sequence (oldest to newest), and when we stop the Vector consumer and bring it up again, it shouldn't re-read files it has already read.
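For context on Case-2, persisting data_dir across restarts usually looks something like the following StatefulSet excerpt; all names, sizes, and paths here are hypothetical, not taken from the report:

```yaml
# StatefulSet excerpt: persist Vector's data_dir so checkpoints survive restarts.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: vector-consumer
spec:
  serviceName: vector-consumer
  selector:
    matchLabels:
      app: vector-consumer
  template:
    metadata:
      labels:
        app: vector-consumer
    spec:
      containers:
        - name: vector
          image: timberio/vector:0.31.0-debian
          volumeMounts:
            - name: vector-data
              mountPath: /vector-data-dir   # must match data_dir in the Vector config
  volumeClaimTemplates:
    - metadata:
        name: vector-data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 1Gi
```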

Configuration

No response

Version

0.31.0

Debug Output

No response

Example Data

No response

Additional Context

No response

References

No response

neuronull commented 11 months ago

šŸ‘‹ Hello, and thanks for the report and for describing your use case so well.

Would you be able to provide a minimal configuration that reproduces your issue? This would aid in the investigation.

Thanks~

KrishnaJyothika commented 11 months ago

Hi @neuronull,

Thanks for looking into the issue

We've enabled Azure data streaming, which pushes files to our Azure blob storage. We've then mounted the blob storage into our Vector pod using the config below, and we're reading those files using the file_source.

[screenshot: volume mount configuration]
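Since the screenshot isn't reproduced here, a hypothetical equivalent of that mount (e.g. via the Azure Blob CSI driver) might look like this pod spec excerpt; the claim name and mount path are assumptions:

```yaml
# Pod spec excerpt: mount the blob container where the file_source's
# include globs point. Names and claim details are assumptions.
containers:
  - name: vector
    image: timberio/vector:0.31.0-debian
    volumeMounts:
      - name: azure-blob
        mountPath: /mnt/azure-blob
        readOnly: true
volumes:
  - name: azure-blob
    persistentVolumeClaim:
      claimName: azure-blob-pvc   # backed by the Azure Blob CSI driver
```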

We're assuming the problem is in the checkpoint file, because the volumeClaimTemplate is meant to make our data persistent, so that once we stop Vector and start it again it shouldn't re-read the files it has already read.

Attaching the config file vector.txt

Please let me know if anything else is required from my end

StephenWakely commented 11 months ago

The oldest_first option doesn't do what you want it to do. This option just causes the file source to focus on reading current files before moving on to discovering new files to read. The order those new files are read in is fairly arbitrary from what I can tell, and will just be whatever order the glob::glob_with function returns the files in.

What I believe you are asking for is a new feature whereby we sort those files by modification date before processing them.

It is worth noting that this feature would apply to all Vector installations that use the file source. The volumeClaimTemplates and Kubernetes details mentioned in this issue are red herrings.

KrishnaJyothika commented 11 months ago

Hi @StephenWakely,

Our issue is not about modification time; it is related to file creation time only.

Files are being read in reverse order. For example, with files present for 3:55, 3:56, 3:57, 3:58, 3:59, and 4:00, Vector first reads the 4:00 file, then the 3:59, 3:58, 3:57, 3:56, and 3:55 files (reverse order).

Attaching screenshots for reference

[screenshot: files present in blob storage]

[screenshot: Vector file source logs]

Please help us fix this issue.