Open KrishnaJyothika opened 11 months ago
Hello, and thanks for the report and for describing your use case well.
Would you be able to provide minimal configuration files that reproduce your issue? This would aid the investigation.
Thanks~
Hi @neuronull,
Thanks for looking into the issue
We've enabled Azure data streaming, which pushes files to our Azure Blob Storage. We've mounted that storage to our Vector pod using the config below, and we're reading those files with the file source.
We assume the problem is in the checkpoint file, because the volumeClaimTemplate is there to make our data persistent, so that once we stop Vector and start it again it doesn't read the already-read files.
Attaching the config file vector.txt
Please let me know if anything else is required from my end
The oldest_first option doesn't do what you are wanting it to do. This option just causes the file source to focus on reading current files before moving on to discover new files to read. The order that those new files are read in is fairly arbitrary from what I can tell, and will just be whatever order the glob::glob_with function returns the files in.
What I believe you are asking for is a new feature whereby we sort those files by modification date before processing the files.
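To make the requested behaviour concrete, here is a minimal sketch in Python of sorting globbed files by modification time before processing them. It is not Vector code and does not reflect how the file source is implemented; the function name files_oldest_first is hypothetical and used only for illustration.

```python
import glob
import os


def files_oldest_first(pattern):
    """Return paths matching `pattern`, sorted by modification time, oldest first.

    Hypothetical helper for illustration only; not part of Vector.
    """
    # glob.glob makes no promise about the order of its results,
    # so any ordering has to be imposed explicitly afterwards.
    paths = glob.glob(pattern)
    # Tie-break on the path so files with identical timestamps sort deterministically.
    return sorted(paths, key=lambda p: (os.path.getmtime(p), p))
```

Sorting by creation time instead would need a different key, and creation time is not available on every filesystem, which is one reason modification time is the more portable choice.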
It is worth noting that this feature would apply to all Vector installations that use the file source. volumeClaimTemplates and Kubernetes mentioned in this issue are red herrings.
Hi @StephenWakely,
Our issue is not about the modification time; it is related to file creation time only.
Files are being read in reverse order. For example, given files created at 3:55, 3:56, 3:57, 3:58, 3:59, and 4:00, when we start reading, Vector first reads the 4:00 file, then the 3:59, 3:58, 3:57, 3:56, and 3:55 files (reverse order).
Attaching screenshots for reference
Files present
Vector file source logs
Please help us to fix the issue.
Problem
Hi,
Vector is running in our Kubernetes cluster. Below is our use case:
Vector as producer
Vector as consumer
The problem we're facing is with the file reading sequence. We're observing two different cases:
Case-1: Without volumeClaimTemplates for data_dir in the Vector StatefulSet
When we don't have any volumeClaimTemplates for data_dir (which contains the checkpoint file), Vector reads the files in sequence (i.e. from oldest to newest). But when we stop the Vector consumer and bring it up again, Vector re-reads all the files it has already read.
Case-2: With volumeClaimTemplates for data_dir in the Vector StatefulSet
When we have volumeClaimTemplates for data_dir and we stop the Vector consumer and bring it up again, we can see that it does not re-read the older files; it reads only new files. But Vector reads the files in reverse order (i.e. from newest to oldest), not from oldest to newest. This is the main problem we're facing.
Note: In both the cases oldest_first is set to true.
Please help us fix the issue. Basically, we need files to be read in sequence (oldest to newest), and when we stop the Vector consumer and bring it up again, it shouldn't re-read the files it has already read.
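The difference between the two cases comes down to whether the checkpoint state survives a restart. The Python sketch below illustrates the general idea of offset checkpointing under simplifying assumptions: it is not Vector's checkpoint format, keying checkpoints by file path is an assumption made for brevity, and all function names are hypothetical.

```python
import json
import os


def load_checkpoints(path):
    """Load {file_path: byte_offset}; return an empty dict if no checkpoint file exists."""
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        return json.load(f)


def save_checkpoints(path, checkpoints):
    """Persist checkpoints by writing a temp file and renaming it over the old one."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(checkpoints, f)
    os.replace(tmp, path)


def read_new_lines(file_path, checkpoints):
    """Return lines appended since the recorded offset and advance that offset."""
    offset = checkpoints.get(file_path, 0)  # unknown file: start from the beginning
    with open(file_path, "rb") as f:
        f.seek(offset)
        data = f.read()
        checkpoints[file_path] = f.tell()
    return data.decode("utf-8").splitlines()
```

If the checkpoint file lives on storage that is discarded on restart, load_checkpoints returns an empty dict and every file is read from offset 0 again, which matches the behaviour described in Case-1. With persistent storage the saved offsets are reloaded and only new data is read, as in Case-2.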
Configuration
No response
Version
0.31.0
Debug Output
No response
Example Data
No response
Additional Context
No response
References
No response