Open · @newly12 opened 5 months ago
Pinging code owners:
See Adding Labels via Comments if you do not have permissions to add labels yourself.
cc: @ycombinator @ChrsMark
Thanks for reporting this @newly12. I will try to have a look next week (after the 8th).
Hey folks, I had a look into this. To me it seems that it is not an issue with the `exclude_older_than` filter specifically, but with filters in general. I guess the same would happen if we use `ordering_criteria.top_n`: with a specific sequence of file changes we could stop matching one file and then match it again.
> plus 3 poll intervals its fingerprint is removed from the manager known file list
That seems to be the root cause. At the end of every poll cycle we shift the known files by one:
So if a file is not matched again within 3 polls, it is shifted out.
(`EndConsume` on each poll cycle copies `t.previousPollFiles` to `t.knownFiles[0]` through `ClosePreviousFiles()`; that's how a re-matched file makes it to the beginning of the buffer again.)
Making the `knownFiles` buffer size configurable at https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/5bdedd0c4e40cc9bb378c83442ed7f8c64a6a4ce/pkg/stanza/fileconsumer/internal/tracker/tracker.go#L42 doesn't sound like a good solution to me, because it can easily break for files that are not updated for long periods.
The only way I see it being properly fixed would be to keep track of the diffs between the files that are matched with these specific time-sensitive filters. Not sure how much work this would require though. @djaglowski feel free to correct me here or if you have any ideas.
Thanks for looking into this @ChrsMark. I agree this could be a problem with any filter. What it ultimately boils down to is that when we apply a filter, we are deciding not only which files to read but also which files to track. Filters were designed with reading in mind, but this issue is a good example of why reading and tracking may need to be decoupled.
I think the next step to solving this would be to propose requirements describing what a user needs, in terms of filtering for the sake of tracking vs reading. Do we need two independent layers of filtering, first for tracking and second for reading? Or maybe just a flag is sufficient, which could indicate "track all files but only read those which pass the filters"?
We could design an implementation after that, but I imagine we might need the matcher to return two sets of files, one which are tracked and read, the other which are tracked but not read.
Any thoughts on what the user requirements should be here?
Removing `needs triage` as this is a valid problem that needs to be resolved, with the understanding that more discussion is required here.
> Any thoughts on what the user requirements should be here?
- As a user I would expect that once a file is matched, the Collector should "remember" it.
- As a user I would expect that `start_at: end` implies that only fresh logs are collected.
Based on the above, an additional perspective on the issue is that the setting `start_at: end` is not honored when `exclude_older_than` is also set.
This means that when the Collector starts fresh and a file is not matched because its content is old, if a new log line is written to the file the Collector will match it and the whole content from the very beginning will be collected. That looks a bit weird to my eyes as a user.
- For subsequent polls: "As a user I would expect that once a file is matched the Collector should 'remember' it."
- For restarts/fresh-starts: "As a user I would expect that `start_at: end` implies that only fresh logs are collected."
At a high level, these are the behaviors we have already but we need requirements that address the exceptions to these behaviors.
We need to separate "matched" files into two separate concepts so that we can track those which are filtered out. The question is whether/how this should be exposed to the user.
We cannot track all files forever and users may want to filter specifically to avoid the memory/cpu impact of matching too many files, so I tend to think we need a design which gives the user full control.
@djaglowski I would like to take care of this issue. Can you assign it to me?
@VihasMakwana I think we are still in need of a design proposal before implementation. Do you have something in mind?
@djaglowski
> Do you have something in mind?
Yes, I'll try my best to expand upon my idea. Correct me if I'm wrong with any assumptions @djaglowski.
TL;DR: we can utilize a Bloom filter. This will take constant space and time, as hash sizes are fixed and we rely on bitwise ops.
The requirements from a user's perspective are:
From Wikipedia:
> A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not – in other words, a query returns either "possibly in set" or "definitely not in set".
Our main problem is that the `fileconsumer` filters out the file if it's older than `exclude_older_than`, even if it was read before.
My idea, roughly:
```
filter = 0
for file_name in matched_files {
    file_hash = hash(file_name)
    if fileAge < exclude_older_than {
        // continue with our flow
    } else if file_hash & filter == file_hash {
        // We've probably seen this file before. Don't filter it out.
        // The chance of a false positive is pretty low.
    }
    // add the hash for future matches
    filter |= file_hash
}
```
Fingerprints are growing entities, so they won't work with a Bloom filter; it would take extra time to identify a known file based on fingerprints.
- What if files are rotated out of the pattern?
- What if the rotated (old) file is still in the pattern? Then `modify_time <= exclude_older_than` and the new name will remain in our record.
- A Bloom filter is a probabilistic data structure, so we might add files which were never read before; this needs an additional check in `fileconsumer`'s core logic. But we guarantee that once a file is added, it will remain in our record and we will avoid duplication.
Working on this part. @djaglowski any recommendations on this part?
@VihasMakwana, thanks for the detailed proposal. I think the idea of using a bloom filter is interesting but I don't think it's enough to know if we've seen the file before (precisely or otherwise). We also need to recall the exact offset for each file in order to read only the new logs that it contains.
@djaglowski
That should be tackled by the Bloom filter. We will essentially calculate a hash of the file name and update the filter (i.e. do a bitwise OR). During subsequent iterations, we will calculate the bitwise AND of the filter and the hash.
We will always have a previously read file in `MatchedFiles`, and this would keep the latest offsets in the `knownFiles` array.
> I don't think it's enough to know if we've seen the file before (precisely or otherwise)
That's true. But it's enough to ascertain that we've NOT seen a file before.
If we use a Bloom filter with an additional check in `fileconsumer`'s core logic, we can be sure to only include previously opened files (if `fileAge > exclude_older_than`).
I can work on a POC PR and I'll add test cases to confirm this. Would you be interested in that?
> We will always have a previously read file in `MatchedFiles` and this would keep the latest offsets in the `knownFiles` array.
Isn't this incompatible with the requirement to limit the memory size?
> I don't think it's enough to know if we've seen the file before (precisely or otherwise)
That's true. But it's enough to ascertain that we've NOT seen a file before.
If the Bloom filter tells us that we've probably seen a file before, but we don't remember how much of it we've already read, then what do we do with the file?
> We will always have a previously read file in MatchedFiles and this would keep the latest offsets in the knownFiles array.
>
> Isn't this incompatible with the requirement to limit the memory size?
Yes, but we can avoid it with a new config option, perhaps something like:

```
if file_age > exclude_older_than && file is in filter {
    if file_age > max_allowed_age {
        // quite old, stop tracking.
    }
}
```
In my opinion, we should keep the current behaviour enabled by default and document the memory increase the user might face if they enable the new behaviour. There's a tradeoff here.
> If the Bloom filter tells us that we've probably seen a file before, but we don't remember how much of it we've already read, then what do we do with the file?
We rely on the Bloom filter only for older files, so something like this:

```
if we remember last known offset:
    // continue reading from that offset
else:
    // ignore the file for now and move on.
    // We will read it only when file's age reaches below exclude_older_than.
```

Here, we ensure that the Bloom filter has an effect only on older files. The behaviour remains unchanged if a file is recently updated.
The chances of finding a never seen file in the filter are very low (considering we use a large enough filter). I ran some benchmarks for this using https://github.com/bits-and-blooms/bloom/ to confirm it. Memory footprint is also quite low.
I don't see how your response answers my question. The issue at hand is that we forget about the offset for a file that hasn't been updated recently.
What you are describing as a solution sounds to me like the current behavior that we are trying to fix. Specifically this part:
> `// We will read it only when file's age reaches below exclude_older_than.`
This is exactly the problem which the issue describes. We can't read it appropriately if we don't know the last offset. If we read from the beginning, we duplicate logs. If we read from the end, we lose logs.
Quite an interesting discussion, thanks @VihasMakwana @djaglowski!
Just gave some thought to it based on your discussion. Hope I'm not going too wild (or off-topic) :):
<Brainstorming section>
Since it's obvious that there is a need to "persist" the offset information of the "forgotten" files, how about leveraging an LLC-like approach (as in a last-level cache): store the "warm" info in memory while keeping the "cold" data on disk?
Warm data is always less than cold data, hence memory allocation/utilization will be less than disk usage.
The Bloom filter (if it comes cheap) can also be used to determine whether an access to this "caching" hierarchy is required at all, saving access IO, especially for the disk accesses.
The algorithm can be like this:
0. File is detected
1. Check if the file is known already:
a. if it's known the flow continues as usual
b. if it's "new" go to step 2
2. optional: use the Bloom filter to determine if the file has been seen in the past (saves cycles from further accessing the "caches") [`this can be problematic upon restarts`, so I'm not sure if we could actually use it]
a. if the file is new we read it from the beginning
b. if the file has been matched in the past continue to step 3
3. Access the in-memory store/cache to get the file's offset
a. [cache hit] if file is found get its offset and continue reading it
b. [cache miss] if file is not found continue to step 4
4. Access the on-disk store/cache to get the file's offset
a. [cache hit] if the file is found, get its offset and continue reading it
b. [cache miss] ERROR - we should not end up here. Step 2 should have prevented that.
The on-disk store/cache can rely on the `file_storage` feature. In case it's not enabled, the algorithm relies only on the in-memory store.</Brainstorming section>
@ChrsMark I thought of something similar, i.e. incorporating the filestorage extension.
Thanks for your algorithm, Chris!
@djaglowski
> We will always have a previously read file in MatchedFiles and this would keep the latest offsets in the knownFiles array.
>
> Isn't this incompatible with the requirement to limit the memory size?
We can make use of `filestorage` as mentioned by Chris.
Currently, we do:
`t.knownFiles[0] -> t.knownFiles[1] -> t.knownFiles[2]`
and discard the previous `t.knownFiles[2]`.
Instead of discarding, we can move it to cold storage (`filestorage` would be preferred here).
This way, we would achieve:
If the user wants to limit the warm on-disk storage size as well, then we can use a `ttl`-based mechanism to determine whether a file's offset is safe to remove, based on its update time.
If the particular file's metadata in warm on-disk storage is quite old (we can introduce a config for this), we will remove it from the storage. If that file appears again, it will be read from the beginning. The user needs to set the new config appropriately.
The user can disable this option if they can afford the offsets to be stored forever (not preferred, but up to the user to decide).
> If the Bloom filter tells us that we've probably seen a file before, but we don't remember how much of it we've already read, then what do we do with the file?
There are two cases to consider:
- `ttl`-based cleaning is enabled:
- `ttl`-based cleaning is disabled:

If we don't find an offset, we will read the file from the beginning.
The Bloom filter is optional.
In my original approach, I used the Bloom filter to determine whether we've probably seen a file or not, to save time on matching, with O(1) complexity.
EDIT: @djaglowski I accidentally interchanged `warm` and `cold` at a couple of places. My bad 🙏.
Memory storage will always be limited. On-disk storage is also limited by default, but the user has the option to keep offsets forever.
It goes without saying that we need to document all the things thoroughly.
@djaglowski Just checking in—let me know if you have any thoughts!
My thoughts:
> We can make use of filestorage as mentioned by Chris. Currently, we do:
> t.knownFiles[0] -> t.knownFiles[1] -> t.knownFiles[2]
> and discard the previous t.knownFiles[2]. Instead of discarding, we can move it to a cold storage
This makes sense to me but we need to keep in mind that in order to reduce the memory footprint, we must interact with storage carefully. In other words, if a user decides to remember files for 100 poll intervals, we should still only ever have files from one of those poll intervals in memory at a given time (or maybe a configurable number, typically much less than 100)
A possible implementation would be to manage these archived sets of files as a ring buffer in storage, where each set of files is stored under a specific index. Instead of discarding `t.knownFiles[2]`, write it to the next index and eventually roll over, e.g. separate storage keys `knownFilesArchive0`, `knownFilesArchive1`, ..., `knownFilesArchiveN`, rolling over back to `knownFilesArchive0`. The main point here is that there's a memory footprint incurred based on the number of file sets we load at a time, so we have to limit this or we aren't actually improving anything.
A consequence of the above is that this access pattern is unlike how we currently search for known files in memory, where we search for one file at a time, going backwards through sets of known files. This would mean accessing storage repeatedly for each file. (e.g. The first file requires loading 25 archived files sets before finding a match. Then the second file loads all archived file sets individually before concluding there is no match.) Instead, we should set aside files which are not matched in memory, then work through archived file sets one at a time. Load the most recent index from storage, check all the unmatched files against it. If any are found, remove them from the set and rewrite it to storage. Then load the next most recent index from storage, check the remaining unmatched files against it, etc.
My thoughts:
- We should keep this as simple as possible, at least initially.
- The bloom filter is an interesting idea but because it comes with uncertainty it also introduces complexity that may be unnecessary. Let's set it aside for now and revisit if we need additional optimization.
- A storage extension seems like a natural way to remember more files.
- I believe there always needs to be a limit to how many files can be remembered. That limit can be configured by the user to be very large, but allowing it to be infinite should require a dedicated justification.
Agree with you. We should break these changes in series of PRs.
> A consequence of the above is that this access pattern is unlike how we currently search for known files in memory, where we search for one file at a time, going backwards through sets of known files. This would mean accessing storage repeatedly for each file. (e.g. The first file requires loading 25 archived files sets before finding a match. Then the second file loads all archived file sets individually before concluding there is no match.) Instead, we should set aside files which are not matched in memory, then work through archived file sets one at a time. Load the most recent index from storage, check all the unmatched files against it. If any are found, remove them from the set and rewrite it to storage. Then load the next most recent index from storage, check the remaining unmatched files against it, etc.
That makes sense. We should minimize memory footprint as much as possible.
If I understand correctly, you mean something like the following:

```
paths = matcher.MatchFiles()
unmatched_files = []
for _, path := range paths {
    fp := fingerprint(path)
    if fp in tracker {
        // continue as per current implementation
    } else {
        // set aside unmatched files for later matching
        unmatched_files = append(unmatched_files, path)
    }
}
...
// Open archived filesets one by one
for i := 0; i < CONFIGURED_LIMIT; i++ {
    archives := persister.Get(fmt.Sprintf("knownFilesArchive%d", i))
    // loop through unmatched_files;
    // if a file is in archives, remove it from archives, update the
    // archive, and read from the known offset.
}
// all the remaining unmatched_files will be treated as new files
```
Correct me if I'm wrong @djaglowski
We ensure to access storage in an incremental manner and minimize memory footprint.
FWIW, we respect `exclude_older_than` while matching files and only include recently updated files.
@VihasMakwana that looks about like I was imagining. (We could in theory also invert the in-memory search pattern so that it's the same as for storage, but this can be a considered later.)
@djaglowski Thanks!
I’m leaning towards creating separate, incremental PRs. Here’s a general outline:
PR 1:
PR 2:
Final PR: … `fileconsumer`.

@VihasMakwana that plan sounds good. Thanks for taking this on.
@djaglowski If you have a moment, I’ve made further progress on the second PR. The first PR is currently under review, and I’ve created the second draft PR from the branch of the first PR.
Hi @djaglowski
I'd like your opinion on something. Currently, I've placed the reader creation for unmatched files within the archive module. However, it seems to be complicating things. From what I understand, it's better to keep the reader logic separate from other modules. What do you think?
I'll update the PR if you agree. Let me know
Component(s)
pkg/stanza
What happened?
Description
Old logs could be collected again when `exclude_older_than` is enabled.

Steps to Reproduce
put in a for loop to generate new logs every few seconds (longer than the `exclude_older_than` duration), i.e.:

From the collector logs:
Basically, if a file isn't updated for the `exclude_older_than` duration plus 3 poll intervals, its fingerprint is removed from the `manager`'s known file list; later, when new content is added to the file, it is considered to be a new file and will be read from the beginning again.

Expected Result
only new logs are collected
Actual Result
old logs are collected again
Collector version
v0.99.0
Environment information
Environment
OS: (e.g., "Ubuntu 20.04")
Compiler (if manually compiled): (e.g., "go 14.2")
OpenTelemetry Collector configuration
Log output
No response
Additional context
`pkg/stanza` was updated with the previous fix https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/32683 included.