Open · @newly12 opened 5 months ago
Pinging code owners:
See Adding Labels via Comments if you do not have permissions to add labels yourself.
cc: @ycombinator @ChrsMark
Thanks for reporting this @newly12. I will try to have a look next week (after the 8th).
Hey folks, I had a look into this. To me it seems that it is not an issue with the `exclude_older_than` filter specifically, but with filters in general. I guess the same would happen if we use `ordering_criteria.top_n`: with a specific sequence of file changes we could stop matching one file and then match it again.
> plus 3 poll intervals its fingerprint is removed from the manager known file list
That seems to be the root cause. At the end of every poll cycle we shift the known files by one:
So if a file is not matched again within 3 polls, it is shifted out.
(`EndConsume` on each poll cycle copies `t.previousPollFiles` to `t.knownFiles[0]` through `ClosePreviousFiles()`; that's how a re-matched file makes it to the beginning of the buffer again.)
Making the `knownFiles` buffer size configurable at https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/5bdedd0c4e40cc9bb378c83442ed7f8c64a6a4ce/pkg/stanza/fileconsumer/internal/tracker/tracker.go#L42 doesn't sound like a good solution to me, because it can easily break for files that are not updated for long periods.
The only way I see it being properly fixed would be to keep track of the diffs between the files that are matched with these specific time-sensitive filters. Not sure how much work this would require though. @djaglowski feel free to correct me here or if you have any ideas.
Thanks for looking into this @ChrsMark. I agree this could be a problem with any filter. What it ultimately boils down to is that when we apply a filter, we are deciding not only which files to read but also which files to track. Filters were designed with reading in mind, but this issue is a good example of why reading and tracking may need to be decoupled.
I think the next step to solving this would be to propose requirements describing what a user needs, in terms of filtering for the sake of tracking vs reading. Do we need two independent layers of filtering, first for tracking and second for reading? Or maybe just a flag is sufficient, which could indicate "track all files but only read those which pass the filters"?
We could design an implementation after that, but I imagine we might need the matcher to return two sets of files, one which are tracked and read, the other which are tracked but not read.
Any thoughts on what the user requirements should be here?
Removing `needs triage` as this is a valid problem that needs to be resolved, with the understanding that more discussion is required here.
> Any thoughts on what the user requirements should be here?
- As a user I would expect that once a file is matched, the Collector should "remember" it.
- As a user I would expect that `start_at: end` implies that only fresh logs are collected.
Based on the above, an additional perspective on the issue is that the setting `start_at: end` is not honored when `exclude_older_than` is also set.
This means that when the Collector starts fresh and a file is not matched because its content is old, if a new log line is written to the file the Collector will match it and the whole content from the very beginning will be collected. That looks a bit weird to my eyes as a user.
- For subsequent polls: "As a user I would expect that once a file is matched the Collector should 'remember' it."
- For restarts/fresh-starts: "As a user I would expect that `start_at: end` implies that only fresh logs are collected."
At a high level, these are the behaviors we have already but we need requirements that address the exceptions to these behaviors.
We need to separate "matched" files into two separate concepts so that we can track those which are filtered out. The question is whether/how this should be exposed to the user.
We cannot track all files forever and users may want to filter specifically to avoid the memory/cpu impact of matching too many files, so I tend to think we need a design which gives the user full control.
@djaglowski I would like to take care of this issue. Can you assign it to me?
@VihasMakwana I think we are still in need of a design proposal before implementation. Do you have something in mind?
@djaglowski
> Do you have something in mind?
Yes, I'll try my best to expand upon my idea. Correct me if I'm wrong with any assumptions @djaglowski.
TL;DR: we can utilize a Bloom filter. This will take constant space and time, as hash sizes are fixed and we rely on bitwise ops.
The requirements from a user's perspective are:
From Wikipedia:
> A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not – in other words, a query returns either "possibly in set" or "definitely not in set".
Our main problem is that the `fileconsumer` filters out the file if it's older than `exclude_older_than`, even if it was read before.
My idea, roughly:
```
filter = 0
for file_name in matched_files {
    file_hash = hash(file_name)
    if fileAge < exclude_older_than {
        // continue with our flow
    } else if file_hash & filter == file_hash {
        // We've probably seen this file before. Don't filter it out.
        // The chance of a false positive is pretty low.
    }
    // add the hash for future matches
    filter |= file_hash
}
```
Fingerprints are growing entities, so they won't work with a Bloom filter; it would take extra time to identify a known file based on fingerprints.
- What if files are rotated out of the pattern?
- What if the rotated (old) file is still in the pattern? Then `modify_time <= exclude_older_than` and the new name will remain in our record.
- A Bloom filter is a probabilistic data structure, so we might add files which were never read before; this needs an additional check in `fileconsumer`'s core logic. But we guarantee that once a file is added, it will remain in our record and we will avoid duplication.
Working on this part. @djaglowski any recommendations on this part?
@VihasMakwana, thanks for the detailed proposal. I think the idea of using a bloom filter is interesting but I don't think it's enough to know if we've seen the file before (precisely or otherwise). We also need to recall the exact offset for each file in order to read only the new logs that it contains.
@djaglowski
That should be tackled by the Bloom filter. We will essentially calculate a hash of the file name and update the filter (i.e. do a bitwise OR). During subsequent iterations, we will calculate the bitwise AND of the filter and the hash.
We will always have a previously read file in `MatchedFiles`, and this would keep the latest offsets in the `knownFiles` array.
> I don't think it's enough to know if we've seen the file before (precisely or otherwise)
That's true. But it's enough to ascertain that we've NOT seen a file before.
If we use a Bloom filter with an additional check in `fileconsumer`'s core logic, we can be sure to only include previously opened files (if `fileAge > exclude_older_than`).
I can work on a POC PR and I'll add test cases to confirm this. Would you be interested in that?
> We will always have a previously read file in `MatchedFiles` and this would keep the latest offsets in the `knownFiles` array.
Isn't this incompatible with the requirement to limit the memory size?
> I don't think it's enough to know if we've seen the file before (precisely or otherwise)
That's true. But it's enough to ascertain that we've NOT seen a file before.
If the Bloom filter tells us that we've probably seen a file before, but we don't remember how much of it we've already read, then what do we do with the file?
> We will always have a previously read file in MatchedFiles and this would keep the latest offsets in the knownFiles array.
>
> Isn't this incompatible with the requirement to limit the memory size?
Yes, but we can avoid it with a new config option, perhaps something like:

```
if file_age > exclude_older_than && file is in filter {
    if file_age > max_allowed_age {
        // quite old, stop tracking.
    }
}
```
In my opinion, we should keep the current behaviour enabled by default and document the memory increase the user might face if they enable the new behaviour. There's a tradeoff here.
> If the Bloom filter tells us that we've probably seen a file before, but we don't remember how much of it we've already read, then what do we do with the file?
We rely on the Bloom filter only for older files, so something like this:

```
if we remember last known offset:
    // continue reading from that offset
else:
    // ignore the file for now and move on.
    // We will read it only when file's age reaches below exclude_older_than.
```

Here, we ensure that the Bloom filter has an effect only on older files. The behaviour remains unchanged if a file is recently updated.
The chances of finding a never seen file in the filter are very low (considering we use a large enough filter). I ran some benchmarks for this using https://github.com/bits-and-blooms/bloom/ to confirm it. Memory footprint is also quite low.
I don't see how your response answers my question. The issue at hand is that we forget about the offset for a file that hasn't been updated recently.
What you are describing as a solution sounds to me like the current behavior that we are trying to fix. Specifically this part:
> `// We will read it only when file's age reaches below exclude_older_than.`
This is exactly the problem which the issue describes. We can't read it appropriately if we don't know the last offset. If we read from the beginning, we duplicate logs. If we read from the end, we lose logs.
Quite an interesting discussion, thanks @VihasMakwana @djaglowski!
Just gave some thought to it based on your discussion. Hope I'm not going too wild (or off-topic) :):
<Brainstorming section>
Since it's obvious that there is a need to "persist" the offset information of the "forgotten" files, how about leveraging an LLC-like approach (as in a last-level cache): store the "warm" info in memory while keeping the "cold" data on disk?
Warm data is always less than cold data, hence memory allocation/utilization will be less than disk usage.
The Bloom filter (if it comes cheap) can also be used to determine whether an access to this "caching" hierarchy is required at all, saving access IO, especially for the disk accesses.
The algorithm can be like this:
0. File is detected
1. Check if the file is known already:
a. if it's known the flow continues as usual
b. if it's "new" go to step 2
2. optional: use the Bloom filter to determine if the file has been seen in the past (saves cycles from further accessing the "caches") [`this can be problematic upon restarts`, so I'm not sure if we could actually use it]
a. if the file is new we read it from the beginning
b. if the file has been matched in the past continue to step 3
3. Access the in-memory store/cache to get the file's offset
a. [cache hit] if file is found get its offset and continue reading it
b. [cache miss] if file is not found continue to step 4
4. Access the on-disk store/cache to get the file's offset
a. [cache hit] if the file is found, get its offset and continue reading it
b. [cache miss] ERROR - we should not end up here. Step 2 should have prevented that.
The on-disk store/cache can rely on the `file_storage` feature. In case it's not enabled, the algorithm relies only on the in-memory store.</Brainstorming section>
@ChrsMark I thought of something similar, i.e. incorporating the filestorage extension.
Thanks for your algorithm, Chris!
@djaglowski
> We will always have a previously read file in MatchedFiles and this would keep the latest offsets in the knownFiles array.
>
> Isn't this incompatible with the requirement to limit the memory size?
We can make use of `filestorage` as mentioned by Chris.
Currently, we do:
`t.knownFiles[0] -> t.knownFiles[1] -> t.knownFiles[2]`
and discard the previous `t.knownFiles[2]`.
Instead of discarding, we can move it to cold storage (`filestorage` would be preferred here).
This way, we would achieve:
If the user wants to limit the warm on-disk storage size as well, then we can use a `ttl`-based mechanism to determine whether a file's offset is safe to remove, based on its update time.
If the particular file's metadata in warm on-disk storage is quite old (we can introduce a config for this), we will remove it from the storage. If that file appears again, it will be read from the beginning. The user needs to set the new config appropriately.
The user can disable this option if they can afford the offsets to be stored forever (not preferred, but up to the user to decide).
> If the Bloom filter tells us that we've probably seen a file before, but we don't remember how much of it we've already read, then what do we do with the file?
There are two cases to consider:
- `ttl`-based cleaning is enabled:
- `ttl`-based cleaning is disabled:

If we don't find an offset, we will read the file from the beginning.
The Bloom filter is optional.
In my original approach, I used the Bloom filter to determine whether we've probably seen a file or not, to save time on matching, with O(1) complexity.
EDIT: @djaglowski I accidentally interchanged `warm` and `cold` at a couple of places. My bad 🙏.
Memory storage will always be limited. On-disk storage is also limited by default, but the user has the option to keep offsets forever.
It goes without saying that we need to document all the things thoroughly.
@djaglowski Just checking in—let me know if you have any thoughts!
My thoughts:
> We can make use of filestorage as mentioned by Chris. Currently, we do:
> t.knownFiles[0] -> t.knownFiles[1] -> t.knownFiles[2]
> and discard the previous t.knownFiles[2]. Instead of discarding, we can move it to a cold storage
This makes sense to me but we need to keep in mind that in order to reduce the memory footprint, we must interact with storage carefully. In other words, if a user decides to remember files for 100 poll intervals, we should still only ever have files from one of those poll intervals in memory at a given time (or maybe a configurable number, typically much less than 100)
A possible implementation would be to manage these archived sets of files as a ring buffer in storage, where each set of files is stored under a specific index. Instead of discarding `t.knownFiles[2]`, write it to the next index and eventually roll over, e.g. separate storage keys `knownFilesArchive0`, `knownFilesArchive1`, ..., `knownFilesArchiveN`, rolling over back to `knownFilesArchive0`. The main point here is that there's a memory footprint incurred based on the number of file sets we load at a time, so we have to limit this or we aren't actually improving anything.
A consequence of the above is that this access pattern is unlike how we currently search for known files in memory, where we search for one file at a time, going backwards through sets of known files. This would mean accessing storage repeatedly for each file. (e.g. The first file requires loading 25 archived files sets before finding a match. Then the second file loads all archived file sets individually before concluding there is no match.) Instead, we should set aside files which are not matched in memory, then work through archived file sets one at a time. Load the most recent index from storage, check all the unmatched files against it. If any are found, remove them from the set and rewrite it to storage. Then load the next most recent index from storage, check the remaining unmatched files against it, etc.
My thoughts:
- We should keep this as simple as possible, at least initially.
- The bloom filter is an interesting idea but because it comes with uncertainty it also introduces complexity that may be unnecessary. Let's set it aside for now and revisit if we need additional optimization.
- A storage extension seems like a natural way to remember more files.
- I believe there always needs to be a limit to how many files can be remembered. That limit can be configured by the user to be very large, but allowing it to be infinite should require a dedicated justification.
Agree with you. We should break these changes in series of PRs.
> A consequence of the above is that this access pattern is unlike how we currently search for known files in memory, where we search for one file at a time, going backwards through sets of known files. This would mean accessing storage repeatedly for each file. (e.g. The first file requires loading 25 archived files sets before finding a match. Then the second file loads all archived file sets individually before concluding there is no match.) Instead, we should set aside files which are not matched in memory, then work through archived file sets one at a time. Load the most recent index from storage, check all the unmatched files against it. If any are found, remove them from the set and rewrite it to storage. Then load the next most recent index from storage, check the remaining unmatched files against it, etc.
That makes sense. We should minimize memory footprint as much as possible.
If I understand correctly, you mean something like the following:

```
paths = matcher.MatchFiles()
unmatched_files = []
for _, path := range paths {
    fp := fingerprint(path)
    if fp in tracker {
        // continue as per current implementation
    } else {
        // set aside unmatched files for later matching
        unmatched_files = append(unmatched_files, path)
    }
}
...
// Open archived filesets one by one
for i := 0; i < CONFIGURED_LIMIT; i++ {
    archives := persister.Get(fmt.Sprintf("knownFilesArchive%d", i))
    // loop through unmatched_files;
    // if a file is in archives, remove it from archives, update the
    // archive, and read from the known offset.
}
// all the remaining unmatched_files will be treated as new files
```
Correct me if I'm wrong @djaglowski
We ensure to access storage in an incremental manner and minimize memory footprint.
FWIW, we respect `exclude_older_than` while matching files and only include recently updated files.
@VihasMakwana that looks about like I was imagining. (We could in theory also invert the in-memory search pattern so that it's the same as for storage, but this can be a considered later.)
@djaglowski Thanks!
I’m leaning towards creating separate, incremental PRs. Here’s a general outline:
PR 1:
PR 2:
Final PR: … `fileconsumer`.

@VihasMakwana that plan sounds good. Thanks for taking this on.
@djaglowski If you have a moment, I’ve made further progress on the second PR. The first PR is currently under review, and I’ve created the second draft PR from the branch of the first PR.
Hi @djaglowski
I'd like your opinion on something. Currently, I've placed the reader creation for unmatched files within the archive module. However, it seems to be complicating things. From what I understand, it's better to keep the reader logic separate from other modules. What do you think?
I'll update the PR if you agree. Let me know
Component(s)
pkg/stanza
What happened?
Description
Old logs could be collected again when `exclude_older_than` is enabled.

Steps to Reproduce
put in a for loop to generate new logs every few seconds (longer than the `exclude_older_than` duration), i.e.:

From the collector logs:
Basically, if a file isn't updated for the `exclude_older_than` duration plus 3 poll intervals, its fingerprint is removed from the `manager`'s known file list; later, when new content is added to the file, it is considered to be a new file and will be read from the beginning again.

Expected Result
only new logs are collected
Actual Result
old logs are collected again
Collector version
v0.99.0
Environment information
Environment
OS: (e.g., "Ubuntu 20.04")
Compiler (if manually compiled): (e.g., "go 14.2")
OpenTelemetry Collector configuration
Log output
No response
Additional context
`pkg/stanza` was updated with the previous fix https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/32683 included.