[receiver/filelogreceiver] offset could be taken as additional factor for fingerprinting

sumo-drosiek commented 2 years ago

Describe the bug We have a situation when rotated file begins the same way like the previous one (so fingerprint is the same), but new file is much behind the offset. So, while we are writing to the new file, nothing happens until we reach the offset. Such cases could be handled by filelog receiver. There can happen if there are two same fingerprinted files being read in parallel. My personal thought is to add some kind of hashes to fingerprint

Steps to reproduce

logs/file_1.txt

{"log": "THIS IS MY LOG FROM FILE1....................................................................\n", "stream": "stdout"}

logs/file_2.txt

{"log": "THIS IS MY LOG FROM FILE2\n", "stream": "stdout"}

config.yaml

receivers:
 filelog:
   include:
     - logs/file.txt
   fingerprint_size: 16
exporters:
 logging:
   logLevel: debug
service:
 pipelines:
   logs:
     receivers:
       - filelog
     exporters:
       - logging

run otc

./otelcol-contrib --config config.yaml 2>&1 | tee -a out2.log

steps:

cd logs
ln -s some_file.txt file.txt
cp file_1.txt some_file.txt
cp file_2.txt some_file.txt
echo '{"log": "THIS IS MY LOG FROM FILE2\n", "stream": "stdout"}' >> some_file.txt
echo '{"log": "THIS IS MY LOG FROM FILE2\n", "stream": "stdout"}' >> some_file.txt

What did you expect to see? All three logs (or at least an offset error)

What did you see instead? part of the log (starting from offset)

Body: HIS IS MY LOG FROM FILE2

What version did you use? v0.52.0

What config did you use? config.yaml

receivers:
  filelog:
    include:
      - logs/file.txt
    fingerprint_size: 16
exporters:
  logging:
    logLevel: debug
service:
  pipelines:
    logs:
      receivers:
        - filelog
      exporters:
        - logging

Environment OS: Ubuntu 22.04 LTS

Additional context N/A

djaglowski commented 2 years ago

This is tricky. I don't think we can use the offset as identifying information, since it changes over time. Possibly we can use it for secondary logic to invalidate fingerprint matches though.

I'm trying to identify an example where this could work while not actually duplicating logs in some cases as well.

Let's look at an example. (Let's say the fingerprint length is very short here, for simplicity.)

File 1: 
- Fingerprint = "foo\nbar\n"
- Offset = 100000 (we've clearly been consuming this file for a while)

File 2a:
- Fingerprint = "foo\n"
- Offset = 4 (we hit EOF after the first line)

File 2b:
- Fingerprint = "foo\nbar\n"
- Offset = 8 (we looked again later and hit OEF exactly at the full fingerprint)

Is it possible to say for sure at this point that we should read File 2a/b? What if File 1 is being rotated and we have so far only observed the first lines of a copy of it?

My personal thought is to add some kind of hashes to fingerprint

I'm not sure there's anything to hash that would not degrade our ability to match files based on their contents. What would the hash be based on? It needs to be deterministic, comparable, and independent of file path/name.

sumo-drosiek commented 2 years ago

I'm not sure there's anything to hash that would not degrade our ability to match files based on their contents. What would the hash be based on? It needs to be deterministic, comparable, and independent of file path/name.

I thought about storing hashes of first bytes of the file, e.g.: 1kb, 2kb, 4kb, 8kb, 16kb, etc so, if file is 1kb, we compare only first hash, but if it have move than 16kb, we compare all of the hashes between 1kb and 16kb. So this is similar fingerprint algorithm, but relies on hashes and size of the file.

Downside is that this is more complex and require more resources than current solution. And it doesn't resolve re-reading file which is already being rotated, however, we could made some more complex validation in case hashes are the same, but one file is much behind the second, so there is less hashes than for the other.

djaglowski commented 2 years ago

I think comparing the actual first bytes is equivalent to comparing hashes as you have suggested. The same cases can be handled, but the performance is better without hashing.

sumo-drosiek commented 2 years ago

I think comparing the actual first bytes is equivalent to comparing hashes as you have suggested.

hashes are smaller and can cover much bigger files than first bytes, but TBH, log files should have some kind of unique identifier e.g. timestamp, so at the end of the day I think current solution is good enough.

It just would be nice to get some warning in case of the offset being behind the file. It would help in investigation in case someone hit the issue.

djaglowski commented 2 years ago

hashes are smaller and can cover much bigger files than first bytes

Sure, there are performance pros/cons either way, but since the issue at hand is to handle additional cases I am pointing out that I see how the proposal would do that. Maybe I am missing it though.

in case of the offset being behind the file

Setting aside what should be done about this, it appears some refactoring will be necessary to detect this case. As is, we are deduplicating fingerprints before offsets are available for consideration. This codebase is overdue for some refactoring anyways. I will try to spend some time on it soon.

github-actions[bot] commented 1 year ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

receiver/filelog: @djaglowski

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions[bot] commented 1 year ago

This issue has been closed as inactive because it has been stale for 120 days with no activity.

open-telemetry / opentelemetry-collector-contrib

[receiver/filelogreceiver] offset could be taken as additional factor for fingerprinting #10481