vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

`file` source: `checksum` fingerprint is not correct with gzipped files #13193

Open hhromic opened 2 years ago

hhromic commented 2 years ago


Problem

The `file` source component does not read gzipped files correctly for the purpose of fingerprinting with the `checksum` strategy. In particular, it appears to count lines in the compressed data rather than the decompressed data. As a result, when the compressed data contains no newline characters, or fewer newlines than requested in the `lines` configuration, Vector refuses to process the file with a "file too small for fingerprinting" error.
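The mismatch is easy to reproduce outside Vector by counting newline bytes on both sides of the compression boundary. A minimal sketch, assuming a sample file built like the attached one:

```shell
# Recreate a file like the attached input.txt.gz: 200 lines of "line X"
seq -f 'line %g' 1 200 | gzip -c > input.txt.gz

# Newline bytes in the raw compressed stream (what the fingerprinter currently sees):
tr -cd '\n' < input.txt.gz | wc -c

# Newline count in the decompressed content (what it arguably should see):
gzip -dc input.txt.gz | tr -cd '\n' | wc -c
```

On this kind of sample the raw gzip stream contains only a handful of incidental `0x0a` bytes, while the decompressed content contains all 200 newlines.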

Consider the following Vector pipeline configuration:

sources:
  file:
    type: file
    include:
      - input.txt.gz
    fingerprint:
      strategy: checksum
      ignored_header_bytes: 0  # the docs say this is optional but Vector disagrees
      lines: ${FINGERPRINT_LINES:-1}  # this is the default in Vector
    ignore_checkpoints: true  # just so testing is easily repeatable
    data_dir: data/  # make sure this directory exists where you run the pipeline
sinks:
  console:
    type: console
    inputs: ["file"]
    encoding:
      codec: json

The sample input.txt.gz file attached below contains 200 lines of text of the form `line X`; the attached input.txt file is its decompressed version, for further testing. I am also including an input2.txt.gz file with just 10 lines, plus its decompressed version input2.txt, for a further demonstration below.

If you run the above Vector pipeline using the input.txt.gz file and the default lines (1), you obtain:

$ vector-0.22.2/bin/vector -c vector.yaml
2022-06-16T18:09:54.812109Z  INFO vector::app: Log level is enabled. level="vector=info,codec=info,vrl=info,file_source=info,tower_limit=trace,rdkafka=info,buffers=info,kube=info"
2022-06-16T18:09:54.812421Z  INFO vector::app: Loading configs. paths=["vector.yaml"]
2022-06-16T18:09:54.821652Z  INFO vector::topology::running: Running healthchecks.
2022-06-16T18:09:54.824489Z  INFO vector: Vector has started. debug="false" version="0.22.2" arch="x86_64" build_id="0024c92 2022-06-15"
2022-06-16T18:09:54.824924Z  INFO vector::app: API is disabled, enable by setting `api.enabled` to `true` and use commands like `vector top`.
2022-06-16T18:09:54.828046Z  INFO source{component_kind="source" component_id=file component_type=file component_name=file}: vector::sources::file: Starting file server. include=["input.txt.gz"] exclude=[]
2022-06-16T18:09:54.828748Z  INFO vector::topology::builder: Healthcheck: Passed.
2022-06-16T18:09:54.833847Z  INFO source{component_kind="source" component_id=file component_type=file component_name=file}:file_server: file_source::checkpointer: Loaded checkpoint data.
2022-06-16T18:09:54.834943Z  INFO source{component_kind="source" component_id=file component_type=file component_name=file}:file_server: vector::internal_events::file::source: Found new file to watch. file=input.txt.gz
{"file":"input.txt.gz","host":"lenny","message":"line 1","source_type":"file","timestamp":"2022-06-16T18:09:54.836595900Z"}
{"file":"input.txt.gz","host":"lenny","message":"line 2","source_type":"file","timestamp":"2022-06-16T18:09:54.836618400Z"}
{"file":"input.txt.gz","host":"lenny","message":"line 3","source_type":"file","timestamp":"2022-06-16T18:09:54.836627300Z"}
{"file":"input.txt.gz","host":"lenny","message":"line 4","source_type":"file","timestamp":"2022-06-16T18:09:54.836635800Z"}
...

This is correct and demonstrates that Vector can indeed transparently read gzipped files.

If we ask the fingerprinter to read 4 lines when computing the checksum:

$ FINGERPRINT_LINES=4 vector-0.22.2/bin/vector -c vector.yaml
2022-06-16T18:12:06.410580Z  INFO vector::app: Log level is enabled. level="vector=info,codec=info,vrl=info,file_source=info,tower_limit=trace,rdkafka=info,buffers=info,kube=info"
2022-06-16T18:12:06.410728Z  INFO vector::app: Loading configs. paths=["vector.yaml"]
2022-06-16T18:12:06.412588Z  INFO vector::topology::running: Running healthchecks.
2022-06-16T18:12:06.412743Z  INFO vector::topology::builder: Healthcheck: Passed.
2022-06-16T18:12:06.412851Z  INFO vector: Vector has started. debug="false" version="0.22.2" arch="x86_64" build_id="0024c92 2022-06-15"
2022-06-16T18:12:06.412925Z  INFO vector::app: API is disabled, enable by setting `api.enabled` to `true` and use commands like `vector top`.
2022-06-16T18:12:06.413067Z  INFO source{component_kind="source" component_id=file component_type=file component_name=file}: vector::sources::file: Starting file server. include=["input.txt.gz"] exclude=[]
2022-06-16T18:12:06.418885Z  INFO source{component_kind="source" component_id=file component_type=file component_name=file}:file_server: file_source::checkpointer: Loaded checkpoint data.
2022-06-16T18:12:06.419153Z  WARN source{component_kind="source" component_id=file component_type=file component_name=file}:file_server: vector::internal_events::file::source: Currently ignoring file too small to fingerprint. file=input.txt.gz

It can be seen that Vector cannot fingerprint the file due to it being "too small".

However, if we configure the fingerprinter to read 3 lines, then it works again:

$ FINGERPRINT_LINES=3 vector-0.22.2/bin/vector -c vector.yaml
2022-06-16T18:13:16.588113Z  INFO vector::app: Log level is enabled. level="vector=info,codec=info,vrl=info,file_source=info,tower_limit=trace,rdkafka=info,buffers=info,kube=info"
2022-06-16T18:13:16.588256Z  INFO vector::app: Loading configs. paths=["vector.yaml"]
2022-06-16T18:13:16.590364Z  INFO vector::topology::running: Running healthchecks.
2022-06-16T18:13:16.590749Z  INFO vector: Vector has started. debug="false" version="0.22.2" arch="x86_64" build_id="0024c92 2022-06-15"
2022-06-16T18:13:16.590866Z  INFO vector::app: API is disabled, enable by setting `api.enabled` to `true` and use commands like `vector top`.
2022-06-16T18:13:16.596942Z  INFO vector::topology::builder: Healthcheck: Passed.
2022-06-16T18:13:16.598670Z  INFO source{component_kind="source" component_id=file component_type=file component_name=file}: vector::sources::file: Starting file server. include=["input.txt.gz"] exclude=[]
2022-06-16T18:13:16.602360Z  INFO source{component_kind="source" component_id=file component_type=file component_name=file}:file_server: file_source::checkpointer: Loaded checkpoint data.
2022-06-16T18:13:16.603006Z  INFO source{component_kind="source" component_id=file component_type=file component_name=file}:file_server: vector::internal_events::file::source: Found new file to watch. file=input.txt.gz
{"file":"input.txt.gz","host":"lenny","message":"line 1","source_type":"file","timestamp":"2022-06-16T18:13:16.603910200Z"}
{"file":"input.txt.gz","host":"lenny","message":"line 2","source_type":"file","timestamp":"2022-06-16T18:13:16.603925Z"}
{"file":"input.txt.gz","host":"lenny","message":"line 3","source_type":"file","timestamp":"2022-06-16T18:13:16.603932700Z"}
{"file":"input.txt.gz","host":"lenny","message":"line 4","source_type":"file","timestamp":"2022-06-16T18:13:16.603940300Z"}
...

Where is this magic number 3 coming from? If we examine the hexdump of the compressed input.txt.gz file (GitHub won't color here 😒):

$ hexdump -C input.txt.gz  | grep --color=auto 0a
00000020  09 bc fc b3 20 07 06 04  f5 1f 0a 02 67 e8 e8 65  |.... .......g..e|
00000050  00 05 29 50 c1 0a 58 d0  02 17 bc 8a 57 fd bb f0  |..)P..X.....W...|
00000090  03 6f e0 0d bf 0a bc 81  37 f0 06 de c0 9b 78 13  |.o......7.....x.|
000000a0  6f e2 4d bc 89 37 fd 6e  f1 26 de c4 9b 78 0b 6f  |o.M..7.n.&...x.o|

It can be seen that the compressed file happens to contain exactly three newline characters (`\x0a`), which explains the behaviour.

We noticed that if a gzipped file contains no newline characters (no matter how big it is), the fingerprinter always reports the file as "too small" and Vector never processes it. This is how we discovered this issue. We tried setting `lines: 0`, with no joy. The second file, input2.txt.gz, is an example without newlines that Vector cannot process because of this.

In summary, we believe the `lines` configuration should operate on the decompressed data; otherwise it makes little sense, since compressed data is binary rather than text-based.

Configuration

No response

Version

0.22.2 and tested back as far as 0.17.3

Debug Output

No response

Example Data

input.txt.gz input.txt input2.txt.gz input2.txt

Additional Context

No response

References

No response

hhromic commented 1 year ago

Hi @jszwedko ! I just checked the current Vector version 0.25.1 and unfortunately I can confirm that this is still an issue :(

Do you think the team would have some bandwidth to check on this issue? We are unable to reliably process compressed files using Vector due to this problem :( In our humble opinion, this is quite a critical bug in the file source. Apologies for nagging! πŸ™ˆ

bruceg commented 1 year ago

I wonder if a potential relatively easy solution would be to allow configuring fingerprint.bytes instead of fingerprint.lines. With this, the source would ignore newlines and just read that many bytes for the checksum. This would detect when a new compressed file is added to the path.
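For illustration, such a config might look like the following (the `bytes` key here is hypothetical, sketching the proposal, not an existing option):

```yaml
fingerprint:
  strategy: checksum
  bytes: 256  # hypothetical option: checksum the first 256 raw bytes, ignoring newlines
  ignored_header_bytes: 0
```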

Unfortunately, this proposal would not detect when a previously-read plain text file has been compressed, thus having the same checksum. Doing a checksum of the data within the file is a much more involved modification.

hhromic commented 1 year ago

That's an interesting idea. Perhaps the fingerprinter could detect when uncompressed/compressed files are being used, for example with a simple file extension heuristic, and switch to fingerprint.lines or fingerprint.bytes accordingly? At least until a more robust/consistent solution based on the content regardless of compression can be devised.

I think the source should have the same semantics for the fingerprinter, no matter if the files are compressed or not, for a consistent user experience and to avoid "surprising behaviour". However I understand doing it right is a larger fish to fry and the above suggestion might be enough in the meantime.

naegelin commented 1 year ago

Confirming this is still an issue on vector 0.28.0

max-yan commented 1 year ago

(https://github.com/vectordotdev/vector/pull/6338) it's sad, I can't read small gzip files. Is it possible to disable fingerprints at all?

jszwedko commented 1 year ago

(#6338) it's sad, I can't read small gzip files. Is it possible to disable fingerprints at all?

Yes, you can use the `device_and_inode` fingerprint strategy instead of the `checksum` one.
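For example, adapting the config from the issue description (this strategy is documented for the `file` source):

```yaml
sources:
  file:
    type: file
    include:
      - input.txt.gz
    fingerprint:
      strategy: device_and_inode
```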

max-yan commented 1 year ago

> > (#6338) it's sad, I can't read small gzip files. Is it possible to disable fingerprints at all?
>
> Yes, you can use the `device_and_inode` fingerprint strategy instead of the `checksum` one.

With `remove_after_secs = 1`, new files have the same inode.

jszwedko commented 1 year ago

> > > (#6338) it's sad, I can't read small gzip files. Is it possible to disable fingerprints at all?
> >
> > Yes, you can use the `device_and_inode` fingerprint strategy instead of the `checksum` one.
>
> With `remove_after_secs = 1`, new files have the same inode.

Inode reuse is a complication here unfortunately πŸ™

hhromic commented 1 year ago

@jszwedko nice to see some activity in this issue! :) I'm wondering if Bruce's proposal above, and my further thoughts on it, could be considered at least as a temporary solution to this problem? As a further addition to my own comment: maybe simpler than a heuristic is to just offer both options (as mutually exclusive) and let the user choose the type of fingerprinter (bytes-based or line-based) for their use case.

This is still an important issue for us; in the meantime we have just been hoping for this bug not to trigger. Thanks for all your work on Vector anyway, it is a fundamental piece of tech in our software stacks!

jszwedko commented 1 year ago

Hey!

I could see introducing that feature, generally, to allow fingerprinting to be based on bytes rather than lines, but I am a bit concerned about the caveat that Bruce mentioned:

> Unfortunately, this proposal would not detect when a previously-read plain text file has been compressed, thus having the same checksum. Doing a checksum of the data within the file is a much more involved modification.

This could be a common occurrence during file rotation, where previously plaintext files get compressed. To accurately fingerprint them, Vector would need to read the head of the file, decompressed, to compare with the fingerprint from before the file was rotated.

I think the better solution is to have Vector uncompress the head of the file to fingerprint it.
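As a rough sketch of that idea, checksumming the decompressed head yields the same fingerprint for a file before and after it is gzipped (using `cksum` here as a stand-in for Vector's internal checksum, and sample filenames from this issue):

```shell
# Build a plain file and a gzipped copy of it
seq -f 'line %g' 1 200 > input.txt
gzip -c input.txt > input.txt.gz

# Fingerprinting the decompressed head gives the same value for both files,
# so a rotated-then-compressed file would still be recognized
head -n 1 input.txt | cksum
gzip -dc input.txt.gz | head -n 1 | cksum
```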

hhromic commented 1 year ago

> As this might be a common occurrence during file rotation where previously plaintext files would be compressed. To accurately fingerprint them, Vector would need to read the head of the file, uncompressed, to compare with the fingerprint from before the file was rotated.

Oh, I didn't realise until now what Bruce really meant there 🀦. I see now that the issue would be plain text files previously processed by Vector that get compressed during rotation, and thus should still be ignored by Vector after compression.

If I understand correctly, given that Vector does not currently fingerprint based on the content of compressed data, this behaviour (not detecting a plaintext file that gets compressed) should already be happening in Vector today? Especially considering the bug reported in this issue, where it is clear that the fingerprinter acts on the raw compressed data.

In other words, as of today, users should not mix compressed and uncompressed versions of already-processed files in the same `file` source. Users should use the `include` or `exclude` configuration to avoid this situation anyway, because even with the line-based `checksum` or the `device_and_inode` fingerprinter, the issue is the same.
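For instance, a source reading live plaintext logs can explicitly exclude rotated compressed copies (the paths here are hypothetical, for illustration only):

```yaml
sources:
  file:
    type: file
    include:
      - /var/log/app/*.log
    exclude:
      - /var/log/app/*.gz
```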

Therefore, I think the suggestion from Bruce (adding a bytes-based checksum fingerprinter) would actually not change this behaviour and thus should not impact existing users anyway. In our use case, files are never presented in plain text first.

TL;DR: I do think Bruce's suggestion would help us (and others processing compressed files) without hurting existing users.

jszwedko commented 1 year ago

> TL;DR: I do think Bruce's suggestion would help us (and others processing compressed files) without hurting existing users.

πŸ‘ agreed, it does seem to be an improvement even if it isn't a complete fix.