vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

File source getting confused #5833

Closed mikehardenize closed 3 years ago

mikehardenize commented 3 years ago

Vector Version

https://packages.timber.io/vector/0.11.X/vector-0.11.1-1-musl.x86_64.rpm

Vector Configuration File

[sources.test_to_stackdriver_source]
  type         = "file"
  ignore_older = 86400
  include      = ["/var/log/test/*.log"]

[transforms.test_to_stackdriver_fix_host]
  type        = "add_fields"
  inputs      = ["test_to_stackdriver_source"]
  fields.host = "obfuscated-hostname"

[sinks.test_to_stackdriver_sink]
  type                 = "gcp_stackdriver_logs"
  inputs               = ["test_to_stackdriver_fix_host"]
  credentials_path     = "/etc/vector/conf.d/.gcp-credentials"
  tls.ca_file          = "/etc/ssl/certs/ca-bundle.crt"
  log_id               = "test"
  project_id           = "obfuscated-projectid"
  resource.type        = "gce_instance"
  resource.instance_id = "obfuscated-instance_id"
  resource.zone        = "us-east4-a"

Expected Behavior

The file source should tail each log file independently, without spurious fingerprint warnings.

Actual Behavior

https://packages.timber.io/vector/0.10.X/vector-x86_64.rpm was working fine. I upgraded to https://packages.timber.io/vector/0.11.X/vector-0.11.1-1-musl.x86_64.rpm

And now I just get the following logged every second or so:

Jan 04 13:04:31 obfuscated-hostname vector[28275]: Jan 04 13:04:31.546  INFO source{component_kind="source" component_name=test_to_stackdriver_source component_type=file}:file_server: file_source::file_server: Watched file has been renamed. path="/var/log/test/test1.log" old_path="/var/log/test/test3.log"
Jan 04 13:04:31 obfuscated-hostname vector[28275]: Jan 04 13:04:31.546  INFO source{component_kind="source" component_name=test_to_stackdriver_source component_type=file}:file_server: file_source::file_server: More than one file has the same fingerprint. path="/var/log/test/test0.log" old_path="/var/log/test/test1.log"
Jan 04 13:04:31 obfuscated-hostname vector[28275]: Jan 04 13:04:31.546  INFO source{component_kind="source" component_name=test_to_stackdriver_source component_type=file}:file_server: file_source::file_server: Switching to watch most recently modified file. new_modified_time=SystemTime { tv_sec: 1609765456, tv_nsec: 317367352 } old_modified_time=SystemTime { tv_sec: 1609765441, tv_nsec: 787400489 }
Jan 04 13:04:31 obfuscated-hostname vector[28275]: Jan 04 13:04:31.547  INFO source{component_kind="source" component_name=test_to_stackdriver_source component_type=file}:file_server: file_source::file_server: More than one file has the same fingerprint. path="/var/log/test/test7.log" old_path="/var/log/test/test0.log"
Jan 04 13:04:31 obfuscated-hostname vector[28275]: Jan 04 13:04:31.547  INFO source{component_kind="source" component_name=test_to_stackdriver_source component_type=file}:file_server: file_source::file_server: More than one file has the same fingerprint. path="/var/log/test/test6.log" old_path="/var/log/test/test0.log"
Jan 04 13:04:31 obfuscated-hostname vector[28275]: Jan 04 13:04:31.547  INFO source{component_kind="source" component_name=test_to_stackdriver_source component_type=file}:file_server: file_source::file_server: More than one file has the same fingerprint. path="/var/log/test/test5.log" old_path="/var/log/test/test0.log"
Jan 04 13:04:31 obfuscated-hostname vector[28275]: Jan 04 13:04:31.548  INFO source{component_kind="source" component_name=test_to_stackdriver_source component_type=file}:file_server: file_source::file_server: More than one file has the same fingerprint. path="/var/log/test/test4.log" old_path="/var/log/test/test0.log"
Jan 04 13:04:31 obfuscated-hostname vector[28275]: Jan 04 13:04:31.548  INFO source{component_kind="source" component_name=test_to_stackdriver_source component_type=file}:file_server: file_source::file_server: More than one file has the same fingerprint. path="/var/log/test/test2.log" old_path="/var/log/test/test0.log"
Jan 04 13:04:31 obfuscated-hostname vector[28275]: Jan 04 13:04:31.549  INFO source{component_kind="source" component_name=test_to_stackdriver_source component_type=file}:file_server: file_source::file_server: More than one file has the same fingerprint. path="/var/log/test/test3.log" old_path="/var/log/test/test0.log"
Jan 04 13:04:31 obfuscated-hostname vector[28275]: Jan 04 13:04:31.549  INFO source{component_kind="source" component_name=test_to_stackdriver_source component_type=file}:file_server: file_source::file_server: Switching to watch most recently modified file. new_modified_time=SystemTime { tv_sec: 1609765466, tv_nsec: 536047331 } old_modified_time=SystemTime { tv_sec: 1609765456, tv_nsec: 317367352 }

There's a lot of "More than one file has the same fingerprint" in there. These log files do not contain the same content.

I downgraded to the old version, and these log entries stopped.

jamtur01 commented 3 years ago

@lukesteensen can you please take a look? This seems related to the fingerprinting change.

lukesteensen commented 3 years ago

Hey @mikehardenize, apologies for the slow response here. I have a few questions that'll help us dig into this:

  1. Can you clarify if the data you're getting out of the file source is changing or if it's just these log messages that you're seeing?
  2. Could you share the first line or two of these files when you're seeing this log message?
  3. How are your log files being written to and/or rotated? i.e. is test0.log a rotated version of test.log, test1.log the previous version of test0.log, etc, or are these independent log streams?
mikehardenize commented 3 years ago

I can answer your third question. They are independent log streams. They are all being written to at the same time. I can't answer your other two questions as I downgraded from the broken version a couple of weeks ago and can't upgrade again until this is fixed.

nsteinmetz commented 3 years ago

I'll follow up, as I'm still on the version with the issue.

Jan 21 10:40:30 crnt-d10-web vector[6373]: Jan 21 10:40:30.538  INFO source{component_kind="source" component_name=log_files component_type=file}:file_server: file_source::file_server: More than one file has the same fingerprint. path="/var/log/ufw.log" old_path="/var/log/kern.log"
Jan 21 10:41:39 crnt-d10-web vector[6373]: Jan 21 10:41:39.264  INFO source{component_kind="source" component_name=log_files component_type=file}:file_server: file_source::file_server: More than one file has the same fingerprint. path="/var/log/apt/term.log" old_path="/var/log/apt/history.log"

The first lines are the same for both files, even though the rest of each file differs in content and size.

debian@crnt-d10-web:~$ ls -alh /var/log/apt/*.log
-rw-r--r-- 1 root root 1.1K Jan 18 12:10 /var/log/apt/history.log
-rw-r----- 1 root adm  2.9K Jan 18 12:10 /var/log/apt/term.log

debian@crnt-d10-web:~$ ls -alh /var/log/ufw.log /var/log/kern.log
-rw-r----- 1 root adm 4.8M Jan 21 10:45 /var/log/kern.log
-rw-r----- 1 root adm 4.8M Jan 21 10:45 /var/log/ufw.log

debian@crnt-d10-web:~$ diff  -rq /var/log/kern.log /var/log/ufw.log
Files /var/log/kern.log and /var/log/ufw.log differ

For the last pair, it may be that one is synced with the other somehow.

lukesteensen commented 3 years ago

First lines are the same for both files

This is probably the key. For context, we changed our default strategy for uniquely identifying files from looking at the first 256 bytes of the file to looking at the first line in the file. The intent was to work better for both short log lines and long log lines that put unique info like timestamps at the end.

In this case it seems that we could be taking less information into account if you have an initial non-unique, short header line at the beginning of each file. We do have the ignored_header_bytes configuration option, which is designed to let you skip over non-unique constants that are present at the beginning of each file. If you have a known header like that, I would recommend using that option to skip it. If it's not a known header but you're still seeing this issue, I'd be curious to hear more about how these files are written and see what we can come up with to better identify them.
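For the reporter's config above, that might look like the following sketch. Assumptions: the table was spelled `fingerprinting` in the 0.11-era file source docs (check the docs for your version), and `64` is a placeholder byte count you'd set to the length of your actual header.

```toml
[sources.test_to_stackdriver_source]
  type         = "file"
  ignore_older = 86400
  include      = ["/var/log/test/*.log"]

  # Skip a fixed-size, non-unique header before checksumming the first line.
  # 64 is a placeholder; set it to your real header length.
  fingerprinting.strategy             = "checksum"
  fingerprinting.ignored_header_bytes = 64
```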

nsteinmetz commented 3 years ago

Hmm, I wonder whether this setting would just move the issue a few bytes further along. It might work sometimes (when the content differs) and fail when the content is the same :thinking:

Should I switch to the device_and_inode strategy instead of checksum?

lukesteensen commented 3 years ago

@nsteinmetz that option is intended for cases where there is a known, fixed-size header that you can skip over reliably. If that's not the case for you, then I agree it's probably not the best fit.

I'd be able to offer some more specific advice if I better understood the source of your lack of uniqueness here. The only case in which your behavior should have degraded from 0.10 to 0.11 is one where you had uniqueness within the first 256 bytes of each log file, but not within the first line of each log file. That implies some kind of short line at the beginning of each file, likely without a precise timestamp or other source of uniqueness. My assumption was that it could be some kind of fixed header.

If you simply have a lot of log lines that are identical, then yes you'd likely be better served by the device_and_inode strategy. It has a few of its own edge cases, which is why it's not the default, but in most cases, it should work quite well. If you're able to give any more detail on your situation, however, I'd love to try to improve our checksums to handle it gracefully.
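A minimal sketch of that switch, applied to the reporter's source (again assuming the 0.11-era `fingerprinting` table spelling; check your version's docs):

```toml
[sources.test_to_stackdriver_source]
  type    = "file"
  include = ["/var/log/test/*.log"]

  # Identify files by device and inode instead of a content checksum.
  # Known edge cases: a fingerprint can be reused if a file is deleted
  # and its inode recycled, and changes if a file moves across devices.
  fingerprinting.strategy = "device_and_inode"
```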

nsteinmetz commented 3 years ago

@lukesteensen thanks for the clarification.

As these are system logs (i.e. generated by the Debian core system or by software I added, like ufw), I don't know much about the logic behind these files and their lifecycles.

That also makes me think that if there is duplication between the two, maybe I can keep only one of them and avoid double log ingestion that adds no extra value. Or maybe just exclude all of them if they are useless 😉

Just took some time to look at kern.log and ufw.log. From /etc/rsyslog.d/20-ufw.conf:

# Log kernel generated UFW log messages to file
:msg,contains,"[UFW " /var/log/ufw.log

# Uncomment the following to stop logging anything that matches the last rule.
# Doing this will stop logging kernel generated UFW log messages to the file
# normally containing kern.* messages (eg, /var/log/kern.log)
#& stop

So I can exclude one of the two or tweak rsyslog configuration.
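On the Vector side, excluding one of the pair is a one-line change with the file source's `exclude` glob option. A sketch using the commenter's `log_files` source name (the `include` pattern here is a guess, not from the thread):

```toml
[sources.log_files]
  type    = "file"
  include = ["/var/log/**/*.log"]

  # Skip ufw.log: per the rsyslog rule above, its lines are
  # duplicated into kern.log anyway.
  exclude = ["/var/log/ufw.log"]
```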

mikehardenize commented 3 years ago

The first line of each of my log files is the same. They're generated by a custom application written in house. For example, looking at each of them now they start with:

2021/01/26 03:10:01 Google PubSub enabled. Project hardenize-ct-production, Topic ctmon

Subsequent lines are different. The date/time is the same for all of them, as the application which generates them is started for each of them at the same time using cron.

mikehardenize commented 3 years ago

It sounds like I could either use device_and_inode, or alternatively modify my application so that the first line it logs when run is unique.

However, I really think that this should be solved by the Vector software automatically. I guarantee there will be a tonne of log files out there that have non-unique first lines, and this would trip up a lot of people.

cnekmp commented 3 years ago

Just in case:

I was getting a spam of the following messages in "journalctl -fu vector". It started happening after I added two files to the same monitoring file source ["SERVER.Notification.0.log", "SERVER.Notification.0.log"]. Separating them back into different file sources solved the issue:

info: 20210610 15:00:17 host vector: Jun 10 15:00:17.947  INFO source{component_kind="source" component_name=ServSysLogFile component_type=file}:file_server: file_source::file_server: Switching to watch most recently modified file. new_modified_time=SystemTime { tv_sec: 1623322805, tv_nsec: 612093650 } old_modified_time=SystemTime { tv_sec: 1623276008, tv_nsec: 410451778 }
info: 20210610 15:00:20 host vector: Jun 10 15:00:19.998  INFO source{component_kind="source" component_name=ServSysLogFile component_type=file}:file_server: file_source::file_server: Watched file has been renamed. path="/SERVER.System.0.log" old_path="/SERVER.Business.0.log"
info: 20210610 15:00:20 host vector: Jun 10 15:00:20.000  INFO source{component_kind="source" component_name=ServSysLogFile component_type=file}:file_server: file_source::file_server: More than one file has the same fingerprint. path="/SERVER.Business.0.log" old_path="/SERVER.System.0.log"
info: 20210610 15:00:20 host vector: Jun 10 15:00:20.000  INFO source{component_kind="source" component_name=ServSysLogFile component_type=file}:file_server: file_source::file_server: Switching to watch most recently modified file. new_modified_time=SystemTime { tv_sec: 1623322818, tv_nsec: 910057722 } old_modified_time=SystemTime { tv_sec: 1623276008, tv_nsec: 410451778 }
info: 20210610 15:00:22 host vector: Jun 10 15:00:22.051  INFO source{component_kind="source" component_name=ServSysLogFile component_type=file}:file_server: file_source::file_server: Watched file has been renamed. 
mikehardenize commented 3 years ago

FYI, this is still an ongoing issue for me which is preventing me from upgrading.

jszwedko commented 3 years ago

Thanks for the bump @mikehardenize. I agree that we need to address this. There is ignored_header_bytes, but that seems insufficient for your case, where you commonly have the same log line appearing at the head of a file.

Do you think reading a configurable number of lines to fingerprint the file would obviate the issue for you?

mikehardenize commented 3 years ago

That would hide the issue for my particular use case, yes.

jszwedko commented 3 years ago

@ktff I've assigned this one to you for when you have some time. It sounds like the change here should just be to allow the number of lines used for fingerprinting to be configurable.

cc/ @lukesteensen for any additional thoughts.

ktff commented 3 years ago

An option like that sounds good, but I'll see if we can also do it automatically. Since we do detect these situations, it may be possible to just take additional lines, up to a limit, until we can distinguish between the files.

jszwedko commented 3 years ago

An option like that sounds good, but I'll see if we can also do it automatically. Since we do detect these situations, it may be possible to just take additional lines, up to a limit, until we can distinguish between the files.

🤔 that would be nice, but I'm not seeing how we could do that without reading some maximum number of lines to calculate the checksum at fingerprinting time, since we can't go back and read more later if we find two similar-looking files. I'm curious what you find, though. At the least, adding it as a config option seems like it'd work for this case.

ktff commented 3 years ago

since we can't go back and read more later if we find two similar-looking files. I'm curious what you find, though.

True, I don't see any simple way of changing that, so I've just added the option.
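For readers landing here later, a sketch of what that configuration looks like, assuming the option shipped as `fingerprint.lines` on the checksum strategy (newer Vector versions renamed the `fingerprinting` table to `fingerprint`; check the file source docs for your version):

```toml
[sources.my_logs]
  type    = "file"
  include = ["/var/log/test/*.log"]

  # Checksum the first 2 lines instead of just the first, so files that
  # share an identical first line still get distinct fingerprints.
  fingerprint.strategy = "checksum"
  fingerprint.lines    = 2
```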