Closed mikehardenize closed 3 years ago
@lukesteensen can you please take a look - seems like fingerprint change related.
Hey @mikehardenize, apologies for the slow response here. I have a few questions that'll help us dig into this:
Is test0.log a rotated version of test.log, test1.log the previous version of test0.log, etc., or are these independent log streams?
I can answer your third question. They are independent log streams. They are all being written to at the same time. I can't answer your other two questions as I downgraded from the broken version a couple of weeks ago and can't upgrade again until this is fixed.
I will follow up as I'm still on the version with the issue.
Jan 21 10:40:30 crnt-d10-web vector[6373]: Jan 21 10:40:30.538 INFO source{component_kind="source" component_name=log_files component_type=file}:file_server: file_source::file_server: More than one file has the same fingerprint. path="/var/log/ufw.log" old_path="/var/log/kern.log"
Jan 21 10:41:39 crnt-d10-web vector[6373]: Jan 21 10:41:39.264 INFO source{component_kind="source" component_name=log_files component_type=file}:file_server: file_source::file_server: More than one file has the same fingerprint. path="/var/log/apt/term.log" old_path="/var/log/apt/history.log"
First lines are the same for both files, even if the rest of the file is different in terms of size.
debian@crnt-d10-web:~$ ls -alh /var/log/apt/*.log
-rw-r--r-- 1 root root 1.1K Jan 18 12:10 /var/log/apt/history.log
-rw-r----- 1 root adm 2.9K Jan 18 12:10 /var/log/apt/term.log
debian@crnt-d10-web:~$ ls -alh /var/log/ufw.log /var/log/kern.log
-rw-r----- 1 root adm 4.8M Jan 21 10:45 /var/log/kern.log
-rw-r----- 1 root adm 4.8M Jan 21 10:45 /var/log/ufw.log
debian@crnt-d10-web:~$ diff -rq /var/log/kern.log /var/log/ufw.log
Files /var/log/kern.log and /var/log/ufw.log differ
For the last pair, it may be that one is synced with the other somehow.
First lines are the same for both files
This is probably the key. For context, we changed our default strategy for uniquely identifying files from looking at the first 256 bytes of the file to looking at the first line in the file. The intent was to work better for both short log lines and long log lines that put unique info like timestamps at the end.
In this case it seems that we could be taking less information into account if you have an initial, non-unique, short header line at the beginning of each file. We do have the ignored_header_bytes
configuration option, which is designed to let you skip over non-unique constants that are present at the beginning of each file. If you have a known header like that, I would recommend using that option to skip it. If it's not a known header but you're still seeing this issue, I'd be curious to hear more about how these files are written and see what we can come up with to better identify them.
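For illustration, a minimal sketch of what that could look like in a file source. The source name, include pattern, and byte count here are placeholders, and the exact nesting of the fingerprint options may differ between Vector versions:

[sources.log_files]
type = "file"
include = ["/var/log/**/*.log"]
# Skip a fixed-size, non-unique header before computing the checksum fingerprint.
fingerprint.strategy = "checksum"
fingerprint.ignored_header_bytes = 64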
Hmmm, I wonder if this setting won't just move the issue a few bytes further. Or sometimes it would work (when content is different) and sometimes not (when content is the same) :thinking:
Should I switch to the device_and_inode strategy instead of checksum?
@nsteinmetz that option is intended for cases where there is a known, fixed-sized header that you can skip over reliably. If that's not the case for you, then I agree it's probably not the best fit.
I'd be able to offer some more specific advice if I better understood the source of your lack of uniqueness here. The only case in which your behavior should have degraded from 0.10 to 0.11 is one where you had uniqueness within the first 256 bytes of each log file, but not within the first line of each log file. That implies some kind of short line at the beginning of each file, likely without a precise timestamp or other source of uniqueness. My assumption was that it could be some kind of fixed header.
If you simply have a lot of log lines that are identical, then yes you'd likely be better served by the device_and_inode strategy. It has a few of its own edge cases, which is why it's not the default, but in most cases it should work quite well. If you're able to give any more detail on your situation, however, I'd love to try to improve our checksums to handle it gracefully.
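For reference, switching strategies is a one-line change in the same fingerprint block. Again, the source name and include pattern are placeholders, and the exact option nesting depends on the Vector version:

[sources.log_files]
type = "file"
include = ["/var/log/**/*.log"]
# Identify files by device and inode instead of a content checksum.
# Note: this has its own edge cases (e.g. inode reuse after rotation),
# which is why it is not the default.
fingerprint.strategy = "device_and_inode"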
@lukesteensen thanks for the clarification.
As these are system logs (i.e. generated by the Debian core system or some software I added, like ufw), I don't know much about the logic behind these files and their lifecycles.
Which also makes me think that if there is some duplication between the two, maybe I can keep only one of them and avoid double log ingestion, which I guess adds no extra value. Or maybe just exclude all of them if they are useless 😉
Just took some time to look at kern.log and ufw.log. From /etc/rsyslog.d/20-ufw.conf:
# Log kernel generated UFW log messages to file
:msg,contains,"[UFW " /var/log/ufw.log
# Uncomment the following to stop logging anything that matches the last rule.
# Doing this will stop logging kernel generated UFW log messages to the file
# normally containing kern.* messages (eg, /var/log/kern.log)
#& stop
So I can either exclude one of the two or tweak the rsyslog configuration.
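On the Vector side, one of the duplicated files could also be dropped with the file source's exclude option. A minimal sketch, assuming a placeholder source name and include pattern:

[sources.log_files]
type = "file"
include = ["/var/log/**/*.log"]
# UFW messages are duplicated into kern.log by the rsyslog rule above
# (the "& stop" line is commented out), so skip ufw.log to avoid double ingestion.
exclude = ["/var/log/ufw.log"]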
The first line of each of my log files is the same. They're generated by a custom application written in house. For example, looking at each of them now they start with:
2021/01/26 03:10:01 Google PubSub enabled. Project hardenize-ct-production, Topic ctmon
Subsequent lines are different. The date/time is the same for all of them, as the application which generates them is started for each of them at the same time using cron.
It sounds like I could either use device_and_inode or modify my application so that the first line it logs when run is unique.
However, I really think that this should be solved by the Vector software automatically. I guarantee there will be a tonne of log files out there that have non-unique first lines, and this would trip up a lot of people.
Just in case:
I was getting a flood of the following messages in "journalctl -fu vector". It started to happen after I added two files to the same file source: ["SERVER.Notification.0.log", "SERVER.Notification.0.log"]. Separating them back into different file sources solved the issue:
info: 20210610 15:00:17 host vector: Jun 10 15:00:17.947 INFO source{component_kind="source" component_name=ServSysLogFile component_type=file}:file_server: file_source::file_server: Switching to watch most recently modified file. new_modified_time=SystemTime { tv_sec: 1623322805, tv_nsec: 612093650 } old_modified_time=SystemTime { tv_sec: 1623276008, tv_nsec: 410451778 }
info: 20210610 15:00:20 host vector: Jun 10 15:00:19.998 INFO source{component_kind="source" component_name=ServSysLogFile component_type=file}:file_server: file_source::file_server: Watched file has been renamed. path="/SERVER.System.0.log" old_path="/SERVER.Business.0.log"
info: 20210610 15:00:20 host vector: Jun 10 15:00:20.000 INFO source{component_kind="source" component_name=ServSysLogFile component_type=file}:file_server: file_source::file_server: More than one file has the same fingerprint. path="/SERVER.Business.0.log" old_path="/SERVER.System.0.log"
info: 20210610 15:00:20 host vector: Jun 10 15:00:20.000 INFO source{component_kind="source" component_name=ServSysLogFile component_type=file}:file_server: file_source::file_server: Switching to watch most recently modified file. new_modified_time=SystemTime { tv_sec: 1623322818, tv_nsec: 910057722 } old_modified_time=SystemTime { tv_sec: 1623276008, tv_nsec: 410451778 }
info: 20210610 15:00:22 host vector: Jun 10 15:00:22.051 INFO source{component_kind="source" component_name=ServSysLogFile component_type=file}:file_server: file_source::file_server: Watched file has been renamed.
FYI, this is still an ongoing issue for me which is preventing me from upgrading.
Thanks for the bump @mikehardenize. I agree that we need to address this. There is ignored_header_bytes, but that seems insufficient for your case, where you commonly have the same log line appearing at the head of a file.
Do you think reading a configurable n lines to fingerprint the file would obviate the issue for you?
That would hide the issue for my particular use case, yes.
@ktff I assigned this one to you when you have some time. It sounds like the change here should just be to allow the number of lines used for checkpointing to be configurable.
cc/ @lukesteensen for any additional thoughts.
An option like that sounds good, but I'll see if we can also do it automatically. Since we do detect these situations, it may be possible to just take additional lines, with a limit, until we can distinguish between the files.
🤔 that would be nice, but I'm not seeing how we could do that without reading some max number of lines to calculate the checksum when checkpointing since we can't go back and read more later if we find two similar looking files. I'm curious what you find though. At the least, adding it as a config option seems like it'd work for this case.
since we can't go back and read more later if we find two similar looking files. I'm curious what you find though.
True, I don't see any simple way of changing that, so I've just added the option.
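For anyone landing here later, the outcome is a configurable number of lines fed into the checksum. A minimal sketch of how that could look; the option name fingerprint.lines and the value are assumptions based on this thread, so check the file source docs for your Vector version:

[sources.log_files]
type = "file"
include = ["/var/log/**/*.log"]
fingerprint.strategy = "checksum"
# Hash the first two lines instead of just the first one, so files that
# share an identical first line can still be told apart.
fingerprint.lines = 2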
Vector Version
Vector Configuration File
Expected Behavior
It should work
Actual Behavior
https://packages.timber.io/vector/0.10.X/vector-x86_64.rpm was working fine. I upgraded to https://packages.timber.io/vector/0.11.X/vector-0.11.1-1-musl.x86_64.rpm
And now I just get the following logged every second or so:
There's a lot of "More than one file has the same fingerprint" in there. These log files do not contain the same content.
I downgraded to the old version, and these log entries stopped.