Did a bit more thinking on this one. Changing to `>=` is a problem for folks that aren't using the delete mode, as files with the same timestamp could be read over and over. A better solution would be to extend the sincedb key from just the `last_modified` timestamp to a combination of `last_modified` and the filename.
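In code, that combined key would look something like this (a minimal sketch under assumed names; today's sincedb only stores a timestamp):

```ruby
require 'set'
require 'time'

# Sketch: key the "already processed" record on last_modified PLUS the object
# key, so files sharing one timestamp are each seen exactly once.
seen = Set.new
sincedb_key = ->(last_modified, key) { "#{last_modified.iso8601}/#{key}" }

# The three same-timestamp files from the scenario below.
objects = [
  [Time.parse('2015-01-01 08:00:00Z'), 'file1.gz'],
  [Time.parse('2015-01-01 08:00:00Z'), 'file2.gz'],
  [Time.parse('2015-01-01 08:00:00Z'), 'file3.gz'],
]

objects.each do |last_modified, key|
  next if seen.include?(sincedb_key.call(last_modified, key))
  # ... download and process the object here ...
  seen << sincedb_key.call(last_modified, key)
end
```

Left unpruned, that set grows with every object; the refinement in the next comment keeps only the names at the newest timestamp.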
I have this same issue as well. The problem doesn't occur when files with the same timestamp are processed in the same run of `process_files`. It only happens when, say, `file1.gz` and `file2.gz` are processed and the sincedb timestamp is set to `2015-01-01 08:00:00Z` in one call to `process_files`, and `interval` seconds later `file3.gz` (carrying that same timestamp) is processed in the next call to `process_files`. In that case `file3.gz` is skipped because its timestamp is not strictly greater than (`>`) the sincedb timestamp.
Only by keeping the full list of already processed files that share the `last_modified` timestamp can the next run of `process_files` compare against that list and skip those files (a code sketch of this bookkeeping follows the example below).
**Example**

First call to `process_files`:

| Timestamp | Log Object Key |
|---|---|
| 2016-05-12 14:26:00 | 2016-05-12-12-25-12-D68E3F15AAE3DC9A |
| 2016-05-12 14:26:00 | 2016-05-12-12-25-15-58C0CBDF807CD34B |
| 2016-05-12 14:26:00 | 2016-05-12-12-25-26-5FC01F0ED4E5A430 |

Keep the list (in sincedb?) of already processed files, in this case all three:

`2016-05-12 14:26:00 2016-05-12-12-25-12-D68E3F15AAE3DC9A,2016-05-12-12-25-15-58C0CBDF807CD34B,2016-05-12-12-25-26-5FC01F0ED4E5A430`
Next call to `process_files` (`interval` seconds later): find files with timestamp `>= 2016-05-12 14:26:00` and skip any whose name matches an entry on the list of already processed files:

| Timestamp | Log Object Key | Skip/Process |
|---|---|---|
| 2016-05-12 14:26:00 | 2016-05-12-12-25-12-D68E3F15AAE3DC9A | skip |
| 2016-05-12 14:26:00 | 2016-05-12-12-25-15-58C0CBDF807CD34B | skip |
| 2016-05-12 14:26:00 | 2016-05-12-12-25-26-5FC01F0ED4E5A430 | skip |
| 2016-05-12 14:26:00 | 2016-05-12-12-25-28-FFC2D6CB851FA917 | process |
| 2016-05-12 14:26:00 | 2016-05-12-12-25-31-213B56F74A224293 | process |
| 2016-05-12 14:26:00 | 2016-05-12-12-25-43-BAA23FC4D63F463B | process |
| 2016-05-12 14:26:00 | 2016-05-12-12-25-59-327A55EDBF5CCDF2 | process |

The new list: all 7 files with timestamp `2016-05-12 14:26:00`.

This is an acceptable solution because the list of filenames to save and compare against stays small in practice: only the names sharing the most recent timestamp need to be kept.
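A minimal sketch of that bookkeeping, assuming the sincedb stores the newest timestamp plus the object names already seen at it (class and method names are illustrative, not a real patch):

```ruby
# Sketch: remember only the newest last_modified and the keys seen at that
# exact timestamp, so the stored name list stays small.
class SinceDB
  def initialize
    @last_modified = Time.at(0)
    @keys_at_last  = []
  end

  # Process if strictly newer, or equally new but not yet on the name list.
  def process?(last_modified, key)
    return true  if last_modified > @last_modified
    return false if last_modified < @last_modified
    !@keys_at_last.include?(key)
  end

  def record(last_modified, key)
    if last_modified > @last_modified
      @last_modified = last_modified
      @keys_at_last  = [key]   # a newer timestamp resets the list
    elsif last_modified == @last_modified
      @keys_at_last << key
    end
  end
end
```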
Dumb workaround if you're using 'delete': set the registry contents to a really old date once in a while.
Yes, that would work even if you're using 'backup'. But it defeats the purpose of automation.
Had the same issue; realized I could set `sincedb_path` to `/dev/null` (i.e. `sincedb_path => "/dev/null"` in the s3 input config) to ignore mtime entirely. This can solve it if you have `delete` enabled, at least, since processed files are removed and can't be re-read.
I've seen this issue as well. The way I understand it: `list_new_files` can run at, say, 08:00:00.04, process a file modified at 08:00:00.03, and write 08:00:00 (second precision) to the sincedb. If a file is then added to the S3 bucket with a modified time of 08:00:00.07, it will not have been picked up by that first run of `list_new_files`, and it gets flagged as "Object Not Modified" on the second run.

I've submitted a PR that defines a cut-off time 2 seconds in the past, so anything modified within 2 seconds of `list_new_files` running is ignored until a later run. It's probably not the most elegant of solutions, but it's worked for us.
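Roughly the shape of it (a sketch of the approach, not the actual PR; the SDK calls and the two-second constant are taken from the description above):

```ruby
require 'aws-sdk-s3'

CUTOFF_SECONDS = 2

# Sketch: never consider objects modified within the last CUTOFF_SECONDS, so
# by the time a given second is processed, every object stamped with it
# already exists and the strict '>' sincedb check can't strand same-second
# siblings across runs.
def list_new_files(bucket, sincedb_time)
  cutoff = Time.now - CUTOFF_SECONDS
  bucket.objects.select do |object|
    object.last_modified > sincedb_time && object.last_modified < cutoff
  end
end
```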
We are seeing similar issues when processing CloudTrail logs that are sent to a centralized bucket from multiple accounts in the format `s3://bucket/AWSLogs/ACCOUNTID/YEAR/MONTH/DAY/`. We occasionally see files from different accounts being put into the bucket with the same last modified date, and when those files are not processed in the same iteration by Logstash they end up being skipped.

We are using the `delete` mode to remove files after processing, so we don't actually need to worry about the sincedb timestamp. To work around the issue we ended up adding a new setting (`sincedb_disabled`) to the input that, when set to `true`, skips the `sincedb.newer` check.
Thank you for reporting the issue. The new release v3.6.0 fixed it.
I've had a long-standing issue with LS 1.5.x leaving files pending in my S3 input bucket and never being read. In tracing this down, this test (https://github.com/logstash-plugins/logstash-input-s3/blob/master/lib/logstash/inputs/s3.rb#L408) looks suspect to me. I am processing S3 files on a batch basis and have multiple files present at startup with identical `last_modified` timestamps (second-level precision). It looks like the `newer` function, using a greater-than comparison, will process the first file it finds with the earliest timestamp but skip over any subsequent ones.

For example: if I'm understanding the code, `file1.gz` will get parsed, but `file2.gz` and `file3.gz` will fail the `>` test and be skipped, leaving them hanging around after the batch run concludes.

Is my interpretation correct, and could this safely be changed to `>=`? I'm using delete-after-read, so re-reading the same file isn't a concern for me.
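For reference, the check in question boils down to something like this (paraphrased from the linked `s3.rb`, not quoted verbatim):

```ruby
require 'time'

# Strict '>' treats an object whose mtime equals the stored sincedb time as
# already seen, so same-second siblings of a processed file get skipped.
def newer?(last_modified, sincedb_time)
  last_modified > sincedb_time
end

# With second-precision mtimes, a file stamped with the very second already
# recorded in sincedb fails the test — the batch case described above.
newer?(Time.parse('2015-01-01 08:00:00Z'),
       Time.parse('2015-01-01 08:00:00Z')) # => false
```

Relaxing it to `>=` would admit those files but, as noted at the top of the thread, re-reads same-timestamp files forever when neither delete nor backup removes them.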