Did a bit more thinking on this one. Changing to `>=` is a problem for folks that aren't using the delete mode, as files with the same timestamp could be read over and over. A better solution would be to extend the sincedb key from just the `last_modified` timestamp to a combination of `last_modified` and the filename.
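In code, that combined key would look something like this (a minimal sketch under assumed names; today's sincedb only stores a timestamp):

```ruby
require 'set'
require 'time'

# Sketch: key the "already processed" record on last_modified PLUS the object
# key, so files sharing one timestamp are each seen exactly once.
seen = Set.new
sincedb_key = ->(last_modified, key) { "#{last_modified.iso8601}/#{key}" }

# The three same-timestamp files from the scenario below.
objects = [
  [Time.parse('2015-01-01 08:00:00Z'), 'file1.gz'],
  [Time.parse('2015-01-01 08:00:00Z'), 'file2.gz'],
  [Time.parse('2015-01-01 08:00:00Z'), 'file3.gz'],
]

objects.each do |last_modified, key|
  next if seen.include?(sincedb_key.call(last_modified, key))
  # ... download and process the object here ...
  seen << sincedb_key.call(last_modified, key)
end
```

Left unpruned, that set grows with every object; the refinement in the next comment keeps only the names at the newest timestamp.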
I have this same issue as well. The problem doesn't occur when files with the same timestamp are processed in the same run of `process_files`. It only happens when, say, `file1.gz` and `file2.gz` are processed and the sincedb timestamp is set to `2015-01-01 08:00:00Z` in one call to `process_files`, and `interval` seconds later `file3.gz` (carrying that same timestamp) is processed in the next call to `process_files`. In that case `file3.gz` is skipped because its timestamp is not strictly greater than (`>`) the sincedb timestamp.
Only by keeping the full list of already processed files that share the `last_modified` timestamp can the next run of `process_files` compare against that list and skip those files (a code sketch of this bookkeeping follows the example below).
**Example**

First call to `process_files`:

| Timestamp | Log Object Key |
|---|---|
| 2016-05-12 14:26:00 | 2016-05-12-12-25-12-D68E3F15AAE3DC9A |
| 2016-05-12 14:26:00 | 2016-05-12-12-25-15-58C0CBDF807CD34B |
| 2016-05-12 14:26:00 | 2016-05-12-12-25-26-5FC01F0ED4E5A430 |

Keep the list (in sincedb?) of already processed files, in this case all three:

`2016-05-12 14:26:00 2016-05-12-12-25-12-D68E3F15AAE3DC9A,2016-05-12-12-25-15-58C0CBDF807CD34B,2016-05-12-12-25-26-5FC01F0ED4E5A430`
Next call to `process_files` (`interval` seconds later): find files with timestamp `>= 2016-05-12 14:26:00` and skip any whose name matches an entry on the list of already processed files:

| Timestamp | Log Object Key | Skip/Process |
|---|---|---|
| 2016-05-12 14:26:00 | 2016-05-12-12-25-12-D68E3F15AAE3DC9A | skip |
| 2016-05-12 14:26:00 | 2016-05-12-12-25-15-58C0CBDF807CD34B | skip |
| 2016-05-12 14:26:00 | 2016-05-12-12-25-26-5FC01F0ED4E5A430 | skip |
| 2016-05-12 14:26:00 | 2016-05-12-12-25-28-FFC2D6CB851FA917 | process |
| 2016-05-12 14:26:00 | 2016-05-12-12-25-31-213B56F74A224293 | process |
| 2016-05-12 14:26:00 | 2016-05-12-12-25-43-BAA23FC4D63F463B | process |
| 2016-05-12 14:26:00 | 2016-05-12-12-25-59-327A55EDBF5CCDF2 | process |

The new list: all 7 files with timestamp `2016-05-12 14:26:00`.

This is an acceptable solution because the list of filenames to save and compare against stays small in practice: only the names sharing the most recent timestamp need to be kept.
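A minimal sketch of that bookkeeping, assuming the sincedb stores the newest timestamp plus the object names already seen at it (class and method names are illustrative, not a real patch):

```ruby
# Sketch: remember only the newest last_modified and the keys seen at that
# exact timestamp, so the stored name list stays small.
class SinceDB
  def initialize
    @last_modified = Time.at(0)
    @keys_at_last  = []
  end

  # Process if strictly newer, or equally new but not yet on the name list.
  def process?(last_modified, key)
    return true  if last_modified > @last_modified
    return false if last_modified < @last_modified
    !@keys_at_last.include?(key)
  end

  def record(last_modified, key)
    if last_modified > @last_modified
      @last_modified = last_modified
      @keys_at_last  = [key]   # a newer timestamp resets the list
    elsif last_modified == @last_modified
      @keys_at_last << key
    end
  end
end
```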
Dumb workaround if you're using 'delete': set the registry contents to a really old date once in a while.
Yes, that would work even if you're using 'backup'. But it defeats the purpose of automation.
Had the same issue; realized I could set `sincedb_path` to `/dev/null` (i.e. `sincedb_path => "/dev/null"` in the s3 input config) to ignore mtime entirely. This can solve it if you have `delete` enabled, at least, since processed files are removed and can't be re-read.
I've seen this issue as well. The way I understand it: `list_new_files` can run at, say, 08:00:00.04, process a file modified at 08:00:00.03, and write 08:00:00 (second precision) to the sincedb. If a file is then added to the S3 bucket with a modified time of 08:00:00.07, it will not have been picked up by that first run of `list_new_files`, and it gets flagged as "Object Not Modified" on the second run.

I've submitted a PR that defines a cut-off time 2 seconds in the past, so anything modified within 2 seconds of `list_new_files` running is ignored until a later run. It's probably not the most elegant of solutions, but it's worked for us.
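Roughly the shape of it (a sketch of the approach, not the actual PR; the SDK calls and the two-second constant are taken from the description above):

```ruby
require 'aws-sdk-s3'

CUTOFF_SECONDS = 2

# Sketch: never consider objects modified within the last CUTOFF_SECONDS, so
# by the time a given second is processed, every object stamped with it
# already exists and the strict '>' sincedb check can't strand same-second
# siblings across runs.
def list_new_files(bucket, sincedb_time)
  cutoff = Time.now - CUTOFF_SECONDS
  bucket.objects.select do |object|
    object.last_modified > sincedb_time && object.last_modified < cutoff
  end
end
```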
We are seeing similar issues when processing CloudTrail logs that are sent to a centralized bucket from multiple accounts in the format `s3://bucket/AWSLogs/ACCOUNTID/YEAR/MONTH/DAY/`. We occasionally see files from different accounts being put into the bucket with the same last modified date, and when those files are not processed in the same iteration by Logstash they end up being skipped.

We are using the `delete` mode to remove files after processing, so we don't actually need to worry about the sincedb timestamp. To work around the issue we ended up adding a new setting (`sincedb_disabled`) to the input that, when set to `true`, skips the `sincedb.newer` check.
Thank you for reporting the issue. The new release v3.6.0 fixed it.
I've had a long-standing issue with LS 1.5.x leaving files pending in my S3 input bucket and never being read. In tracing this down, this test (https://github.com/logstash-plugins/logstash-input-s3/blob/master/lib/logstash/inputs/s3.rb#L408) looks suspect to me. I am processing S3 files on a batch basis and have multiple files present at startup with identical `last_modified` timestamps (second-level precision). It looks like the `newer` function, using a greater-than comparison, will process the first file it finds with the earliest timestamp but skip over any subsequent ones.

For example: if I'm understanding the code, `file1.gz` will get parsed, but `file2.gz` and `file3.gz` will fail the `>` test and be skipped, leaving them hanging around after the batch run concludes.

Is my interpretation correct, and could this safely be changed to `>=`? I'm using delete-after-read, so re-reading the same file isn't a concern for me.
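For reference, the check in question boils down to something like this (paraphrased from the linked `s3.rb`, not quoted verbatim):

```ruby
require 'time'

# Strict '>' treats an object whose mtime equals the stored sincedb time as
# already seen, so same-second siblings of a processed file get skipped.
def newer?(last_modified, sincedb_time)
  last_modified > sincedb_time
end

# With second-precision mtimes, a file stamped with the very second already
# recorded in sincedb fails the test — the batch case described above.
newer?(Time.parse('2015-01-01 08:00:00Z'),
       Time.parse('2015-01-01 08:00:00Z')) # => false
```

Relaxing it to `>=` would admit those files but, as noted at the top of the thread, re-reads same-timestamp files forever when neither delete nor backup removes them.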