logstash-plugins / logstash-output-file


broken gzip file produced sometimes. #79

Open TomonoriSoejima opened 5 years ago

TomonoriSoejima commented 5 years ago
  1. Simply create a conf with the file output plugin with gzip enabled, as below.
output { 
    file { 
        path => "/Users/surfer/elastic/labs/logstash/logstash.config/output/a-%{+YYYYMMddHHmm}.json.gz" 
        codec => "json_lines" 
        gzip => "true" 
    }
}
  2. Ingest a document.

  3. Notice that when you verify the file, it comes back with the messages below.

$ gzip -tv test*
test.json.gz:
gzip: test.json.gz: decompression OK, trailing garbage ignored
 OK
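
For anyone who wants to see the symptom without running logstash, here is a minimal sketch (Python, purely illustrative and not the plugin's code) that produces the same "trailing garbage" warning by appending bytes after an already-finalized gzip stream:

import gzip

# Write one complete gzip stream (header + compressed data + footer).
with gzip.open("test.json.gz", "wb") as gz:
    gz.write(b'{"message": "first event"}\n')

# Append more bytes after the stream has already been finalized.
with open("test.json.gz", "ab") as f:
    f.write(b'{"message": "late event"}\n')

# `gzip -tv test.json.gz` now warns "decompression OK, trailing garbage ignored",
# and decoders that stop at the first stream silently drop the late event.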
colinsurprenant commented 5 years ago

Looking at the file output code, I think I see what may be happening: nothing prevents a file from being correctly closed and then re-opened to append further data. This would normally be harmless for a plain line-appending file, but for a gzipped file, once it is closed, data appended afterwards will be treated as trailing garbage by the gzip decompressor. I am not sure at this point how we should deal with this problem. Some ideas:

colinsurprenant commented 5 years ago

Maybe @guyboertje has ideas for moving this forward?

TomonoriSoejima commented 5 years ago

How about looking at other open source projects that use a gzip library and seeing if there is sample code out there that works around a similar problem?

I know it sounds laborious, but just a thought raised by the user who reported the issue.

colinsurprenant commented 5 years ago

@TomonoriSoejima

Problem

After further investigation, this is what is actually happening. There are a few things to consider:

1- While logstash is running, there is no way to know if an output file is correctly closed

The way the file output works today is to close output files based on inactivity (currently hardcoded to 10 seconds): if an output file (gzipped or not) has not received events for 10s, it should be closed. But there is a caveat here: the inactivity check is done upon reception of events (in the events receive loop). It typically works as intended when there is a constant stream of events and many output files are created (using the string interpolation notation in the path option), because files that are no longer written to are determined inactive and closed. But this strategy does not work when there is one fixed output file, or for the currently written-to file when the stream of events stops for a while: since there are no events flowing, the check is not done and the file will not be closed for inactivity.
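
A rough sketch of that receive-loop-driven check (illustrative Python pseudocode, not the plugin's actual Ruby implementation; all names are made up):

import time

FLUSH_INTERVAL = 10  # hardcoded inactivity timeout described above
open_files = {}      # path -> {"io": file object, "last_write": timestamp}

def receive(path, data):
    # Called once per event: write the event, then piggyback the inactivity sweep.
    entry = open_files.setdefault(path, {"io": open(path, "ab"), "last_write": 0.0})
    entry["io"].write(data)
    entry["last_write"] = time.monotonic()

    # The sweep only runs here, inside the event loop. If events stop
    # arriving, still-open files (gzipped or not) are never closed.
    now = time.monotonic()
    for p, state in list(open_files.items()):
        if now - state["last_write"] > FLUSH_INTERVAL:
            state["io"].close()
            del open_files[p]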

Regardless of the above caveat, in the end, while logstash is running there is no safe way to know if an output file is correctly closed. The only safe way is to stop logstash. This brings us to the second thing to consider:

2- A close operation is necessary to create a valid gzip file.

Without an explicit close operation, the gzip writer will not be able to write the gzip footer and will produce a broken gzip file.
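
To make this concrete, here is a small sketch (Python, illustrative only) that simulates a writer which never reached close() by dropping the 8-byte gzip footer (CRC32 + size):

import gzip

complete = gzip.compress(b'{"message": "event"}\n')
unterminated = complete[:-8]   # strip the footer that close() would have written

gzip.decompress(complete)      # fine
try:
    gzip.decompress(unterminated)
except EOFError as err:
    # The footer is missing, so the stream is reported as ended prematurely.
    print("broken gzip:", err)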

3- A closed file can be reopened.

To add to the confusion, a previously closed file can be reopened. This is another problem that adds to the "there is no safe way to know if an output file is correctly closed" situation. I believe this is generally a bad idea and only works if you consume the output files in a tailing mode. Note that for gzip you cannot tail a file, and closing and re-opening a gzip file for writing actually adds new members inside the gzip container; this is usually seamless because gzip decompressors will typically merge all the individual members into one big stream, but it could create problems with other gzip decompressors.
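
This sketch (again Python, just to illustrate the behaviour described above, not the plugin's code) shows what a close-then-reopen cycle leaves on disk: two independent gzip members in one file, which most decompressors read back as a single concatenated stream:

import gzip

# Two fully closed gzip streams appended to the same .gz path,
# as happens when the output closes a file and later reopens it.
with open("multi.json.gz", "wb") as f:
    f.write(gzip.compress(b'{"message": "before close"}\n'))
    f.write(gzip.compress(b'{"message": "after reopen"}\n'))

# zcat and Python's gzip module concatenate the members transparently...
with gzip.open("multi.json.gz", "rb") as gz:
    print(gz.read())
# ...but a decoder that stops after the first member only sees the first line.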

Conclusion

The file output's current design does not offer good options for figuring out which files are done with and can safely be used by other processes. I will see what can be done about this. One good idea, as @guyboertje suggested in #73, is to use a strategy where a file that is "active"/written-to uses a temporary extension and, upon closing the file, an atomic rename to the final name is performed. This is the safest low-fi cross-process synchronization possible.
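
A minimal sketch of that temp-extension/atomic-rename idea (Python, illustrative; the ".tmp" extension and function name are made up, not necessarily what the plugin would use):

import gzip
import os

def finish_gzip_file(tmp_path, final_path):
    # While the file is active it lives at e.g. "out.json.gz.tmp";
    # only after a proper close is it renamed to its final name.
    os.replace(tmp_path, final_path)   # atomic on POSIX, same filesystem

tmp = "out.json.gz.tmp"
with gzip.open(tmp, "wb") as gz:       # close() writes the footer
    gz.write(b'{"message": "event"}\n')
finish_gzip_file(tmp, "out.json.gz")
# Consumers that only look for *.json.gz never see a half-written file.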

nit23uec commented 5 years ago

@colinsurprenant, I've tried to address this issue in a PR: https://github.com/logstash-plugins/logstash-output-file/pull/82. The strategy is basically to write continuous events to a temp file and then move the contents of the temp file to the original gzip file on the flush interval. As this requires reading the temp file, copying its contents to the gzip file, and then truncating the temp file, the operation is non-atomic, but it still reduces the probability of getting a broken gzip file considerably. It would be great if you can provide your feedback on the PR! Thanks.
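
Roughly, the approach described above looks like this (an illustrative Python sketch of the idea, not the PR's actual code; names are made up):

import gzip
import os

def flush_temp_into_gzip(tmp_path, gz_path):
    # On each flush interval: read what accumulated in the plain temp file,
    # append it to the .gz target as one complete member, then truncate.
    with open(tmp_path, "rb") as tmp:
        pending = tmp.read()
    if pending:
        with open(gz_path, "ab") as out:
            out.write(gzip.compress(pending))
        os.truncate(tmp_path, 0)   # the read/append/truncate window is the non-atomic part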

Djeezus commented 3 years ago

Hi all, I have a somewhat related issue: two logstash instances concurrently gzip-appending to the same file. The output file becomes corrupt too. But is this even supposed to work at all with compression? I've had no problems concurrently appending without compression, probably because the write() operations properly use O_APPEND, right?
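
To illustrate why concurrent gzip appends go wrong even though plain-text O_APPEND appends interleave cleanly, here is a small sketch (Python, simulating the interleaving of two writers' byte streams rather than actually running two processes):

import gzip

a = gzip.compress(b'{"message": "from instance A"}\n')
b = gzip.compress(b'{"message": "from instance B"}\n')

# Plain text: each appended line is a self-contained record, so interleaving is harmless.
# Gzip: each writer emits its stream in several writes (header, data, footer);
# interleave those writes and the bytes no longer form a valid gzip stream.
interleaved = a[:10] + b[:10] + a[10:] + b[10:]
try:
    gzip.decompress(interleaved)
except Exception as err:   # zlib.error / BadGzipFile, depending on where it breaks
    print("corrupt:", err)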

andsel commented 2 years ago

@Djeezus I think your problem is quite different from this one: two processes can't write to the same gzip output file, or in general to any output file, without corrupting it. The two processes have to be synchronized, and that use case, I think, is outside the scope of the file output plugin.