Closed mashhurs closed 2 years ago
Hello, I saw that the fix #249 has been merged into the logstash-plugins/logstash-integration-aws repository. Could you let me know when this fix will be merged into the logstash-plugins/logstash-output-s3 repository, or whether it has already been released in a particular version of Logstash?
Issue description
When using the GZIP encoding option with the AWS S3 output plugin, there are cases where Logstash may crash. When Logstash crashes, the GZIP stream is left open and the file has no tail (trailer). At restart, Logstash uploads the corrupted file to S3, and customers who download and use that file discover it is corrupted. This PR aims to recover the corrupted file at restart time and upload a healthy GZIP file to S3.
FYI: I have asked for details on where/when/how Logstash crashed and will investigate once I get a response.
Acceptance Criteria
Logstash should always upload healthy GZIP/plain text files to AWS S3.
Solution explanation
See the Additional Notes section for details: option 4 is recommended and was implemented.
Testing
Unit testing
The GzipUtilTest unit test class covers recovery, compression, and decompression success and failure cases.
E2E testing
Use the S3 output setting, e.g.:
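The exact setting was not preserved in this extract. Purely as an illustrative sketch (region and rotation limits are assumptions; the bucket name is taken from the download step below), an S3 output with GZIP encoding could look like:

```
output {
  s3 {
    region    => "us-east-1"                # assumption: use your bucket's region
    bucket    => "logstash-mashhur-test"
    encoding  => "gzip"                     # the option this fix concerns
    size_file => 1048576                    # illustrative: rotate after ~1 MB
    time_file => 1                          # illustrative: rotate every minute
  }
}
```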
Kill Logstash with kill -9 PID, where PID can be fetched with ps -la | grep logstash. After restart, download the uploaded object:
aws s3 cp s3://logstash-mashhur-test/path-to-file.txt.gz /local/path/to/download/
gunzip downloaded-file.txt.gz
and open the file.
Additional Notes
Discussed solutions:
1. [Gzip recovery] Pull out the GZIP header and decompress healthy blocks till we find a tail, in plain Ruby. Ruby's gzip classes (GzipReader, GzipWriter) do not allow access to GZIP blocks/headers byte-by-byte; they always validate the file on any read action. - NOT FEASIBLE.
2. Logstash writes plain text and zips the file only when uploading to S3. However, with output -> s3 -> size_file the users' intention is the size of the ZIP file, not of the ongoing plain-text file. Since Logstash loses that accuracy, this option is low priority.
3. External tool: gz-recover works as expected and is OSS. However, in the long term (e.g. if security vulnerabilities are found) this option could become costly, since we might need to contribute ourselves or deal with poor maintenance. - low priority.
4. [Gzip recovery - RECOMMENDED, IMPLEMENTED] Pull out the GZIP header and decompress healthy blocks till we find a tail, in Java. GZIPInputStream behaves the same whenever a stream starts, so we do not need explicit header or tail lookups; instead, catching the "Unexpected end of ZLIB input stream" exception satisfies our condition. Other plugins (input-beats, input-http) have already taken the initiative to move to such an environment, so this change also includes the Java-Ruby mixin environment setup.
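The recommended Java approach can be sketched as follows. This is a minimal illustration of the idea, not the plugin's actual implementation: class and method names are hypothetical, and it assumes the corruption is a truncated tail (the header is intact), so GZIPInputStream raises EOFException ("Unexpected end of ZLIB input stream") once it runs out of compressed data.

```java
import java.io.*;
import java.util.zip.*;

// Hypothetical sketch: decompress a possibly truncated .gz file, keep every
// byte that inflates cleanly, stop at the truncation point, and re-compress
// the recovered bytes into a healthy gzip file.
public class GzipRecoverySketch {
    public static void recover(File corrupted, File recovered) throws IOException {
        ByteArrayOutputStream healthy = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(new FileInputStream(corrupted))) {
            byte[] buf = new byte[8192];
            int n;
            try {
                while ((n = in.read(buf)) != -1) {
                    healthy.write(buf, 0, n); // this chunk decompressed cleanly
                }
            } catch (EOFException e) {
                // "Unexpected end of ZLIB input stream": the file was cut off
                // mid-stream; keep everything decompressed so far.
            }
        }
        try (GZIPOutputStream out = new GZIPOutputStream(new FileOutputStream(recovered))) {
            healthy.writeTo(out); // rewrite the recovered bytes as a well-formed gzip file
        }
    }
}
```

The resulting file always carries a valid header and trailer, so a later gunzip (as in the E2E steps above) succeeds even though some tail data from the crash is lost.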