logstash-plugins / logstash-output-s3

Apache License 2.0

Recover corrupted GZIP files, setup of Java-Ruby mixed environment. #249

Closed mashhurs closed 2 years ago

mashhurs commented 2 years ago

Issue description

When using the GZIP encoding option with the AWS S3 output plugin, there are cases where Logstash may crash. When Logstash crashes, the GZIP stream is left open and the file has no trailer. At restart, Logstash uploads the corrupted file to S3, and customers who download and use the S3 object discover that it is corrupted. This PR aims to recover the corrupted file at restart time and upload a healthy GZIP file to S3.

FYI: I have requested details on where/when/how Logstash crashed and will investigate once I get a response.
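
A minimal pipeline configuration for the affected path looks like the following (bucket name and region are placeholders, not values from this issue):

```
output {
  s3 {
    bucket    => "my-bucket"    # placeholder
    region    => "us-east-1"    # placeholder
    encoding  => "gzip"         # gzip-compress local temporary files before upload
    size_file => 10485760       # rotate/upload when the file reaches ~10 MB
  }
}
```

A crash between writes leaves the current temporary file without a GZIP trailer, which is the corruption this PR recovers from.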

Acceptance Criteria

Logstash should always upload healthy GZIP/plain text files to AWS S3.

Solution explanation

See the Additional Notes section below for details; option 4 is recommended and implemented.

Testing

Unit testing

E2E testing

Additional Notes

Discussed solutions:

  1. [Gzip recovery] Pull out the GZIP header and decompress healthy blocks until we find a tail, in plain Ruby.

    • Ruby's GZIP interfaces (GzipReader, GzipWriter) do not allow accessing GZIP blocks/headers byte by byte; they always validate the file on any read action. - NOT FEASIBLE.
  2. Logstash writes plain text locally and compresses the file only when uploading to S3.

    • In this case, the size policy will not work properly: when users set output -> s3 -> size_file, their intention is the compressed GZIP file size, not the size of the ongoing plain-text file. Since Logstash would lose accuracy here, this option is low priority.
  3. External tool

    • Existing tools such as gz-recover would work as expected, and they are open source. However, in the long term (e.g. if security vulnerabilities are found) this option could be costly, since we might need to contribute fixes ourselves or be exposed to poor maintenance. - Low priority.
  4. [Gzip recovery - RECOMMENDED, IMPLEMENTED] Pull out the GZIP header and decompress healthy blocks till we find a tail in Java.

    • Java provides good input stream interfaces for byte-level access to a file. Initially, I read the file and pulled out the header manually, but the GZIPInputStream interface already does this whenever a stream starts, so we do not need explicit header or tail lookups. Instead, catching the Unexpected end of ZLIB input stream exception satisfies our condition.
    • However, this solution requires the plugin environment to be a mixed Ruby-Java one. Other plugins (input-beats, input-http) have already taken the initiative to move to such an environment, so this change also includes the Java-Ruby mixed environment setup.
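
The recovery idea in option 4 can be sketched as follows. This is not the PR's actual code; the class name `GzipRecovery` and method `recover` are invented for illustration, and only standard `java.util.zip` APIs are used. The sketch decompresses as many healthy bytes as possible, swallows the end-of-stream exception, and re-compresses the recovered bytes into a valid GZIP file:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipRecovery {

    // Decompress as many healthy bytes as possible from a possibly
    // truncated gzip stream, then re-compress them into a valid file.
    public static byte[] recover(byte[] corrupted) throws IOException {
        ByteArrayOutputStream recovered = new ByteArrayOutputStream();
        try (GZIPInputStream gz =
                 new GZIPInputStream(new ByteArrayInputStream(corrupted))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = gz.read(buf)) != -1) {
                recovered.write(buf, 0, n);
            }
        } catch (EOFException e) {
            // "Unexpected end of ZLIB input stream": the stream was cut off
            // (e.g. trailer missing after a crash); keep what decompressed.
        }
        ByteArrayOutputStream healthy = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(healthy)) {
            out.write(recovered.toByteArray());
        }
        return healthy.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Simulate a crash: write a gzip payload, then drop its 8-byte trailer.
        ByteArrayOutputStream full = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(full)) {
            out.write("line1\nline2\n".getBytes("UTF-8"));
        }
        byte[] bytes = full.toByteArray();
        byte[] truncated = java.util.Arrays.copyOf(bytes, bytes.length - 8);

        byte[] fixed = recover(truncated);

        // The recovered output must decompress cleanly end-to-end.
        try (GZIPInputStream check =
                 new GZIPInputStream(new ByteArrayInputStream(fixed))) {
            ByteArrayOutputStream plain = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = check.read(buf)) != -1) {
                plain.write(buf, 0, n);
            }
            System.out.println(plain.toString("UTF-8").trim());
        }
    }
}
```

Note that `GZIPInputStream` returns the bytes of a read before it validates the trailer, so a file truncated only at the trailer loses no payload; a file truncated mid-block keeps everything up to the last complete deflate data it could inflate.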
xiaomaisuii commented 1 year ago

Hello, I saw that the fix in #249 has been merged into the logstash-plugins/logstash-integration-aws repository. Could you let me know when this fix will be merged into the logstash-plugins/logstash-output-s3 repository, or whether it has already been fixed in a particular version of Logstash?