logstash-plugins / logstash-output-s3

Apache License 2.0

Recover corrupted GZIP files, setup of Java-Ruby mixed environment. #249

Closed mashhurs closed 2 years ago

mashhurs commented 2 years ago

Issue description

When using the GZIP encoding option with the AWS S3 output plugin, there are cases where Logstash may crash. When Logstash crashes, the GZIP stream is left open and the file has no trailer. At restart, Logstash uploads the corrupted file to S3, and customers who download and use the S3 object discover that it is corrupted. This PR aims to recover the corrupted file at restart time and upload a healthy GZIP file to S3.

FYI: I have requested details on where/when/how Logstash crashed and will investigate once I get a response.
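
A minimal pipeline configuration for the affected path looks like the following (bucket name and region are placeholders, not values from this issue):

```
output {
  s3 {
    bucket    => "my-bucket"    # placeholder
    region    => "us-east-1"    # placeholder
    encoding  => "gzip"         # gzip-compress local temporary files before upload
    size_file => 10485760       # rotate/upload when the file reaches ~10 MB
  }
}
```

A crash between writes leaves the current temporary file without a GZIP trailer, which is the corruption this PR recovers from.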

Acceptance Criteria

Logstash should always upload healthy GZIP/plain text files to AWS S3.

Solution explanation

See the Additional Notes section below for details; option 4 is recommended and implemented.

Testing

Unit testing

E2E testing

Additional Notes

Discussed solutions:

  1. [Gzip recovery] Pull out the GZIP header and decompress healthy blocks until we find a tail, in plain Ruby.

    • Ruby's GZIP interfaces (GzipReader, GzipWriter) do not allow accessing GZIP blocks/headers byte by byte; they always validate the file on any read action. - NOT FEASIBLE.
  2. Logstash writes plain text locally and compresses the file only when uploading to S3.

    • In this case, the size policy will not work properly: when users set output -> s3 -> size_file, their intention is the compressed GZIP file size, not the size of the ongoing plain-text file. Since Logstash would lose accuracy here, this option is low priority.
  3. External tool

    • Existing tools such as gz-recover would work as expected, and they are open source. However, in the long term (e.g. if security vulnerabilities are found) this option could be costly, since we might need to contribute fixes ourselves or be exposed to poor maintenance. - Low priority.
  4. [Gzip recovery - RECOMMENDED, IMPLEMENTED] Pull out the GZIP header and decompress healthy blocks till we find a tail in Java.

    • Java provides good input stream interfaces for byte-level access to a file. Initially, I read the file and pulled out the header manually, but the GZIPInputStream interface already does this whenever a stream starts, so we do not need explicit header or tail lookups. Instead, catching the Unexpected end of ZLIB input stream exception satisfies our condition.
    • However, this solution requires the plugin environment to be a mixed Ruby-Java one. Other plugins (input-beats, input-http) have already taken the initiative to move to such an environment, so this change also includes the Java-Ruby mixed environment setup.
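
The recovery idea in option 4 can be sketched as follows. This is not the PR's actual code; the class name `GzipRecovery` and method `recover` are invented for illustration, and only standard `java.util.zip` APIs are used. The sketch decompresses as many healthy bytes as possible, swallows the end-of-stream exception, and re-compresses the recovered bytes into a valid GZIP file:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipRecovery {

    // Decompress as many healthy bytes as possible from a possibly
    // truncated gzip stream, then re-compress them into a valid file.
    public static byte[] recover(byte[] corrupted) throws IOException {
        ByteArrayOutputStream recovered = new ByteArrayOutputStream();
        try (GZIPInputStream gz =
                 new GZIPInputStream(new ByteArrayInputStream(corrupted))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = gz.read(buf)) != -1) {
                recovered.write(buf, 0, n);
            }
        } catch (EOFException e) {
            // "Unexpected end of ZLIB input stream": the stream was cut off
            // (e.g. trailer missing after a crash); keep what decompressed.
        }
        ByteArrayOutputStream healthy = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(healthy)) {
            out.write(recovered.toByteArray());
        }
        return healthy.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Simulate a crash: write a gzip payload, then drop its 8-byte trailer.
        ByteArrayOutputStream full = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(full)) {
            out.write("line1\nline2\n".getBytes("UTF-8"));
        }
        byte[] bytes = full.toByteArray();
        byte[] truncated = java.util.Arrays.copyOf(bytes, bytes.length - 8);

        byte[] fixed = recover(truncated);

        // The recovered output must decompress cleanly end-to-end.
        try (GZIPInputStream check =
                 new GZIPInputStream(new ByteArrayInputStream(fixed))) {
            ByteArrayOutputStream plain = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = check.read(buf)) != -1) {
                plain.write(buf, 0, n);
            }
            System.out.println(plain.toString("UTF-8").trim());
        }
    }
}
```

Note that `GZIPInputStream` returns the bytes of a read before it validates the trailer, so a file truncated only at the trailer loses no payload; a file truncated mid-block keeps everything up to the last complete deflate data it could inflate.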
xiaomaisuii commented 1 year ago

Hello, I saw that the fix in #249 has been merged into the logstash-plugins/logstash-integration-aws repository. Could you let me know when this fix will be merged into the logstash-plugins/logstash-output-s3 repository, or whether it has already been fixed in a particular version of Logstash?