logstash-plugins / logstash-input-s3


s3 input with cloudtrail codec not working with gzipped files #78

Closed: kevinpauli closed this issue 2 years ago

kevinpauli commented 8 years ago

Not sure if this is a problem with the s3 input plugin or the cloudtrail codec, but I can't get the s3 input with the cloudtrail codec working when the file is gzipped (which is the default for CloudTrail). It does work if I download the file, unzip it, and upload the unzipped copy to a different S3 bucket.

logstash 2.2.2
logstash-input-s3 2.0.4
logstash-codec-cloudtrail 2.0.2

I started out with a normal cloudtrail bucket created by AWS, and a simple config like this:

input {   
    s3 {
        bucket => "cloudtrail-logs" 
        codec => cloudtrail {}
    }
}

output {
    stdout { codec => rubydebug }
}

When I run logstash with --debug, I see this:

S3 input: Adding to objects[] {:key=>"AWSLogs/blahblah/CloudTrail/us-east-1/2016/03/15/blahblah_CloudTrail_us-east-1_20160315T1520Z_qQm4gunsTnNuJosk.json.gz", :level=>:debug, :file=>"logstash/inputs/s3.rb", :line=>"116", :method=>"list_new_files"}
S3 input processing {:bucket=>"cloud-analytics-platform-cloudtrail-logs", :key=>"AWSLogs/blahblah/CloudTrail/us-east-1/2016/03/15/blahblah_CloudTrail_us-east-1_20160315T1520Z_qQm4gunsTnNuJosk.json.gz", :level=>:debug, :file=>"logstash/inputs/s3.rb", :line=>"150", :method=>"process_files"}
S3 input: Download remote file {:remote_key=>"AWSLogs/blahblah/CloudTrail/us-east-1/2016/03/15/blahblah_CloudTrail_us-east-1_20160315T1520Z_qQm4gunsTnNuJosk.json.gz", :local_filename=>"/var/folders/8f/1bjm5vq53c73tjq0yl4560dj1r5f6h/T/logstash/blahblah_CloudTrail_us-east-1_20160315T1520Z_qQm4gunsTnNuJosk.json.gz", :level=>:debug, :file=>"logstash/inputs/s3.rb", :line=>"344", :method=>"download_remote_file"}
Processing file {:filename=>"/var/folders/8f/1bjm5vq53c73tjq0yl4560dj1r5f6h/T/logstash/blahblah_CloudTrail_us-east-1_20160315T1520Z_qQm4gunsTnNuJosk.json.gz", :level=>:debug, :file=>"logstash/inputs/s3.rb", :line=>"182", :method=>"process_local_log"}
Pushing flush onto pipeline {:level=>:debug, :file=>"logstash/pipeline.rb", :line=>"450", :method=>"flush"}
Pushing flush onto pipeline {:level=>:debug, :file=>"logstash/pipeline.rb", :line=>"450", :method=>"flush"}
Pushing flush onto pipeline {:level=>:debug, :file=>"logstash/pipeline.rb", :line=>"450", :method=>"flush"}

And it just keeps printing that last line over and over and never does anything else. If I go look in /var/folders/8f/1bjm5vq53c73tjq0yl4560dj1r5f6h/T/logstash/ I do indeed see a gzipped file, blahblah_CloudTrail_us-east-1_20160315T1520Z_qQm4gunsTnNuJosk.json.gz.

Now, if I unzip this file, create a test bucket, put the unzipped file into it, and run logstash pointing at that bucket, it works fine!

According to the docs at https://www.elastic.co/guide/en/logstash/current/plugins-inputs-s3.html, if the filename ends in .gz then the s3 input should handle it automatically.
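
For reference, the extension-based handling the docs describe boils down to something along these lines in Ruby (a minimal standalone sketch, not the plugin's actual source; read_local_log is a made-up name for illustration):

require "zlib"

# Sketch of extension-based gzip handling (illustration only, not the plugin's
# code): pick a reader based on the ".gz" suffix and yield each line.
def read_local_log(filename)
  if filename.end_with?(".gz")
    Zlib::GzipReader.open(filename) { |gz| gz.each_line { |line| yield line } }
  else
    File.foreach(filename) { |line| yield line }
  end
end

read_local_log("cloudtrail.json.gz") { |line| puts line }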

hrzbrg commented 8 years ago

I also experience this behaviour and would love some input on this. How can we fix it? The S3 input handles our RDS Logs from AWS just fine.

levimccormick commented 8 years ago

I'm also having the same issue, with logstash-input-s3 version 3.1.1.

DewaldV commented 6 years ago

So having run into this issue, I thought I'd share my findings.

In my case the file was eventually processed, but it took over an hour. After some debugging and stack trace dumping I found that the S3 input was getting stuck on io.each_line in s3.rb:258.

I did some testing on that block of code that inflates the gzip file and discovered that replacing .each_line with .read solved the problem instantly. I then proceeded to test the current code and a version with .read on a newer version of JRuby (specifically the version shipped with Logstash 6.0.1) and found the problem had disappeared.
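
To make the difference concrete, the two approaches compared above look roughly like this with Ruby's Zlib (a standalone sketch; handle_line is a placeholder, not part of the plugin):

require "zlib"

# Stream the decompressed data line by line: memory stays bounded even for
# very large archives, but each line is a separate read from the GzipReader.
Zlib::GzipReader.open("cloudtrail.json.gz") do |gz|
  gz.each_line { |line| handle_line(line) }
end

# Inflate the whole file in one call and split afterwards: the entire
# decompressed payload is held in memory, but there is only a single read.
Zlib::GzipReader.open("cloudtrail.json.gz") do |gz|
  gz.read.each_line { |line| handle_line(line) }
end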

So for those experiencing this problem: it is resolved, if only incidentally, by upgrading to Logstash 6.0.1.

Hope the information helps with solving this in the future.

yaauie commented 6 years ago

@DewaldV since you mentioned having a fix, would you care to submit a patch and provide example inputs? Namely, it'd be interesting to see whether your gzip files are one large chunk or a chunk per line.

From what I can tell, the current use of Zlib::GzipReader#each_line is meant to support partial processing of a gzip chunk (e.g., when the file is the result of gzipping a 5GB log file). While it may be faster to use Zlib::GzipReader#read in some cases (e.g., when the file is the result of appending each line as its own gzipped chunk), we do need to consider both cases.
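
For concreteness, the two layouts being distinguished here can be produced like this (an illustration only, not code from the plugin or from CloudTrail):

require "zlib"

# (a) One gzip member covering the whole file, i.e. what `gzip big.log` produces.
Zlib::GzipWriter.open("single-member.log.gz") do |gz|
  gz.write(File.read("big.log"))
end

# (b) Each line appended as its own gzip member (concatenated gzip chunks).
File.open("per-line.log.gz", "wb") do |f|
  File.foreach("big.log") { |line| f.write(Zlib.gzip(line)) }
end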

It may be worth exposing the option to process whole gzip chunks as a configurable parameter.
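
If such a parameter were added, the configuration might end up looking something like this (gzip_whole_file is a purely hypothetical option name used for illustration; it does not exist in the plugin today):

input {
    s3 {
        bucket => "cloudtrail-logs"
        codec => cloudtrail {}
        # hypothetical switch: inflate each .gz object in a single read instead of line by line
        gzip_whole_file => true
    }
}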

yaauie commented 6 years ago

This may be resolved by #127 and the 3.2.0 release of logstash-input-s3, which resolves some performance and memory issues when using gzip.

bin/logstash-plugin update logstash-input-s3