logstash-plugins / logstash-input-google_cloud_storage

Apache License 2.0

gzip files with 'dX %' in it are not detected by mime type #14

Open thijsvandergugten opened 3 years ago

thijsvandergugten commented 3 years ago

Logstash information:

  1. Logstash version: 7.11.1
  2. Logstash installation source: docker (docker.elastic.co/logstash/logstash-oss:7.11.1)
  3. How is Logstash being run: docker
  4. How was the Logstash Plugin installed: with the line RUN logstash-plugin install logstash-input-google_cloud_storage in the Dockerfile

JVM (e.g. java -version): OpenJDK 64-Bit Server VM 11.0.8+10

OS version: Ubuntu 18.04 LTS

Description of the problem including expected versus actual behavior:

If a gzip file contains the magic string 'dX %', it is not processed correctly: it is detected as audio/vnd.dts.hd instead of application/gzip, so the plugin reads it as plain text (see read_plain_lines in the stack trace below). In https://github.com/logstash-plugins/logstash-input-google_cloud_storage/blob/master/lib/logstash/inputs/cloud_storage/file_reader.rb#L26, the snippet

def self.gzip?(filename)
  magic = MimeMagic.by_magic(::File.open(filename))
  magic ? magic.subtype == "gzip" : false
end

uses code from https://github.com/mimemagicrb/mimemagic/blob/master/lib/mimemagic.rb#L84. As far as I can see, whenever a file contains the magic string 'dX %', it is recognized as audio/vnd.dts.hd, so the subtype check against "gzip" fails.
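For illustration, here is a minimal sketch of a check that does not depend on MIME table ordering. It assumes it is acceptable to look only at the first two bytes of the file, which are 0x1f 0x8b for gzip streams (RFC 1952). The name gzip_by_header? is hypothetical and is not part of the plugin or of mimemagic.

```ruby
# Hypothetical helper, not the plugin's code: gzip members begin with the
# two ID bytes 0x1f 0x8b, so only the start of the file is inspected.
GZIP_MAGIC = "\x1f\x8b".b

def gzip_by_header?(filename)
  # The block form closes the file handle; read(2) returns nil for an
  # empty file, which simply compares unequal to the magic bytes.
  ::File.open(filename, 'rb') { |f| f.read(2) == GZIP_MAGIC }
end
```

Because nothing after the second byte is read, a 'dX %' sequence later in the file cannot change the result. Whether such a check should replace the MimeMagic call is a decision for the plugin maintainers.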

Steps to reproduce:

  1. Try to parse a gzipped file that contains the string 'dX %' (the magic string for the filetype audio/vnd.dts.hd)
  2. Observe the logging below.

Logs:

[2020-10-02T16:16:53,500][ERROR][logstash.javapipeline    ][main][db1dc633a0e5eeb4e59aa152d277f51da22d98b38e631f12070531c57eaeabe8] A plugin had an unrecoverable error. Will restart this plugin.
  Pipeline_id:main
  Plugin: <LogStash::Inputs::GoogleCloudStorage bucket_id=>"...", json_key_file=>"...", codec=><LogStash::Codecs::JSONLines id=>"json_lines_b7600074-64a5-4ec4-b5b6-ab34acb20332", enable_metric=>true, charset=>"UTF-8", delimiter=>"\n">, interval=>300, id=>"db1dc633a0e5eeb4e59aa152d277f51da22d98b38e631f12070531c57eaeabe8", delete=>true, file_matches=>".*log.gz", enable_metric=>true, file_exclude=>"^$", metadata_key=>"x-goog-meta-ls-gcs-input", unpack_gzip=>true, temp_directory=>"/tmp/ls-in-gcs">
  Error: invalid byte sequence in UTF-8
  Exception: ArgumentError
  Stack: org/jruby/RubyString.java:4225:in `split'
org/logstash/common/BufferedTokenizerExt.java:78:in `extract'
/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-codec-json_lines-3.0.6/lib/logstash/codecs/json_lines.rb:40:in `decode'
/usr/share/logstash/logstash-core/lib/logstash/codecs/delegator.rb:62:in `block in decode'
org/logstash/instrument/metrics/AbstractSimpleMetricExt.java:65:in `time'
org/logstash/instrument/metrics/AbstractNamespacedMetricExt.java:64:in `time'
/usr/share/logstash/logstash-core/lib/logstash/codecs/delegator.rb:61:in `decode'
/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-input-google_cloud_storage-0.11.1-java/lib/logstash/inputs/google_cloud_storage.rb:111:in `extract_event'
/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-input-google_cloud_storage-0.11.1-java/lib/logstash/inputs/google_cloud_storage.rb:97:in `block in download_and_process'
/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-input-google_cloud_storage-0.11.1-java/lib/logstash/inputs/cloud_storage/file_reader.rb:33:in `block in read_plain_lines'
org/jruby/RubyIO.java:3329:in `each'
/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-input-google_cloud_storage-0.11.1-java/lib/logstash/inputs/cloud_storage/file_reader.rb:32:in `read_plain_lines'
/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-input-google_cloud_storage-0.11.1-java/lib/logstash/inputs/cloud_storage/file_reader.rb:20:in `read_lines'
/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-input-google_cloud_storage-0.11.1-java/lib/logstash/inputs/google_cloud_storage.rb:96:in `block in download_and_process'
/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-input-google_cloud_storage-0.11.1-java/lib/logstash/inputs/cloud_storage/blob_adapter.rb:72:in `with_downloaded'
/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-input-google_cloud_storage-0.11.1-java/lib/logstash/inputs/google_cloud_storage.rb:93:in `download_and_process'
/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-input-google_cloud_storage-0.11.1-java/lib/logstash/inputs/google_cloud_storage.rb:70:in `block in list_download_process'
/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-input-google_cloud_storage-0.11.1-java/lib/logstash/inputs/google_cloud_storage.rb:86:in `block in list_processable_blobs'
/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-input-google_cloud_storage-0.11.1-java/lib/logstash/inputs/cloud_storage/client.rb:23:in `block in list_blobs'
/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-input-google_cloud_storage-0.11.1-java/lib/logstash/inputs/cloud_storage/client.rb:22:in `list_blobs'
/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-input-google_cloud_storage-0.11.1-java/lib/logstash/inputs/google_cloud_storage.rb:85:in `list_processable_blobs'
/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-input-google_cloud_storage-0.11.1-java/lib/logstash/inputs/google_cloud_storage.rb:68:in `list_download_process'
/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-input-google_cloud_storage-0.11.1-java/lib/logstash/inputs/google_cloud_storage.rb:61:in `block in run'
/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/stud-0.0.23/lib/stud/interval.rb:20:in `interval'
/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-input-google_cloud_storage-0.11.1-java/lib/logstash/inputs/google_cloud_storage.rb:60:in `run'
/usr/share/logstash/logstash-core/lib/logstash/java_pipeline.rb:346:in `inputworker'
/usr/share/logstash/logstash-core/lib/logstash/java_pipeline.rb:337:in `block in start_input'

daxxog commented 2 years ago

FWIW I was able to produce a dirty Dockerfile patch which appears to resolve this issue.

RUN cat /usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/mimemagic-0.4.3/lib/mimemagic/tables.rb | \
    sed 's/common_types = \[/common_types = \["application\/gzip",/g' | \
    tee /usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/mimemagic-0.4.3/lib/mimemagic/tables.patched.rb \
    && mv /usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/mimemagic-0.4.3/lib/mimemagic/tables.patched.rb \
    /usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/mimemagic-0.4.3/lib/mimemagic/tables.rb

Related to https://github.com/mimemagicrb/mimemagic/issues/36. I think it has to do with the "priority" of MIME magic checks. As used in this Logstash plugin, a file is either gzip or it is not, so my patch just puts gzip at the top of the "common types" list. An issue should probably be opened in mimemagicrb/mimemagic about this.
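To make the "priority" idea concrete, here is a toy model of a first-match-wins lookup. Everything in it (Matcher, MATCHERS, first_match, prioritize, and the two matching rules) is invented for illustration and is not mimemagic's real tables or API. It only shows why moving gzip to the front of an ordered list could change the detected type.

```ruby
# Toy model only: NOT mimemagic's internals. Each matcher pairs a MIME type
# with a made-up predicate over the file's bytes.
Matcher = Struct.new(:type, :check)

MATCHERS = [
  Matcher.new('audio/vnd.dts.hd', ->(bytes) { bytes.include?('dX %'.b) }),
  Matcher.new('application/gzip', ->(bytes) { bytes.start_with?("\x1f\x8b".b) })
].freeze

# Return the type of the first matcher that accepts the bytes, or nil.
def first_match(matchers, bytes)
  hit = matchers.find { |m| m.check.call(bytes) }
  hit && hit.type
end

# Return a new list with the given type moved to the front, which is the
# effect the sed patch above aims for in the "common types" list.
def prioritize(matchers, type)
  matchers.partition { |m| m.type == type }.flatten(1)
end

# Bytes that start like gzip but also contain 'dX %' further in.
sample = "\x1f\x8b\x08\x00".b + 'xx dX % yy'.b
puts first_match(MATCHERS, sample)
puts first_match(prioritize(MATCHERS, 'application/gzip'), sample)
```

In this toy model the first lookup reports audio/vnd.dts.hd and the second reports application/gzip, because only the order of the candidates differs.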