logstash-plugins / logstash-input-google_cloud_storage

Apache License 2.0

processed_db_path is not working. Duplicated data in App Search #8

Closed ivankozlovcodes closed 4 years ago

ivankozlovcodes commented 4 years ago

Hi guys,

The plugin documentation page says that setting processed_db_path should allow a user without update permissions to run the pipeline.

Config file

input {
    google_cloud_storage {
        bucket_id => "middleware-bucket"
        file_matches => "JAMA/jamanetwork.com.json"
        json_key_file => "/content/adc.json"
        codec => "json_lines"
        processed_db_path => "/tmp/logstash_process_db"
    }
}

output {
    elastic_app_search {
      api_key => "private-xxxxxx"
      engine => "aivscovid19"
      url => "xxxxx"
    }
}

Here /tmp/logstash_process_db is an empty directory. The bucket file contains ~800 JSON objects, but App Search shows 17k documents indexed.

Running logstash command:

./logstash-7.6.2/bin/logstash -f ./configs/elasticsearch.yml -w 10

Output:

OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in version 9.0 and will likely be removed in a future release.
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by com.headius.backport9.modules.Modules (file:/content/logstash-7.6.2/logstash-core/lib/jars/jruby-complete-9.2.9.0.jar) to method sun.nio.ch.NativeThread.signal(long)
WARNING: Please consider reporting this to the maintainers of com.headius.backport9.modules.Modules
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
Sending Logstash logs to /content/logstash-7.6.2/logs which is now configured via log4j2.properties
[2020-04-23T04:47:09,271][WARN ][logstash.config.source.multilocal] Ignoring the 'pipelines.yml' file because modules or command line options are specified
[2020-04-23T04:47:09,593][INFO ][logstash.runner          ] Starting Logstash {"logstash.version"=>"7.6.2"}
[2020-04-23T04:47:12,163][INFO ][org.reflections.Reflections] Reflections took 59 ms to scan 1 urls, producing 20 keys and 40 values 
[2020-04-23T04:47:12,596][INFO ][logstash.inputs.googlecloudstorage] Using version 0.9.x input plugin 'google_cloud_storage'. This plugin should work but would benefit from use by folks like you. Please let us know if you find bugs or have suggestions on how to improve this plugin.
[2020-04-23T04:47:13,753][WARN ][org.logstash.instrument.metrics.gauge.LazyDelegatingGauge][main] A gauge metric of an unknown type (org.jruby.RubyArray) has been created for key: cluster_uuids. This may result in invalid serialization.  It is recommended to log an issue to the responsible developer/development team.
[2020-04-23T04:47:13,798][INFO ][logstash.javapipeline    ][main] Starting pipeline {:pipeline_id=>"main", "pipeline.workers"=>10, "pipeline.batch.size"=>125, "pipeline.batch.delay"=>50, "pipeline.max_inflight"=>1250, "pipeline.sources"=>["/content/configs/elasticsearch.yml"], :thread=>"#<Thread:0x3c454a85 run>"}
[2020-04-23T04:47:47,889][INFO ][logstash.inputs.googlecloudstorage][main] ProcessedDb created in: /tmp/logstash_process_db
[2020-04-23T04:47:47,907][INFO ][logstash.inputs.googlecloudstorage][main] Turn on debugging to explain why blobs are filtered.
[2020-04-23T04:47:47,917][INFO ][logstash.javapipeline    ][main] Pipeline started {"pipeline.id"=>"main"}
[2020-04-23T04:47:47,972][INFO ][logstash.inputs.googlecloudstorage][main] Fetching blobs from middleware-bucket
[2020-04-23T04:47:48,031][INFO ][logstash.agent           ] Pipelines running {:count=>1, :running_pipelines=>[:main], :non_running_pipelines=>[]}
[2020-04-23T04:47:48,524][INFO ][logstash.agent           ] Successfully started Logstash API endpoint {:port=>9600}
[2020-04-23T04:47:49,351][INFO ][logstash.inputs.googlecloudstorage][main] Found matching blob gs://middleware-bucket/JAMA/jamanetwork.com.jsonl
[2020-04-23T04:47:49,368][INFO ][logstash.inputs.googlecloudstorage][main] Downloading blob gs://middleware-bucket/JAMA/jamanetwork.com.jsonl
[2020-04-23T04:47:49,702][INFO ][logstash.inputs.googlecloudstorage][main] Reading events from gs://middleware-bucket/JAMA/jamanetwork.com.jsonl (temp file: /tmp/ls-in-gcs/b7e0efaf-6d0d-4f8c-b85b-fefd59669909)
[2020-04-23T04:47:50,475][ERROR][logstash.javapipeline    ][main] A plugin had an unrecoverable error. Will restart this plugin.
  Pipeline_id:main
  Plugin: <LogStash::Inputs::GoogleCloudStorage bucket_id=>"middleware-bucket", json_key_file=>"/content/adc.json", codec=><LogStash::Codecs::JSONLines id=>"json_lines_d7b64726-3ce2-4daf-9389-df25cca32231", enable_metric=>true, charset=>"UTF-8", delimiter=>"\n">, processed_db_path=>"/tmp/logstash_process_db", id=>"bc2541a6ffe2b74329d46fc03ac8f28e0016e053f78722ea6505d014c5c514c3", file_matches=>"JAMA/jamanetwork.com.jsonl", enable_metric=>true, interval=>60, file_exclude=>"^$", metadata_key=>"x-goog-meta-ls-gcs-input", delete=>false, unpack_gzip=>true, temp_directory=>"/tmp/ls-in-gcs">
  Error: Error listing bucket contents: xxxx@gmail.com does not have storage.objects.update access to middleware-bucket/JAMA/jamanetwork.com.jsonl.
  Exception: RuntimeError
  Stack: /content/logstash-7.6.2/vendor/bundle/jruby/2.5.0/gems/logstash-input-google_cloud_storage-0.11.1-java/lib/logstash/inputs/cloud_storage/client.rb:26:in `list_blobs'
/content/logstash-7.6.2/vendor/bundle/jruby/2.5.0/gems/logstash-input-google_cloud_storage-0.11.1-java/lib/logstash/inputs/google_cloud_storage.rb:85:in `list_processable_blobs'
/content/logstash-7.6.2/vendor/bundle/jruby/2.5.0/gems/logstash-input-google_cloud_storage-0.11.1-java/lib/logstash/inputs/google_cloud_storage.rb:68:in `list_download_process'
/content/logstash-7.6.2/vendor/bundle/jruby/2.5.0/gems/logstash-input-google_cloud_storage-0.11.1-java/lib/logstash/inputs/google_cloud_storage.rb:61:in `block in run'
/content/logstash-7.6.2/vendor/bundle/jruby/2.5.0/gems/stud-0.0.23/lib/stud/interval.rb:20:in `interval'
/content/logstash-7.6.2/vendor/bundle/jruby/2.5.0/gems/logstash-input-google_cloud_storage-0.11.1-java/lib/logstash/inputs/google_cloud_storage.rb:60:in `run'
/content/logstash-7.6.2/logstash-core/lib/logstash/java_pipeline.rb:328:in `inputworker'
/content/logstash-7.6.2/logstash-core/lib/logstash/java_pipeline.rb:320:in `block in start_input'
[2020-04-23T04:47:51,502][INFO ][logstash.inputs.googlecloudstorage][main] Fetching blobs from middleware-bucket
[2020-04-23T04:47:52,037][INFO ][logstash.inputs.googlecloudstorage][main] Found matching blob gs://middleware-bucket/JAMA/jamanetwork.com.jsonl
[2020-04-23T04:47:52,044][INFO ][logstash.inputs.googlecloudstorage][main] Downloading blob gs://middleware-bucket/JAMA/jamanetwork.com.jsonl
[2020-04-23T04:47:52,194][INFO ][logstash.inputs.googlecloudstorage][main] Reading events from gs://middleware-bucket/JAMA/jamanetwork.com.jsonl (temp file: /tmp/ls-in-gcs/b2b3f950-7d32-475d-a2c6-a1803ee726c7)
[2020-04-23T04:47:52,541][ERROR][logstash.javapipeline    ][main] A plugin had an unrecoverable error. Will restart this plugin.
  Pipeline_id:main
  Plugin: <LogStash::Inputs::GoogleCloudStorage bucket_id=>"middleware-bucket", json_key_file=>"/content/adc.json", codec=><LogStash::Codecs::JSONLines id=>"json_lines_d7b64726-3ce2-4daf-9389-df25cca32231", enable_metric=>true, charset=>"UTF-8", delimiter=>"\n">, processed_db_path=>"/tmp/logstash_process_db", id=>"bc2541a6ffe2b74329d46fc03ac8f28e0016e053f78722ea6505d014c5c514c3", file_matches=>"JAMA/jamanetwork.com.jsonl", enable_metric=>true, interval=>60, file_exclude=>"^$", metadata_key=>"x-goog-meta-ls-gcs-input", delete=>false, unpack_gzip=>true, temp_directory=>"/tmp/ls-in-gcs">
  Error: Error listing bucket contents: xxxx@gmail.com does not have storage.objects.update access to middleware-bucket/JAMA/jamanetwork.com.jsonl.
  Exception: RuntimeError
  Stack: /content/logstash-7.6.2/vendor/bundle/jruby/2.5.0/gems/logstash-input-google_cloud_storage-0.11.1-java/lib/logstash/inputs/cloud_storage/client.rb:26:in `list_blobs'

This pattern continues. Did we miss something in the configuration?
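
One plausible reading of the numbers above: in the log, every restart downloads the blob and reads events from it again before the update error is raised, so each pass may push the whole file to the output once more. A back-of-the-envelope sketch, assuming each pass emits all ~800 objects and nothing is deduplicated downstream (estimate_passes is a made-up helper name, and both figures are the approximate ones quoted in this issue):

```shell
#!/bin/sh
# Hypothetical helper: how many full passes over the file would
# account for the number of indexed documents (integer division).
estimate_passes() {
  echo $(( $1 / $2 ))
}

# ~17k documents in App Search, ~800 JSON objects in the bucket file.
estimate_passes 17000 800   # prints 21
```

That would correspond to roughly 21 restarts, which is in line with the plugin restarting every few seconds as shown in the log.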

ivankozlovcodes commented 4 years ago

Workaround

Logstash sets the metadata header x-goog-meta-ls-gcs-input on files. The workaround is to remove it manually before launching Logstash.

gsutil -q setmeta -h "x-goog-meta-ls-gcs-input" <gs_path>

where <gs_path> is your bucket path.
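
If several files are affected, the same command can be wrapped in a small loop. This is only a sketch: clear_marker and DRY_RUN are names made up for this example, not part of gsutil or the plugin. The header name is the metadata_key value shown in the plugin dump in the log. Since setmeta changes object metadata, it presumably has to be run by an account that is allowed to update the object.

```shell
#!/bin/sh
# Header the plugin uses as its marker (metadata_key in the log above).
MARKER="x-goog-meta-ls-gcs-input"

# Hypothetical helper: remove the marker header from each given gs:// path.
# With DRY_RUN=1 it only prints the gsutil command instead of running it.
clear_marker() {
  for gs_path in "$@"; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "gsutil -q setmeta -h \"$MARKER\" $gs_path"
    else
      # A header name without a value tells setmeta to remove that header.
      gsutil -q setmeta -h "$MARKER" "$gs_path"
    fi
  done
}

# Preview the command for the blob from this issue without touching it.
DRY_RUN=1 clear_marker "gs://middleware-bucket/JAMA/jamanetwork.com.jsonl"
```

Dropping DRY_RUN=1 runs the real command; previewing first makes it easier to check the paths before any metadata is changed.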