logstash-plugins / logstash-input-google_cloud_storage


file_matches - why so much traffic #4

Open indera-shsp opened 5 years ago

indera-shsp commented 5 years ago

We use file_matches as described here https://www.elastic.co/guide/en/logstash/current/plugins-inputs-google_cloud_storage.html to determine which files need processing.

We are observing excessive traffic initiated by Logstash - is it downloading ALL the files from the bucket every 60 seconds?

I expected it to be smart and only download the file names, which should not add up to hundreds of megabytes every hour.

josephlewis42 commented 5 years ago

Hi @indera-shsp, the plugin should be listing the bucket every `interval` seconds and filtering objects by name before attempting to download their contents.

That could show up as several GET requests to the https://www.googleapis.com/storage/v1/b/bucket/o endpoint.

If you see Logstash downloading the contents of the files every 60 seconds that's probably a bug. The plugin should keep a local cache of which objects it has already processed or mark them with a label in GCS. The label is preferred (by default it's x-goog-meta-ls-gcs-input) because it's guaranteed to persist across multiple Logstash workers and sessions.
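(For illustration, here is a rough sketch of what that label marking looks like with the google-cloud-storage Java client. This is not the plugin's actual code; the bucket/object names are placeholders, and the exact key and value the plugin writes are simplified here.)

    import com.google.cloud.storage.Blob;
    import com.google.cloud.storage.Storage;
    import com.google.cloud.storage.StorageOptions;

    import java.util.Map;

    public class MarkProcessed {
      public static void main(String[] args) {
        Storage storage = StorageOptions.getDefaultInstance().getService();

        // Fetch the object and attach an "already processed" marker as custom
        // metadata (custom metadata is what x-goog-meta-* headers map to).
        Blob blob = storage.get("my-bucket", "logs/app.json");
        blob.toBuilder()
            .setMetadata(Map.of("ls-gcs-input", "processed"))
            .build()
            .update();
      }
    }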

Could you expound on your use case a little bit more (average size of file, average size of name, count of objects per bucket)?

indera-shsp commented 5 years ago

The bucket contains about 20 small JSON files, and we check every 30 seconds.

tmegow commented 5 years ago

    input {
      google_cloud_storage {
        interval => 30
        bucket_id => "${GOOGLE_BUCKET_NAME}"
        json_key_file => "/sd/creds/gcp_service_account.json"
        file_matches => "transcriptions/.*json"          # client-side regex filter
        prefix => "implementHere"                        # proposed option, see below
        codec => "json"
        metadata_key => "x-goog-meta-logstash-blocks"
      }
    }

https://github.com/logstash-plugins/logstash-input-google_cloud_storage/blob/master/lib/logstash/inputs/cloud_storage/client.rb#L22

The list function here can optionally accept options, one of which sets a "prefix": https://github.com/googleapis/google-cloud-java/blob/master/google-cloud-clients/google-cloud-storage/src/main/java/com/google/cloud/storage/Storage.java#L973.

I suspect that allowing a "prefix" parameter (similar to how one provides file_matches) would let Google pre-filter large lists of files server-side, which would be cheaper computationally and use less network bandwidth on each list cycle.
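To illustrate, a minimal sketch of such a server-side prefix listing with the google-cloud-storage Java client (the bucket name and prefix are placeholders):

    import com.google.api.gax.paging.Page;
    import com.google.cloud.storage.Blob;
    import com.google.cloud.storage.Storage;
    import com.google.cloud.storage.StorageOptions;

    public class PrefixListing {
      public static void main(String[] args) {
        Storage storage = StorageOptions.getDefaultInstance().getService();

        // The prefix is applied by GCS itself, so only matching object names
        // (and their metadata) are returned over the wire.
        Page<Blob> blobs =
            storage.list("my-bucket", Storage.BlobListOption.prefix("transcriptions/"));

        for (Blob blob : blobs.iterateAll()) {
          System.out.println(blob.getName());
        }
      }
    }

The file_matches regex would still apply afterwards, just over a much smaller listing.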

tmegow commented 5 years ago

We have a bucket with 28716 objects (and growing). Retrieving the list of these objects plus their metadata yields a 38 MB response. Since our Logstash interval is set to 30s, that works out to approximately 36 GB every 8 hours (8 h / 30 s = 960 list calls, at 38 MB each ≈ 36.5 GB). This is a significant increase in our ingress bandwidth usage; an option to provide a pre-filter pattern would eliminate much of this overhead.

indera-shsp commented 5 years ago

From reading the Ruby code, it looks like objects are filtered only after the full list of objects and their metadata has been downloaded, which is already too late to save bandwidth in our case.

josephlewis42 commented 5 years ago

It sounds like even with a prefix, we might end up back here soon if the data is going to continue growing.

It seems like the pipeline is trying to index something in near real time. Would one of the following approaches help?

  1. Create a second bucket for files pending processing. When a file changes in the first bucket, use a Cloud Function to copy it over to the second (see the sketch after this list), then attach Logstash to the second bucket and have it delete objects once it has finished processing them.
  2. Read file change events off a Pub/Sub queue and only process the interesting ones with something like logstash-input-google_pubsub.
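(For option 1, a minimal sketch of a GCS-triggered Cloud Function using the Java Functions Framework; the GcsEvent payload class and both bucket names are hypothetical placeholders, not part of the plugin.)

    import com.google.cloud.functions.BackgroundFunction;
    import com.google.cloud.functions.Context;
    import com.google.cloud.storage.BlobId;
    import com.google.cloud.storage.Storage;
    import com.google.cloud.storage.StorageOptions;

    public class CopyToPending implements BackgroundFunction<CopyToPending.GcsEvent> {
      private static final Storage storage = StorageOptions.getDefaultInstance().getService();

      @Override
      public void accept(GcsEvent event, Context context) {
        // Copy the changed object into the staging bucket that Logstash watches;
        // getResult() blocks until the copy completes.
        storage.copy(Storage.CopyRequest.of(
            BlobId.of(event.bucket, event.name),
            BlobId.of("pending-processing-bucket", event.name)))
            .getResult();
      }

      // Hypothetical payload class; google.storage.object.finalize events
      // include at least the bucket and object name.
      public static class GcsEvent {
        public String bucket;
        public String name;
      }
    }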

If those are too much, I'm happy to look at just adding the prefix for now if you're willing to test it, so we can get a Logstash maintainer to approve the PR (they like to see at least one real user testing it before approving a merge/release).

indera-shsp commented 5 years ago

@josephlewis42 Thank you for the prompt responses. We will evaluate the options you mentioned, but adding support for the prefix can benefit other users too :)

tmegow commented 5 years ago

@josephlewis42 if there is a PR enabling the use of a prefix pre-filter, we will happily test it

indera-shsp commented 4 years ago

@josephlewis42 how difficult would it be to fix this issue for somebody not familiar with the code base?

josephlewis42 commented 4 years ago

@indera-shsp I'm taking a stab at it right now. The codebase is a bit hairy because it's Java mixed with Ruby. Our hope was to go full Java, because then type checks and the like are easy, but I think those plans have stalled upstream.

josephlewis42 commented 4 years ago

@indera-shsp or @tmegow I built a version with the fix and have it published here: https://storage.googleapis.com/logstash-prereleases/logstash-input-google_cloud_storage-0.12.0-java.gem for testing. If things look good, would you mind leaving your remarks in #7 ?

Here are the docs for the new field:

[id="plugins-{type}s-{plugin}-file_prefix"]
===== `file_prefix`

added[0.12.0]

  * Value type is <<string,string>>
  * Default is: `""`

A prefix filter applied server-side. Only files whose names start with this
prefix will be fetched from Cloud Storage. This can be useful if all the files
you want to process are in a particular folder and you want to reduce network
traffic.