indera-shsp opened this issue 5 years ago
Hi @indera-shsp, the plugin should be listing the bucket every `interval` seconds and filtering objects by name before attempting to download their contents. That could show up as several GET requests to the https://www.googleapis.com/storage/v1/b/bucket/o endpoint.
If you see Logstash downloading the contents of every file every 60 seconds, that's probably a bug. The plugin should keep a local cache of which objects it has already processed, or mark them with a label in GCS. The label (by default `x-goog-meta-ls-gcs-input`) is preferred because it's guaranteed to persist across multiple Logstash workers and sessions.
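The label-based deduplication described above can be sketched in plain Ruby. This is a hypothetical illustration, not the plugin's actual code: objects are represented as hashes, and the label name is shortened from the real `x-goog-meta-ls-gcs-input` metadata key.

```ruby
# Hypothetical sketch of label-based deduplication (not the plugin's
# actual implementation). In GCS the label would live in the object's
# metadata under the x-goog-meta-ls-gcs-input key.
PROCESSED_LABEL = "ls-gcs-input"

# Keep only objects that have not yet been marked as processed.
def unprocessed(objects)
  objects.reject { |obj| obj[:metadata].key?(PROCESSED_LABEL) }
end

# Mark an object so later list cycles skip it.
def mark_processed!(obj)
  obj[:metadata][PROCESSED_LABEL] = "processed"
end

objects = [
  { name: "a.json", metadata: {} },
  { name: "b.json", metadata: { PROCESSED_LABEL => "processed" } },
]

todo = unprocessed(objects)      # only a.json remains to be fetched
todo.each { |obj| mark_processed!(obj) }
```

Because the label is stored on the object itself rather than in a local cache, any Logstash worker that lists the bucket sees the same processed state.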
Could you expound on your use case a little bit more (average size of file, average size of name, count of objects per bucket)?
The bucket contains about 20 small JSON files and we check every 30 seconds:
```
input {
  google_cloud_storage {
    interval => 30
    bucket_id => "${GOOGLE_BUCKET_NAME}"
    json_key_file => "/sd/creds/gcp_service_account.json"
    file_matches => "transcriptions/.*json"
    prefix => "implementHere"
    codec => "json"
    metadata_key => "x-goog-meta-logstash-blocks"
  }
}
```
The list function here can optionally accept an options object containing a property named `prefix`: https://github.com/googleapis/google-cloud-java/blob/master/google-cloud-clients/google-cloud-storage/src/main/java/com/google/cloud/storage/Storage.java#L973. I suspect that if Logstash allowed providing a `prefix` parameter (similar to how one provides `file_matches`), Google could pre-filter large object lists server-side, which may be cheaper computationally and use less network bandwidth each list cycle.
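The difference between the two filtering points can be sketched in plain Ruby. The object names below are illustrative; in the real plugin the "server-side" selection would happen inside GCS via the listing call's prefix parameter, so the pruned names never cross the network.

```ruby
# Illustrative sketch: where filtering happens matters for bandwidth.
objects = %w[
  transcriptions/a.json
  transcriptions/b.json
  audio/raw-1.wav
  audio/raw-2.wav
]

# Today: the full listing is downloaded, then filtered client-side
# with the file_matches regex.
file_matches = Regexp.new("transcriptions/.*json")
client_side  = objects.select { |name| name =~ file_matches }

# With a prefix option: the server would return only matching names,
# so the listing that crosses the network is already pruned. Here we
# simulate that selection locally.
file_prefix = "transcriptions/"
server_side = objects.select { |name| name.start_with?(file_prefix) }
```

Both selections yield the same two JSON objects, but with a prefix the `audio/` entries (and their metadata) would never be transferred in the first place.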
We have a bucket with 28716 objects (and growing). Retrieving the list of these objects plus their metadata produces a 38MB response. Since our Logstash `interval` is set to 30s, that works out to approximately 36GB every 8h. This is a significant increase in our ingress bandwidth usage; an option to provide a pre-filter pattern would eliminate much of this overhead.
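As a back-of-the-envelope check of those numbers (all figures taken from the comment above):

```ruby
# Listing bandwidth over an 8 hour window, from the figures above.
list_size_mb = 38.0          # one listing of ~28716 objects
interval_s   = 30            # Logstash interval setting
window_s     = 8 * 60 * 60   # 8 hours in seconds

lists    = window_s / interval_s           # list cycles in the window
total_gb = lists * list_size_mb / 1024.0   # total listing traffic
```

That is 960 list cycles and roughly 35.6GB of listing traffic per 8 hours, matching the ~36GB estimate.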
From reading the Ruby code, it looks like object filtering is done after the code downloads the list of objects and their metadata, which is already too late to save bandwidth in our case.
It sounds like even with a prefix, we might end up back here soon if the data is going to continue growing.
It seems like the pipeline is trying to index something that's near-real time. Would one of the following approaches help?
If those are too much, I'm happy to look at just adding the prefix for now if you're willing to test it so we can get a Logstash maintainer to approve the PR (they like to see at least one real user testing it before approving a merge/release).
@josephlewis42 Thank you for the prompt responses. We will evaluate the options you mentioned, but adding support for the prefix
can benefit other users too :)
@josephlewis42 if there is a PR enabling the use of a prefix pre-filter, we will happily test it
@josephlewis42 how difficult would it be to fix this issue for somebody not familiar with the code base?
@indera-shsp I'm taking a stab at it right now. The codebase is a bit hairy because it's Java mixed with Ruby. Our hope was full Java because then type checks and the like are easy but I think those plans have been stalled upstream.
@indera-shsp or @tmegow I built a version with the fix and have it published here: https://storage.googleapis.com/logstash-prereleases/logstash-input-google_cloud_storage-0.12.0-java.gem for testing. If things look good, would you mind leaving your remarks in #7 ?
Here are the docs for the new field:
[id="plugins-{type}s-{plugin}-file_prefix"]
===== `file_prefix`
added[0.12.0]
* Value type is <<string,string>>
* Default is: ``
A prefix filter applied server-side. Only files starting with this prefix will
be fetched from Cloud Storage. This can be useful if all the files you want to
process are in a particular folder and you want to reduce network traffic.
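With the new option, the configuration from earlier in this thread could look something like the following (an untested sketch; the `transcriptions/` prefix mirrors the existing `file_matches` pattern):

```
input {
  google_cloud_storage {
    interval => 30
    bucket_id => "${GOOGLE_BUCKET_NAME}"
    json_key_file => "/sd/creds/gcp_service_account.json"
    file_prefix => "transcriptions/"
    file_matches => "transcriptions/.*json"
    codec => "json"
  }
}
```

`file_prefix` prunes the listing server-side, while `file_matches` still applies the finer-grained regex to whatever the pruned listing returns.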
We use `file_matches` as described here https://www.elastic.co/guide/en/logstash/current/plugins-inputs-google_cloud_storage.html to determine which files need processing. We are observing excessive traffic initiated by Logstash: is it downloading ALL the files from the bucket every 60 seconds? I expected it to be smart and only download file names, which should not add up to hundreds of megabytes every hour.