Closed: anas-aso closed this pull request 2 years ago
How do the multiple logstashes coordinate? Do they mark files as processed in some way so we don't reprocess them?
@Freyert They don't coordinate. If you use delete (one of the plugin's config options), processed files won't be picked up again because they are removed from the bucket once processed. If you still want to keep processed files, you can use backup_to_bucket.
@robbavey @yaauie can you please have a look? I believe this is a fix that will help many people (at least it did for us).
Hey @anas-aso, I am trying to use your approach to run the s3 input plugin on two logstash instances and process files from a single s3 bucket. So far we couldn't make it work as we expected, and this is what we tried:
We set sort_processed_files = false in the additional settings section, but the s3 input plugin failed to run with: invalid configuration option `:sort_processed_files'
Is there any other way to specify this option?
We then changed the default value of sort_processed_files to false in the s3.rb file of the s3 input plugin, and that change took effect. But since then the number of entries in ELK keeps growing: a single new object in the s3 bucket is not just duplicated, it is written to ELK more than 10 times (and the count continues to grow).
We tried to use delete => true, but it didn't change much. When it is enabled I can see this message in the logstash logs:
[2020-04-28T08:18:17,522][WARN ][logstash.inputs.s3 ] S3 input: Remote file not available anymore {:remote_key=>"s3-bucket/element_name_XXXX"}
But the element's value is still duplicated in ELK (more than two times if sort_processed_files is false), and I can see the element isn't deleted from the s3 bucket.
Can you help me figure out what I have missed?
@resworld
We tried to set up sort_processed_files = false in the additional settings section, but the s3 input plugin failed to run with: invalid configuration option `:sort_processed_files' Is there any other way to specify this option?
That's because this change was never merged; it seems the maintainers are not interested in it. But you can build the plugin yourself from my fork: https://github.com/anas-aso/logstash-input-s3
Can you help me figure out what I have missed?
Here is my running config and I did see issues similar to what you mentioned.
input {
  s3 {
    [...]
    # Your AWS credentials & region here
    [...]
    "bucket" => "LB_Logs_Bucket_Name"
    "prefix" => "If_You_Have_A_Configured_Prefix/"
    "delete" => true
    "backup_to_bucket" => "You_Can_Use_The_Same_Input_Bucket"
    "backup_add_prefix" => "But_With_A_Different_Prefix_Than_The_Input_One/"
    "interval" => 60
    "codec" => plain
    "sort_processed_files" => false
  }
  beats {
    port => 5044
  }
}
I agree, the PR description doesn't mention that you have to set delete => true for this change to work properly. Otherwise, the multiple logstash instances will keep processing the same document(s) over and over.
I was expecting some interaction from the plugin maintainers ... and maybe adjust the doc as well 🤷
Thank you @anas-aso, the problem was that I didn't specify a backup prefix, and the permissions on the s3 bucket only allowed listing and reading objects, while we also need permission to move them. Now the plugin works fine with two instances of logstash.
Thank you for submitting the PR. I think this is a simple way to share the workload among instances if duplication is not a concern. However, keeping duplication minimal is important for the plugin. I would suggest using prefix to spread the load across instances, or enhancing prefix to allow a regex to filter the target files, so that each instance works on a different set of files.
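To illustrate the prefix idea, a sketch of a two-instance split (the bucket name, prefixes, and split below are all made up; this assumes your objects are already grouped under distinct key prefixes):

```
# Instance 1: only lists objects under one hypothetical prefix
input {
  s3 {
    "bucket" => "my-logs-bucket"
    "prefix" => "alb/az-a/"
  }
}

# Instance 2: uses a disjoint prefix, so the two instances never list the same keys
input {
  s3 {
    "bucket" => "my-logs-bucket"
    "prefix" => "alb/az-b/"
  }
}
```

The limitation, of course, is that the split is only as balanced as the key layout of the bucket.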
I would suggest using prefix to spread the load across instances, or enhancing prefix to allow a regex to filter the target files, so that each instance works on a different set of files.
@kaisecheng can you elaborate please? I am not sure how prefix can help to spread the load.
I don't need this anymore.
What this PR does / why we need it
To allow scaling log processing from a single S3 bucket by running multiple Logstash instances with the S3 input plugin (an open issue since 2015: https://discuss.elastic.co/t/multiple-logstash-docker-containers-sharing-an-s3-input/36077/2)
Special notes for the reviewers
Using the current version of this plugin, if you run multiple instances they will all end up processing the same files, because the list of S3 objects is always sorted before processing starts.
This PR lets the user decide whether to sort the files (by name / S3 object key) before processing starts (the current behavior) or to shuffle the list of objects, which minimizes the chance of contention between multiple instances.
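As a rough sketch of the difference (illustrative Ruby, not the plugin's actual code; the key names are made up):

```ruby
# With sorting (current behavior), every instance that lists the bucket
# ends up with the same ordering, so they all pick the same first object.
keys = ["logs/003", "logs/001", "logs/004", "logs/002"]

sorted = keys.sort
puts "every instance starts with: #{sorted.first}"   # always "logs/001"

# With sort_processed_files => false, each instance shuffles its own copy
# of the listing, so two instances usually start from different objects.
shuffled = keys.shuffle
puts "this instance starts with:  #{shuffled.first}" # varies per run
```

Shuffling only makes collisions unlikely, not impossible, which is why the optimistic-locking part of the change still matters.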
The main change in this PR is small: https://github.com/logstash-plugins/logstash-input-s3/commit/4dc4e9d45b966454da02127eaeafd910cac3e9b5, which implements a kind of optimistic locking. All other changes are for error handling or to make the behavior introduced by this PR configurable.
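The "kind of optimistic locking" can be sketched as follows (illustrative Ruby with an in-memory Hash standing in for the bucket; claim_and_process and the key names are made up, not the commit's actual code):

```ruby
# Each instance walks its shuffled key list and tries to claim each object.
# If the object is already gone, another instance won the race: skip it
# quietly instead of failing, which is roughly what the error handling
# in this PR amounts to.
def claim_and_process(keys, store, processed)
  keys.shuffle.each do |key|
    content = store.delete(key)  # nil if another instance already claimed it
    next if content.nil?         # lost the race, move on to the next key
    processed << content         # stand-in for the real processing work
  end
end

bucket = { "logs/001" => "line a", "logs/002" => "line b" }
seen   = []
claim_and_process(bucket.keys, bucket, seen)  # first "instance" claims both
claim_and_process(bucket.keys, bucket, seen)  # a late "instance" finds nothing left
```

There is no real lock: deleting the object on claim is the only coordination, which is why the change only works with delete => true (optionally combined with backup_to_bucket).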