logstash-plugins / logstash-input-s3


Allow multiple logstash instances #188

Closed anas-aso closed 2 years ago

anas-aso commented 4 years ago

What this PR does / why we need it

To allow scaling log processing from a single S3 bucket by running multiple Logstash instances with the S3 input plugin (an open issue since 2015: https://discuss.elastic.co/t/multiple-logstash-docker-containers-sharing-an-s3-input/36077/2)

Special notes for the reviewers

With the current version of this plugin, if you run multiple instances they will all end up processing the same files, because the list of S3 objects is always sorted before processing starts.

This PR allows the user to decide whether to sort the files (by name/s3 object key) before starting the processing (current behavior) or shuffle the list of objects to minimize the possibility of contention between multiple instances.

The main change in this PR is small: https://github.com/logstash-plugins/logstash-input-s3/commit/4dc4e9d45b966454da02127eaeafd910cac3e9b5, which is a kind of optimistic locking. All other changes are for error handling or to make the behavior introduced by this PR configurable.
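To make the idea concrete, here is a rough Ruby sketch of the approach (the method and option names below loosely mirror the plugin's structure and are illustrative, not the actual diff):

def process_files(queue)
  objects = list_new_files
  # Current behavior sorts the keys; shuffling makes it unlikely that two
  # instances start on the same object at the same time.
  objects = @sort_processed_files ? objects.sort : objects.shuffle

  objects.each do |key|
    begin
      # Download, decode and enqueue the object, then delete/back it up.
      process_log(queue, key)
    rescue Aws::S3::Errors::NoSuchKey
      # Another instance already deleted or moved the object: treat it as
      # "someone else won the race" and skip it instead of failing.
      @logger.warn("S3 input: Remote file not available anymore", :remote_key => key)
    end
  end
end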

Freyert commented 4 years ago

How do the multiple logstashes coordinate? Do they mark files as processed in some way so we don't reprocess them?

anas-aso commented 4 years ago

@Freyert They don't coordinate. If you use delete (in the plugin config), processed files won't be available for processing again because they have already been deleted or moved. If you still want to keep processed files, you can use backup_to_bucket.

anas-aso commented 4 years ago

@robbavey @yaauie can you please have a look? I believe this is a fix that will help many people (at least it did for us).

resworld commented 4 years ago

Hey @anas-aso, I am trying to use your approach to run the s3 input plugin on two Logstash instances and process files from a single s3 bucket. So far we couldn't make it work as expected, and this is what we tried:

  1. We tried to set sort_processed_files = false in the additional settings section, but the s3 input plugin failed to run with:

invalid configuration option `:sort_processed_files'

Is there any other way to specify this option?

  2. We changed the sort_processed_files default value to false in the s3.rb file of the s3 input plugin, and that change now seems to take effect. But since then the number of entries in ELK keeps growing: it is not just one duplicate, a single new object in the s3 bucket gets written to ELK more than 10 times (and the count keeps growing).

  3. We tried to use delete => true, but it does not change much. When it is enabled I can see this message in the Logstash logs:

[2020-04-28T08:18:17,522][WARN ][logstash.inputs.s3 ] S3 input: Remote file not available anymore {:remote_key=>"s3-bucket/element_name_XXXX"}

But the object's value is still duplicated in ELK (more than twice when sort_processed_files is false) and I can see the object is not deleted from the s3 bucket.

Can you help me figure out what I have missed?

anas-aso commented 4 years ago

@resworld

We tried to set up sort_processed_files = false in additional settings section, but s3 input plugin failed to run with: invalid configuration option `:sort_processed_files' Is there any other way to specify this option?

That's because this change is not merged ... it seems the maintainers are not interested in it. But you can build it on your own from my fork: https://github.com/anas-aso/logstash-input-s3
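If you want to try it, the usual plugin workflow is roughly the following (the paths and the gem filename/version are placeholders): clone the fork, build the gem, and install it into your Logstash installation.

git clone https://github.com/anas-aso/logstash-input-s3
cd logstash-input-s3
gem build logstash-input-s3.gemspec
# then, from your Logstash installation directory:
bin/logstash-plugin install --no-verify /path/to/logstash-input-s3-<version>.gem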

Can you help me what I have missed?

Here is my running config. I did see issues similar to what you mentioned (see the note about delete below).

input {
  s3 {
    # Your AWS credentials & region here
    "bucket" => "LB_Logs_Bucket_Name"
    "prefix" => "If_You_Have_A_Configured_Prefix/"
    "delete" => true
    "backup_to_bucket" => "You_Can_Use_The_Same_Input_Bucket"
    "backup_add_prefix" => "But_With_A_Different_Prefix_Than_The_Input_One/"
    "interval" => 60
    "codec" => plain
    "sort_processed_files" => false
  }
  beats {
    port => 5044
  }
}

I agree, the PR description doesn't mention that you have to set delete => true for this change to work properly. Otherwise, the multiple Logstash instances will keep processing the same document(s) over and over. I was expecting some interaction from the plugin maintainers ... and maybe to adjust the doc as well 🤷

resworld commented 4 years ago

Thank you @anas-aso, the problem was that I didn't specify a backup prefix, and the permissions on the s3 bucket only allowed listing and reading objects, while we also need access to move them. Now the plugin works fine with two instances of Logstash.

kaisecheng commented 3 years ago

Thank you for submitting the PR. I think this is a simple way to share the workload among instances if duplication is not a concern. However, keeping duplication to a minimum is important for the plugin. I would suggest using prefix to spread the load across instances, or enhancing prefix to allow a regex to filter the target files, so each instance works on a different set of files.
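For illustration, one way to read that suggestion (the bucket name and prefixes here are hypothetical) is to point each Logstash instance at its own prefix so the instances never list the same keys:

# Instance A
input {
  s3 {
    bucket => "my-logs-bucket"
    prefix => "logs/partition-a/"
  }
}

# Instance B
input {
  s3 {
    bucket => "my-logs-bucket"
    prefix => "logs/partition-b/"
  }
}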

anas-aso commented 3 years ago

I would suggest using prefix to spread the load across instances, or enhancing prefix to allow a regex to filter the target files, so each instance works on a different set of files.

@kaisecheng can you elaborate, please? I am not sure how prefix can help spread the load.

anas-aso commented 2 years ago

I don't need this anymore.