janmg / logstash-input-azure_blob_storage

This is a plugin for Logstash to fetch files from Azure Storage Accounts

Can multiple logstash instances with plugin logstash-input-azure_blob_storage installed to read the same container read without duplicate processing? #15

Open anshuca0743 opened 3 years ago

anshuca0743 commented 3 years ago

I have a requirement to have multiple Logstash instances reading from the same Azure storage account and the same container. The container holds activity logs. I am running two Logstash instances, and when I check the output of both, I find the same activity logs in each. I don't want duplicate logs. Does this plugin avoid duplicate processing, or is there a specific configuration needed to achieve this?

janmg commented 3 years ago

The plugin does not prevent multiple readers from reading the same data; there is no synchronization between instances and no locking. Two instances will therefore download the same dataset unless you restrict each of them to different directories or files.
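One way to partition the work is to give each instance its own `prefix` and its own `registry_path`, so they never look at the same blobs and never overwrite each other's registry. A minimal sketch, assuming the `prefix` and `registry_path` options from the plugin README; the account, container, prefixes and paths below are placeholders for your own setup:

```
# Instance 1: reads only blobs whose names start with "instance-a/"
input {
    azure_blob_storage {
        storageaccount => "mystorageaccount"
        access_key     => "<access key>"
        container      => "insights-activity-logs"
        prefix         => "instance-a/"
        registry_path  => "data/registry-instance-a.dat"
    }
}

# Instance 2: reads only blobs whose names start with "instance-b/"
input {
    azure_blob_storage {
        storageaccount => "mystorageaccount"
        access_key     => "<access key>"
        container      => "insights-activity-logs"
        prefix         => "instance-b/"
        registry_path  => "data/registry-instance-b.dat"
    }
}
```

The trade-off is that you have to make sure the prefixes (or path filters) together cover everything you want to read, because nothing checks that the partitioning is complete or non-overlapping.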

For this to be implemented you would need pipeline-to-pipeline communication, and the problem is that you can't reliably detect that there are two instances: they may not be able to reach each other, and writing a sync file in the storage account is also not foolproof. https://www.elastic.co/guide/en/logstash/current/pipeline-to-pipeline.html
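For reference, pipeline-to-pipeline communication looks roughly like the sketch below (the virtual address name is a placeholder). Note that, per the linked documentation, it only connects pipelines running inside the same Logstash process, which is why it does not by itself coordinate two separate Logstash instances:

```
# Sending pipeline: downloads blobs and hands events to a local worker pipeline
output {
    pipeline { send_to => ["blob-events"] }
}

# Receiving pipeline: picks up events on the virtual address "blob-events"
input {
    pipeline { address => "blob-events" }
}
```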

If I had an infinite amount of time, I would implement an optional configuration that defines a Logstash reader cluster: one instance is configured as the master that updates the registry, the other instances check whether the master is still updating it, and between them they share a work queue in which each Logstash instance downloads (a part of) a file.

But this would require restructuring the code base, and the benefit doesn't seem worth the effort.