janmg / logstash-input-azure_blob_storage

This is a plugin for Logstash to fetch files from Azure Storage Accounts

Exclude prefix or only get latest files? #4

Closed blacs30 closed 3 years ago

blacs30 commented 4 years ago

I'm reading logs from AKS. The container already contains logs from previous months and from multiple different clusters. The plugin seems to read the earliest available logs first. However, I'd like to start with the latest logs and either ignore everything older than x days or exclude a prefix.

For testing purposes I've now added a prefix, but I don't want to change it each month, and I don't want to create a separate input per cluster either. In my case the prefix is resourceId=/SUBSCRIPTIONS/12345/RESOURCEGROUPS/RGAKSXYZ/PROVIDERS/MICROSOFT.CONTAINERSERVICE/MANAGEDCLUSTERS/CLUSTER1/y=2020/m=05

Is there already a solution to this which I've overlooked?

janmg commented 4 years ago

I admit, it's not clear at all. At startup the plugin lists all files and compares them with the registry of files that have already been processed. The default is to resume where the registry left off, so if you shut down Logstash for one hour, it would only process the data from that hour.

registry_create_policy, :validate => ['resume','start_over','start_fresh']

As an alternative, you can set "registry_create_policy" to "start_over", which processes all the files again and recreates the registry. Or, what you actually want, "start_fresh", which treats all the files that already exist as processed and skips them.

The registry is a file on the storage account that contains, for each file, the file path, the number of bytes that have been processed and the file size in bytes. With "start_fresh", the registry_create_policy sets the processed bytes to the file size, effectively skipping the file unless it grows.

You can set the value initially to start_fresh, run the pipeline so that the registry is created and then change the config to resume, or remove the config because resume is the default.
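
To make the three policies concrete, here is a minimal sketch in plain Ruby of how a registry of processed bytes could be initialised under each one. It is not the plugin's actual code: `build_registry`, `pending_bytes` and the hash layout are hypothetical names chosen for illustration, and the handling of files missing from the saved registry under "resume" (start at byte 0) is an assumption.

```ruby
# Hypothetical model, not taken from the plugin's source.
# blobs:          { path => current size in bytes }
# saved_registry: { path => { offset: processed bytes, length: size } }
def build_registry(blobs, saved_registry, policy)
  case policy
  when 'resume'
    # Keep saved offsets; files unknown to the registry start at 0 (assumption).
    blobs.to_h do |path, size|
      offset = saved_registry.dig(path, :offset) || 0
      [path, { offset: offset, length: size }]
    end
  when 'start_over'
    # Forget all progress: every file is read again from the beginning.
    blobs.to_h { |path, size| [path, { offset: 0, length: size }] }
  when 'start_fresh'
    # Mark every existing file as fully processed, so only growth is read.
    blobs.to_h { |path, size| [path, { offset: size, length: size }] }
  else
    raise ArgumentError, "unknown policy: #{policy}"
  end
end

# Bytes still to be read per file, leaving out files with nothing pending.
def pending_bytes(registry)
  registry.transform_values { |e| e[:length] - e[:offset] }
          .select { |_path, bytes| bytes.positive? }
end

blobs = { 'm=04/PT1H.json' => 500, 'm=05/PT1H.json' => 200 }
saved = { 'm=04/PT1H.json' => { offset: 500, length: 500 } }

fresh = build_registry(blobs, saved, 'start_fresh')
p pending_bytes(fresh)

# Later the May file grows; resuming from the fresh registry reads only the new bytes.
grown = blobs.merge('m=05/PT1H.json' => 260)
p pending_bytes(build_registry(grown, fresh, 'resume'))
```

In this model "start_fresh" leaves nothing pending until a file grows, which matches the two-step approach of running once with start_fresh and then switching back to resume.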

janmg commented 4 years ago

You could also use the "prefix" setting to exclude directories, but that is a more static approach that would not scale well over time.

blacs30 commented 4 years ago

Thank you for the explanation. It's working as described, using "start_fresh" as policy now. You can close this issue from my POV.

janmg commented 3 years ago

Added an additional path_filters option to decide which files to process, but registry_create_policy should still be used to control restart behaviour. The registry can now also be kept locally.
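
As a rough illustration of filtering by path, the sketch below selects file paths with glob patterns using Ruby's standard `File.fnmatch?`. It assumes path_filters takes a list of glob-style patterns, which this thread does not confirm. `filter_paths` and the shortened example paths are hypothetical.

```ruby
# Hypothetical helper: keep only the paths that match at least one glob pattern.
# File::FNM_PATHNAME makes '*' stop at '/' and lets '**/' span directories.
def filter_paths(paths, patterns)
  paths.select do |path|
    patterns.any? { |pattern| File.fnmatch?(pattern, path, File::FNM_PATHNAME) }
  end
end

paths = [
  'CLUSTER1/y=2020/m=05/PT1H.json',
  'CLUSTER2/y=2020/m=05/PT1H.json',
  'CLUSTER1/y=2020/m=05/notes.txt'
]

# Only JSON files below CLUSTER1 are kept.
p filter_paths(paths, ['CLUSTER1/**/*.json'])
```

A filter like this only decides which files are considered at all. Whether already existing files are read from the beginning is still a matter for registry_create_policy.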