janmg / logstash-input-azure_blob_storage

This is a plugin for Logstash to fetch files from Azure Storage Accounts

Unable to configure multiple paths in logstash-input-azure_blob_storage-0.11.1 #2

Closed pinochioze closed 3 years ago

pinochioze commented 4 years ago

Dear everyone, I am trying to configure multiple paths with "prefix", treating it as a list type as below: prefix => [ "path/to/first/", "path/to/second/" ]

I also tried changing "prefix" in the code as below, but it still does not work: config :prefix, :validate => :array, :default => [], :required => false

I am just very new to Ruby, so I can't do much more myself. Your plugin is amazing; I have tried and debugged many times with the logstash-input-azureblob plugin on the huge log files in Application Insights. Is there a problem if we use both logstash-input-azure_blob_storage and logstash-input-azureblob to access a storage account at the same time? Would you please fix this problem? Thank you very much.

pinochioze commented 4 years ago

Just to inform you that "prefix" does not work with paths that contain an extension or wildcard pattern, like below: prefix => "path/to/first/*.log" or prefix => "path/to/*/*.log"

janmg commented 4 years ago

I'll have a look at how to implement glob filtering, so that list_blobs can skip directories or files. I could use such a filter myself.

At the moment the prefix is nothing more than a feature of azure-storage-ruby's list_blobs ... I don't fully understand their description "Filters the results to return only containers whose name begins with the specified prefix". I thought it would use the prefix as a starting directory, so something like "path/to/" to look into the subdirectories (not containers?). I guess I'll look into adding an optional config parameter to exclude/include on patterns, when I have some time.

https://github.com/Azure/azure-storage-ruby/blob/master/blob/lib/azure/storage/blob/blob_service.rb#L199
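For reference, a minimal sketch of how the prefix behaves in azure-storage-ruby's list_blobs, as I understand it: it is a plain "starts with" match on the blob name, not a glob or regex. The account name, key and container below are placeholders.

```ruby
require 'azure/storage/blob'

# Placeholder connection details, not real credentials
client = Azure::Storage::Blob::BlobService.create(
  storage_account_name: 'myaccount',      # placeholder
  storage_access_key:   ENV['AZURE_KEY']  # placeholder
)

# Returns only blobs whose names start with "path/to/"; the prefix is a
# plain string match on the blob name, not a glob or regex.
blobs = client.list_blobs('my-container', prefix: 'path/to/')
blobs.each { |blob| puts blob.name }
```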

pinochioze commented 4 years ago

Thanks Jan for your reply. So far, your plugin is working like a charm with huge data in a container. How can I distinguish blobs with different paths in the logstash filter, since there is no field showing the path in your plugin? Currently I filter the data based on some content of their fields.

I just wonder why the registry.dat has to be stored in Azure blob storage; why don't we store it on the local logstash server like other plugins do, under /usr/share/logstash/data/....

janmg commented 4 years ago

For the original issue on filtering, I have copied the path_filters feature from the azure diagnostics blob storage plugin. I used the diagnostics version as inspiration for my plugin after I figured out that I could not just make a patch to fix my issues. My Ruby experience started with the storage-blobs-ruby-quickstart, to help fetch partial blob blocks. One reason for the higher performance of my version has to do with the listing of blobs in relation to the interval. The registry is written in a separate fire-and-forget thread. And for 0.11.0, EyeFitU fixed the closing of the sessions to reduce the memory footprint.

In the latest commit, I added path_filters, with a default matcher of a single glob that goes through all subdirectories. Examples can be found in the rubydoc: https://ruby-doc.org/core-2.6.2/File.html#method-c-fnmatch
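To illustrate (my own example, not taken from the plugin; path_filters presumably applies these File.fnmatch semantics to the blob names):

```ruby
# With File::FNM_PATHNAME a single '*' stops at '/', so '**/' is needed to
# descend into subdirectories recursively.
flags = File::FNM_PATHNAME | File::FNM_EXTGLOB

File.fnmatch('**/*',             'path/to/first/app.log', flags)  # => true
File.fnmatch('**/*.log',         'path/to/first/app.log', flags)  # => true
File.fnmatch('path/to/*',        'path/to/first/app.log', flags)  # => false (one level only)
File.fnmatch('path/to/**/*.log', 'path/to/first/app.log', flags)  # => true
```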

If you have some time to build the plugin and test the filter, I'd like to hear your feedback. Otherwise I'll do it myself when I have time again. I have only tested that the fix doesn't break my pipelines.

In the latest commit I've temporarily enabled 3 log lines to display the files when 1. fetching the file list, 2. comparing it to the registry and 3. processing. I'll change those loggers to debug before I release 0.11.2, as they overflow my logs in normal operation ... but for now I haven't filtered the default debug information from logstash.

@logger.info("1: list_blobs #{blob.name} #{offset} #{length}") @logger.info("2: adding offsets: #{name} #{off} #{file[:length]}") @logger.info("3: processing #{name} from #{file[:offset]} to #{file[:length]}") @logger.info(@pipe_id+" partial file #{name} from #{file[:offset]} to #{file[:length]}")

When I have some time I'll look into your other feature requests. I appreciate you thinking about improvements; they make sense. A local registry wasn't used by the original azure diagnostics plugin because it was intended to keep two logstash instances reading at the same time in sync. With hindsight I consider that a bad idea due to file concurrency issues, but I haven't switched or made it optional yet.

janmg commented 4 years ago

path_filters is introduced in 0.11.2; it's copied from the Azure diagnostics plugin.

Another one that may be useful is debug_until. It prints which file is being processed until the number of messages is higher than the configured amount ... This is useful if you want to run the plugin on log.level info but still want to see whether it started properly, without overflowing the logs.
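Roughly the idea behind it, as a sketch (my own illustration, not the plugin's actual code; the variable names and the threshold of 100 are made up):

```ruby
require 'logger'

logger      = Logger.new($stdout)
debug_until = 100   # hypothetical configured message count
processed   = 0

# placeholder events; in the plugin these would come from the blobs being read
events = [['blob1.log', 0, 512], ['blob2.log', 0, 2048]]
events.each do |name, offset, length|
  # log at info level only while still below the configured threshold
  logger.info("processing #{name} from #{offset} to #{length}") if processed < debug_until
  processed += 1
end
```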

Saving the registry on the logstash system or in elasticsearch is something I'll look into next year. With multiple logstash instances or multiple logstash pipelines it can get messy really fast, so I need a good strategy.

pinochioze commented 4 years ago

Thanks Jan, I will make some tests today with my data and let you know soon.

pinochioze commented 4 years ago

Hi Jan, after checking the new version 0.11.2, the path_filters and debug_until configuration work as expected. However, there may be a mistake in your code. At line 338 in the list_blobs function, the for loop seems to be wrong; it makes the plugin read a maximum of only 15000 blobs in a container. The original was "loop do ..... end". Would you please take a look at that? Thank you so much again; the combination of prefix and path_filters makes the plugin fetch only the blobs under the prefix and reduces the number of blobs that have to be read in each loop.

pinochioze commented 4 years ago

Hi Jan, there is an issue when I set codec => json in the azure_blob_storage input of logstash (instead of codec => line). The issue happens at line 390 in the learn_encapsulation function, because there is no element in the blocks array, so blocks.first.size cannot be handled. Something seems to go wrong in the list_blob_blocks function; I checked, but the block size is always 0 for all my blobs. Would you please take a look at that? Thank you very much.
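For what it's worth, a sketch of the kind of guard that would avoid the crash (my own illustration, not the plugin's actual code; the client, container and blob name are placeholders):

```ruby
# list_blob_blocks returns committed and uncommitted blocks; blobs uploaded
# with a single put typically report an empty committed-block list.
blocks = client.list_blob_blocks('my-container', blob_name)[:committed]

if blocks.nil? || blocks.empty?
  # nothing to learn from, so the json head and tail would have to be set manually
  puts "no committed blocks for #{blob_name}, skipping learn_encapsulation"
else
  puts "first block #{blocks.first.name} has size #{blocks.first.size}"
end
```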

janmg commented 4 years ago

FYI, I pushed 0.11.3 to fix the loop problem on the nextMarker, which stopped looping after 3 iterations and so read a maximum of 15000 blobs. That 1..3 loop is there to counter a faraday issue and I shouldn't have removed the original loop. The learning of head and tail won't work without blocks, so those have to be set manually ... I put it on my TODO list to test whether there are blocks in the file.
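The paging idea, as a rough sketch (my own, not the plugin's exact code; client, container and prefix are placeholders):

```ruby
# list_blobs returns at most 5000 blobs per call, so a fixed 1..3 loop caps
# the listing at 15000 blobs; following the continuation token until it is
# empty pages through the whole container.
blobs       = []
next_marker = nil
loop do
  chunk = client.list_blobs('my-container', prefix: 'path/to/', marker: next_marker)
  blobs.concat(chunk.to_a)
  next_marker = chunk.continuation_token
  break if next_marker.nil? || next_marker.empty?
end
```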

janmg commented 3 years ago

In 0.11.6 I added a config option to skip the learning of the json head and tail. The 1..3 loop seems to work better now, and earlier I fixed a bug that caused listing too many times.