janmg / logstash-input-azure_blob_storage

This is a plugin for Logstash to fetch files from Azure Storage Accounts

registry.dat reset #20

Closed laurentiubanica closed 2 years ago

laurentiubanica commented 2 years ago

Hi,

The registry.dat attached to this issue shows that after 19 October it started writing data from 19 September. The Logstash service was restarted on 19 October, before this issue occurred.

The registry_create_policy is set to default.

I also noticed this message in logs:

[2021-10-19T07:34:31,463][INFO ][logstash.inputs.azureblobstorage][nsg-cut][bd5b8a18d8fac3940390f2673c9391a24fcead2e3d0ea73748837d175ffcc670] Skipped writing the registry because previous write still in progress, it just takes long or may be hanging!
[2021-10-19T07:34:52,166][INFO ][logstash.inputs.azureblobstorage][nsg-cut][bd5b8a18d8fac3940390f2673c9391a24fcead2e3d0ea73748837d175ffcc670] Skipped writing the registry because previous write still in progress, it just takes long or may be hanging!

image

Thank you,

laurentiubanica commented 2 years ago

Also, would it be possible to implement a parameter for resuming from a specified date and time?

janmg commented 2 years ago

Resuming works by checking whether any of the files have grown since the last time the registry was written. The interval time defines how often the registry is saved; if it is too short and a thread is already trying to write the registry, the new write is simply skipped.
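The growth check described above can be sketched roughly as follows. This is an illustration only, not the plugin's actual code: the `grown_files` helper and the registry layout (blob name mapped to `offset` and `length` in bytes) are assumptions made for the example.

```ruby
# Hypothetical registry layout (an assumption for this sketch):
#   blob name => { offset: bytes already read, length: size at the last listing }
# `listing` maps blob name => current size in bytes, as a fresh file listing would report it.
def grown_files(registry, listing)
  listing.each_with_object({}) do |(name, length), work|
    # Blobs not yet in the registry start at offset 0.
    offset = registry.key?(name) ? registry[name][:offset] : 0
    # Only blobs with unread bytes need processing.
    work[name] = { offset: offset, length: length } if length > offset
  end
end

registry = {
  "a.json" => { offset: 100, length: 100 },
  "b.json" => { offset: 50,  length: 50 }
}
listing = { "a.json" => 100, "b.json" => 80, "c.json" => 10 }

grown_files(registry, listing).each do |name, pos|
  puts "#{name}: read from byte #{pos[:offset]} to #{pos[:length]}"
end
```

In this sketch a blob that has not grown is left alone, a blob that has grown is read from its stored offset, and a new blob is read from the start.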

The problem with resuming only part of the data is that NSG flow logs use part of their path to indicate when they were written, but they also have a timestamp inside the JSON. It would be possible to create some kind of resume function that takes the date in the file path into consideration, but that requires some funky date calculation. I prefer the resume options to be either resume where you left off, start fresh from newly created files, or read everything from the start. The registry will contain all the filenames and their current read positions.
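To make the "date in the file path" idea concrete, here is a minimal sketch of the date calculation it would involve. It is not a feature of the plugin. It assumes the blob names follow the usual NSG flow log layout ending in `/y=YYYY/m=MM/d=DD/h=HH/m=00/macAddress=.../PT1H.json`; the `path_time` and `written_since?` helpers and the sample blob name are hypothetical.

```ruby
# Assumed path layout: .../y=2021/m=10/d=19/h=07/m=00/macAddress=.../PT1H.json
PATH_TIME = %r{/y=(\d{4})/m=(\d{2})/d=(\d{2})/h=(\d{2})/m=(\d{2})/}

# Returns the UTC time encoded in the blob path, or nil if the path has no date part.
def path_time(blob_name)
  match = PATH_TIME.match(blob_name)
  return nil unless match
  Time.utc(*match.captures.map(&:to_i))
end

# True when the path date is at or after the cutoff; blobs without a date are excluded.
def written_since?(blob_name, cutoff)
  time = path_time(blob_name)
  !time.nil? && time >= cutoff
end

blob = "resourceId=/SUBSCRIPTIONS/EXAMPLE/y=2021/m=10/d=19/h=07/m=00/macAddress=000D3A000000/PT1H.json"
puts path_time(blob)
puts written_since?(blob, Time.utc(2021, 10, 19))
```

Note that this only filters on the hour encoded in the path. The timestamps inside the JSON can still differ from it, which is part of the objection raised above.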

laurentiubanica commented 2 years ago

I understand the part with resuming.

The issue in this case is that the registry writes jumped back to an older date, as highlighted in the pasted image. We started receiving older events instead of the ones that were supposed to come next, and we don't know how to fix this: logs that already arrived once are arriving again and are being duplicated. I don't understand why it jumped to an older date after the Logstash service was restarted.

The size of registry.dat is 31MB and growing, by the way. Does this size affect the way you read the file for resuming? Could having a single line with that many characters cause an overflow?

janmg commented 2 years ago

I don't know why it jumped. Normally the plugin uses one thread to write the registry. If writing is slow, the write is skipped. If a second instance is started, you may have two threads writing and committing to the same file. The original Azure plugin uses file locking to prevent this, but more often than not the lock wasn't cleared, which prevented registries from being written altogether.
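The "skip if the previous write is still running" behaviour can be sketched with a non-blocking lock. This is a hypothetical illustration, not the plugin's implementation: the `RegistryWriter` class and its `save` block are invented for the example.

```ruby
# Hypothetical helper: runs the save block unless a previous write is still in progress.
class RegistryWriter
  def initialize(&save)
    @save = save
    @busy = Mutex.new
  end

  # Returns true if the registry was written, false if the write was skipped.
  def write(registry)
    # try_lock does not wait: it returns false at once if another thread holds the lock.
    return false unless @busy.try_lock
    begin
      @save.call(registry)
      true
    ensure
      @busy.unlock
    end
  end
end

writer = RegistryWriter.new { |registry| puts "saved #{registry.size} entries" }
writer.write({ "a.json" => { offset: 10, length: 10 } })
```

A guard like this only protects threads inside one process. It would not stop a second Logstash instance from writing to the same registry file, which is the situation described above.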

The registry should reflect the actual files that are in the blobstore and match the prefix and path_filters. The values are, in bytes, how big the file was at the time of the file listing and how much of the file had already been read. 31MB is excessive, depending on how many files you have in the blobstore.

If the number of files grows too large because files are not set to be removed automatically, and the blobstore gets too slow, you can move the registry to a local path with the registry_local_path variable. Writing the registry is then a local activity and will go a lot faster.

janmg commented 2 years ago

In line 214 I could set @registry = @newreg, which would not keep the legacy of all the files that were once there. I'm probably going to do that when I also get to updating the azure_blob_storage gemspec.
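The effect of replacing the registry with a freshly built one can be sketched as below. The code at line 214 is not shown in this thread, so this is only an illustration of the idea: `prune_registry` and the registry layout (blob name mapped to `offset` and `length`) are assumptions.

```ruby
# Builds a new registry containing only blobs present in the current listing.
# `listing` maps blob name => current size in bytes (an assumed shape for this sketch).
def prune_registry(registry, listing)
  listing.each_with_object({}) do |(name, length), newreg|
    # Carry over the read position of known blobs; new blobs start at 0.
    offset = registry.fetch(name, {})[:offset] || 0
    newreg[name] = { offset: offset, length: length }
  end
end

registry = {
  "gone.json" => { offset: 5, length: 5 },
  "kept.json" => { offset: 7, length: 7 }
}
listing = { "kept.json" => 9, "new.json" => 3 }

registry = prune_registry(registry, listing)
puts registry.keys.inspect
```

Entries for blobs that no longer exist are dropped, so the registry size follows the number of blobs currently in the container rather than every blob ever seen.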

laurentiubanica commented 2 years ago

Would it be possible to find a solution for multiple threads writing to the same file other than locking the file, such as looking at the maximum date in the registry?

janmg commented 2 years ago

I pushed 0.12.0, which should limit the registry to only the actual files that are still around.

janmg commented 2 years ago

0.12.3 fixes a problem with the @registry = @newreg approach where the registry wasn't saved and resuming broke.