warpstreamlabs / bento

Fancy stream processing made operationally mundane. This repository is a fork of the original project before the license was changed.
https://warpstreamlabs.github.io/bento/

add metadata blob_storage_total_files and blob_storage_file_index on azure blob storage input #89

Open mrchypark opened 3 months ago

mrchypark commented 3 months ago

This PR adds two new metadata fields to the Azure Blob Storage input:

- blob_storage_total_files: The total number of files in the Azure Blob Storage container.
- blob_storage_file_index: The index of the file currently being processed.

These new metadata fields give users additional context about the progress of file processing in their Azure Blob Storage input.

Changes:

- Added totalFiles and currentIndex fields to the azureBlobStorage struct.
- Modified the Connect method to count the total number of files.
- Updated the blobStorageMetaToBatch function to include the new metadata fields.
- Incremented currentIndex after processing each file in the ReadBatch method.

These changes help users track the progress of their Azure Blob Storage input, especially when dealing with large numbers of files. The new metadata can be used for logging, monitoring, or implementing custom logic based on processing progress.
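The bookkeeping described above can be sketched roughly as follows. This is a minimal, standalone illustration and not the code in this PR: the fileProgress type and its connect, metadata, and advance methods are hypothetical stand-ins for the azureBlobStorage struct, Connect, blobStorageMetaToBatch, and ReadBatch. It also assumes the index is zero-based and is attached before being incremented, which the description does not specify.

```go
package main

import "fmt"

// fileProgress is a hypothetical stand-in for the totalFiles and
// currentIndex fields added to the azureBlobStorage struct.
type fileProgress struct {
	totalFiles   int
	currentIndex int
}

// connect mimics the counting step described for Connect: it records
// how many blobs were listed and resets the index.
func (p *fileProgress) connect(blobNames []string) {
	p.totalFiles = len(blobNames)
	p.currentIndex = 0
}

// metadata returns the two values that would be attached to a message.
// Assumption: the index is zero-based.
func (p *fileProgress) metadata() map[string]int {
	return map[string]int{
		"blob_storage_total_files": p.totalFiles,
		"blob_storage_file_index":  p.currentIndex,
	}
}

// advance mimics the increment performed after a file has been read.
func (p *fileProgress) advance() {
	p.currentIndex++
}

func main() {
	blobs := []string{"a.csv", "b.csv", "c.csv"}

	var p fileProgress
	p.connect(blobs)

	for _, name := range blobs {
		m := p.metadata()
		fmt.Printf("%s: index=%d total=%d\n",
			name, m["blob_storage_file_index"], m["blob_storage_total_files"])
		p.advance()
	}
}
```

Note that in this sketch the total is a snapshot taken at connect time, so blobs added to the container afterwards would not be reflected in it.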

Testing:

- Tested the new metadata fields with various file counts in Azure Blob Storage containers.
- Verified that blob_storage_total_files remains constant throughout processing.
- Confirmed that blob_storage_file_index increments correctly for each processed file.

Please review and let me know if any further changes or clarifications are needed.
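The two properties checked above (a constant total and a steadily incrementing index) could be expressed as a small helper along these lines. This is only a sketch: checkProgress is a hypothetical function, the metadata is modelled as plain map[string]int values rather than real message metadata, and the index is again assumed to start at zero.

```go
package main

import "fmt"

// checkProgress verifies, for metadata collected in processing order, that
// blob_storage_total_files never changes and that blob_storage_file_index
// increases by one per file. Assumption: the index starts at 0.
func checkProgress(metas []map[string]int) error {
	if len(metas) == 0 {
		return nil
	}
	total := metas[0]["blob_storage_total_files"]
	for i, m := range metas {
		if got := m["blob_storage_total_files"]; got != total {
			return fmt.Errorf("file %d: total changed from %d to %d", i, total, got)
		}
		if got := m["blob_storage_file_index"]; got != i {
			return fmt.Errorf("file %d: expected index %d, got %d", i, i, got)
		}
	}
	return nil
}

func main() {
	// Hand-written sample values for illustration, not output from a real run.
	metas := []map[string]int{
		{"blob_storage_total_files": 3, "blob_storage_file_index": 0},
		{"blob_storage_total_files": 3, "blob_storage_file_index": 1},
		{"blob_storage_total_files": 3, "blob_storage_file_index": 2},
	}
	if err := checkProgress(metas); err != nil {
		fmt.Println("check failed:", err)
		return
	}
	fmt.Println("metadata is consistent")
}
```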