Is your feature request related to a problem? Please describe.
For pull-based sources that perform bulk reading, such as S3 scan or the OpenSearch source that is currently in PR, I would like a mechanism to track which data has been read and processed. This could include cases where data is dropped or where a node in my Data Prepper cluster becomes unresponsive.
Describe the solution you'd like
An audit log comes to mind. This log would contain a list of data processing events related to documents, indices, or other metadata determined by the source. These logs could be used to determine the exact time frame in which a set of data was pulled into Data Prepper.
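A minimal sketch of what one audit record could look like. The field names (`sourceId`, `action`) and action values are hypothetical illustrations, not an agreed schema; rendering each record as a single log line is one way the time frame of a scan could later be reconstructed.

```java
import java.time.Instant;

// Hypothetical audit record for a pull-based source; field names and
// action values are illustrative, not an agreed Data Prepper schema.
public class AuditRecord {
    final String sourceId;   // e.g. an S3 object key or an OpenSearch index
    final String action;     // e.g. READ_STARTED, READ_COMPLETED, DROPPED
    final Instant timestamp;

    AuditRecord(String sourceId, String action, Instant timestamp) {
        this.sourceId = sourceId;
        this.action = action;
        this.timestamp = timestamp;
    }

    // Render as a single log line so the exact time frame of a scan
    // can be reconstructed by filtering the audit log.
    String toLogLine() {
        return String.format("AUDIT source=%s action=%s time=%s",
                sourceId, action, timestamp);
    }

    public static void main(String[] args) {
        AuditRecord r = new AuditRecord(
                "s3://bucket/logs/2023-01-01.gz",   // hypothetical object key
                "READ_COMPLETED",
                Instant.parse("2023-01-01T00:05:00Z"));
        System.out.println(r.toLogLine());
    }
}
```

Such records could be emitted per document, per index, or per scanned object, depending on the granularity the source decides to audit.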
Describe alternatives you've considered (Optional)
Metrics tracking the completion percentage of a scan.
Improving existing logs by adding an Audit tag to messages that record relevant data processing events.
Not including audit logs at all. I am not sure this makes sense in Data Prepper; audit logging would be a new requirement that we might have to enforce on every plugin if we wanted to track data through a pipeline.
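The log-tag alternative above could be as lightweight as prefixing existing messages with a marker that downstream tooling can filter on. Plain stdout stands in here for whatever logging framework a plugin actually uses, and the tag format is a guess:

```java
// Sketch of the Audit-tag alternative: prefix relevant messages with a
// marker so they can be filtered out of ordinary logs later. The [AUDIT]
// format is hypothetical, not an existing Data Prepper convention.
public class AuditTagDemo {
    static String audit(String message) {
        return "[AUDIT] " + message;
    }

    public static void main(String[] args) {
        System.out.println(audit("scan of index web-logs-2023 completed"));
    }
}
```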
This idea is still vague, and the alternatives leave a lot of ambiguity. We need to tighten down the requirements and figure out exactly what we want to support.