Alternate SinceDB implementation

brandond commented 8 years ago

Here's my first shot at a patch to address #86. It changes the sincedb and poller functionality to store a 'marker' so that the poller remembers where it was in the object list the last time it ran and picks up there again. It does this by holding a rolling tail list of objects that have been queued for processing and remembering the earliest one in the list that has been successfully processed. This is where it resumes when listing bucket objects.
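As a rough illustration of the marker idea described above, here is a minimal Ruby sketch. It is not the code from this PR: `TailMarker`, `queue`, `complete`, and `list_after` are hypothetical names, the bucket listing is simulated with a plain array instead of the AWS SDK, and the rule used for advancing the marker (move it past every leading queued key that has finished) is just one conservative reading of the description.

```ruby
# Hypothetical sketch: TailMarker and list_after are illustrative names,
# not classes or methods from logstash-input-s3.
class TailMarker
  attr_reader :marker

  def initialize(marker = nil)
    @marker = marker # last key it is safe to resume listing after
    @pending = []    # keys queued for processing, in listing order
    @done = {}       # key => true once a worker has finished it
  end

  # Record a key that has been handed off for processing.
  def queue(key)
    @pending << key
  end

  # Mark a key as processed, then advance the marker past every
  # leading key in the tail list that has finished.
  def complete(key)
    @done[key] = true
    while !@pending.empty? && @done[@pending.first]
      @marker = @pending.shift
      @done.delete(@marker)
    end
  end
end

# Stand-in for a bucket listing: keys strictly after the marker,
# in lexical order (assumption: new keys always sort last).
def list_after(keys, marker)
  sorted = keys.sort
  marker.nil? ? sorted : sorted.select { |k| k > marker }
end

keys = %w[logs/001 logs/002 logs/003]
tracker = TailMarker.new
list_after(keys, nil).each { |k| tracker.queue(k) }

tracker.complete("logs/002") # marker stays nil: logs/001 is still in flight
tracker.complete("logs/001") # marker advances to "logs/002"

# The next poll lists only keys after the marker.
next_batch = list_after(keys + ["logs/004"], tracker.marker)
```

In this sketch a key that finishes out of order does not move the marker until everything queued before it has also finished, so a restart may reprocess some objects but should not skip any.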

This really only works for buckets where new objects can be counted on to show up at the 'end' of the key space within a given prefix. This is guaranteed to be true for things like CloudTrail, but probably not for other use cases. I'll probably continue to enhance this PR to allow for selectable polling strategies that could include:

Scan all contents, remembering those that have been processed (default with threading fork)
Tail marker (added by this PR)
Multi-prefix tail marker (may also be added by this PR)
Passive SNS/SQS (tbd)

ph commented 7 years ago

Just got back from vacation, will check ASAP.

brandond commented 7 years ago

@ph I've been running this for a few weeks now, watching a couple of S3 buckets containing CloudTrail logs and (more recently) Firehose exports from CloudWatch Logs. It works a lot differently than it used to, but runs with much lower overhead.

I haven't even looked at any of the tests; I can try to get those cleaned up at some point.

logstash-plugins / logstash-input-s3

Alternate SinceDB implementation #89