CloudTrail-optimized polling

A very common use case for S3 polling is ingesting CloudTrail logs, which have a fixed key format within a bucket: /AWSLogs/<AccountId>/CloudTrail/<region>/<YYYY>/<MM>/<DD>/<AccountId>_CloudTrail_<region>_<ISODate>_<random>.json.gz
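
For illustration, here is a minimal sketch of splitting such a key into its components. The helper name `parse_cloudtrail_key` and the constant `CLOUDTRAIL_KEY` are hypothetical, and the character classes (digits for the account ID, no underscores inside the date or random parts) are assumptions based on the format above rather than anything the plugin defines.

```ruby
# Hypothetical pattern for the key format described above.
# Assumptions: the account ID is numeric, the leading "/" is optional,
# and neither <ISODate> nor <random> contains "_" or "/".
CLOUDTRAIL_KEY = %r{
  \A/?AWSLogs/(?<account_id>\d+)/CloudTrail/(?<region>[^/]+)/
  (?<year>\d{4})/(?<month>\d{2})/(?<day>\d{2})/
  \k<account_id>_CloudTrail_\k<region>_(?<timestamp>[^_/]+)_(?<random>[^_/.]+)\.json\.gz\z
}x

# Returns a Hash of the named parts (string keys), or nil if the key
# does not look like a CloudTrail log object.
def parse_cloudtrail_key(key)
  match = CLOUDTRAIL_KEY.match(key)
  match && match.named_captures
end
```

The backreferences require the account ID and region in the file name to match those in the prefix, so objects that ended up under the wrong prefix are rejected rather than silently accepted.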

Given this fixed structure, ingest and incremental polling can be optimized by relying on two properties:

Objects are never rewritten or appended to once created.
Within a given account and region, only one sub-prefix (the current date) is written to at any given time.

The process would look something like:

Walk the prefix tree to build an initial list of /AWSLogs/<AccountId>/CloudTrail/<region>/ prefixes
For each prefix in the list, spawn a poller thread:
- Walk the prefix tree to the first <YYYY>/<MM>/<DD>/ sub-prefix
- List objects within this prefix, paging through results with max_keys, start_after, and the next_continuation_token from each response (passed back as continuation_token) until no further objects are returned
- When no further objects are returned, remove the <DD> token from current_prefix to get parent_prefix and call list_objects_v2({prefix: parent_prefix, delimiter: '/', start_after: current_prefix})
- If a new common prefix is returned, update current_prefix and begin listing objects
- If no new prefix is returned, repeat for <MM> and <YYYY> tokens
- If no new sub-prefix is discovered, store the last object key as start_after and sleep for a period of time
- Restart the polling loop
Periodically check to see if new /AWSLogs/<AccountId>/CloudTrail/<region> prefixes are present and spawn new poller threads as necessary
If a poller thread's /AWSLogs/<AccountId>/CloudTrail/<region> prefix disappears, it should terminate.
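
The steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: `CloudTrailPoller`, `poll_once`, and `discover_region_prefixes` are hypothetical names, `client` is anything that responds to `list_objects_v2` the way `Aws::S3::Client` does, and keys are assumed to start with `AWSLogs/` with no leading slash. Threading, sleeping, and error handling are left out.

```ruby
# Hypothetical: list the AWSLogs/<AccountId>/CloudTrail/<region>/ prefixes.
def discover_region_prefixes(client, bucket, root = "AWSLogs/")
  list = lambda do |prefix|
    resp = client.list_objects_v2(bucket: bucket, prefix: prefix, delimiter: "/")
    resp.common_prefixes.map(&:prefix)
  end
  list.call(root).flat_map { |account| list.call("#{account}CloudTrail/") }
end

# Hypothetical poller for one <AccountId>/<region> prefix.
class CloudTrailPoller
  attr_reader :current_prefix, :start_after

  def initialize(client, bucket, current_prefix, start_after = nil, max_keys = 1000)
    @client = client
    @bucket = bucket
    @current_prefix = current_prefix # full ".../<YYYY>/<MM>/<DD>/" prefix
    @start_after = start_after       # last object key processed, if any
    @max_keys = max_keys
  end

  # One pass: yields each new object key, moving on to later date
  # prefixes until none is left. The caller sleeps between passes.
  def poll_once(&block)
    loop do
      list_new_objects(&block)
      next_prefix = find_next_date_prefix
      break if next_prefix.nil?
      @current_prefix = next_prefix
      @start_after = nil
    end
  end

  private

  def list_new_objects
    token = nil
    loop do
      params = { bucket: @bucket, prefix: @current_prefix, max_keys: @max_keys }
      if token
        params[:continuation_token] = token
      elsif @start_after
        params[:start_after] = @start_after
      end
      resp = @client.list_objects_v2(params)
      resp.contents.each do |object|
        yield object.key
        @start_after = object.key
      end
      break unless resp.is_truncated
      token = resp.next_continuation_token
    end
  end

  # Climb from <DD> to <MM> to <YYYY> looking for a later sibling prefix,
  # then descend to the earliest day below it.
  def find_next_date_prefix
    child = @current_prefix
    3.times do |depth|
      parent = child.sub(%r{[^/]+/\z}, "")
      sibling = sub_prefixes(parent, child).find { |prefix| prefix > child }
      return descend(sibling, depth) if sibling
      child = parent
    end
    nil
  end

  def descend(prefix, levels)
    levels.times do
      prefix = sub_prefixes(prefix).min
      return nil if prefix.nil?
    end
    prefix
  end

  # Not paginated: a parent holds at most 31 days or 12 months. Results are
  # filtered again by the caller in case the current prefix is returned too.
  def sub_prefixes(parent, after = nil)
    params = { bucket: @bucket, prefix: parent, delimiter: "/" }
    params[:start_after] = after if after
    @client.list_objects_v2(params).common_prefixes.map(&:prefix)
  end
end
```

A caller would create one poller per prefix returned by `discover_region_prefixes`, call `poll_once { |key| ... }` in a loop with a sleep in between, and persist `current_prefix` and `start_after` after each pass.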

Using the above logic, the lastdb file only needs to persist a small amount of information:

List of /AWSLogs/<AccountId>/CloudTrail/<region>/ prefixes with:
- current_prefix (<YYYY>/<MM>/<DD>/)
- next_continuation_token (opaque)
- start_after (last object key processed)
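
As a rough sketch of what that persisted state could look like, the following stores it as JSON keyed by prefix. The file layout and the helper names `save_state` and `load_state` are assumptions for illustration, not the plugin's existing format.

```ruby
require "json"

# Hypothetical helpers: persist per-prefix poller state as JSON.
# Writing to a temporary file and renaming it avoids leaving a
# half-written state file behind if the process dies mid-write.
def save_state(path, state)
  tmp = "#{path}.tmp"
  File.write(tmp, JSON.pretty_generate(state))
  File.rename(tmp, path)
end

# Returns an empty Hash when no state has been saved yet.
def load_state(path)
  File.exist?(path) ? JSON.parse(File.read(path)) : {}
end

# Example shape: one entry per /AWSLogs/<AccountId>/CloudTrail/<region>/ prefix.
EXAMPLE_STATE = {
  "AWSLogs/123456789012/CloudTrail/us-east-1/" => {
    "current_prefix" => "2020/01/15/",
    "next_continuation_token" => nil,
    "start_after" => nil
  }
}.freeze
```

The state stays a few hundred bytes per account and region regardless of how many objects the bucket holds, since only the position within the key space is recorded.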

I am happy to work on this as an optimized poller class that could be selected via a configuration option. Should I base it on the current master branch or on the WIP threading branch?

logstash-plugins / logstash-input-s3

CloudTrail-optimized polling #86