nearform / cloudwatchlogs-stream

Stream interface to CloudWatch Logs
MIT License

handling ThrottlingException: Rate exceeded #3

Open tomyan opened 9 years ago

tomyan commented 9 years ago

Great library, thanks! We've found that when we test it with a high volume of logs, we get the following exception:

ThrottlingException: Rate exceeded
    at Request.extractError (/usr/src/app/node_modules/aws-sdk/lib/protocol/json.js:43:27)
    at Request.callListeners (/usr/src/app/node_modules/aws-sdk/lib/sequential_executor.js:100:18)
    at Request.emit (/usr/src/app/node_modules/aws-sdk/lib/sequential_executor.js:77:10)
    at Request.emit (/usr/src/app/node_modules/aws-sdk/lib/request.js:604:14)
    at Request.transition (/usr/src/app/node_modules/aws-sdk/lib/request.js:21:12)
    at AcceptorStateMachine.runTo (/usr/src/app/node_modules/aws-sdk/lib/state_machine.js:14:12)
    at /usr/src/app/node_modules/aws-sdk/lib/state_machine.js:26:10
    at Request.<anonymous> (/usr/src/app/node_modules/aws-sdk/lib/request.js:22:9)
    at Request.<anonymous> (/usr/src/app/node_modules/aws-sdk/lib/request.js:606:12)
    at Request.callListeners (/usr/src/app/node_modules/aws-sdk/lib/sequential_executor.js:104:18)

Please can you let me know if there's a way to handle these currently, or, failing that, whether you'd be interested in a PR to address this?

Thanks

Tom

dberesford commented 9 years ago

Hi Tom,

you can pass some throttling-related args:

'bulkIndex': 10,
'timeout': 20

bulkIndex will batch up X log events before sending to CloudWatch; timeout will batch up log events and send them every X seconds. You can use them individually or together. Let us know if this helps.
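
For example, something along these lines (a rough sketch only: bulkIndex and timeout are the options described above, but the module name, whether it's constructed with new, and the remaining option names are assumptions, so check the README for the exact usage):

// Rough sketch: bulkIndex and timeout are the options described above; the
// module name, the use of `new`, and the other option names are assumptions.
var CloudWatchLogsStream = require('cloudwatchlogs-stream');

var stream = new CloudWatchLogsStream({
  accessKeyId: process.env.AWS_ACCESS_KEY_ID,
  secretAccessKey: process.env.AWS_SECRET_ACCESS_KEY,
  region: 'us-west-2',
  logGroupName: 'testing',
  logStreamName: 'bulk',
  bulkIndex: 10,  // flush after 10 buffered log events...
  timeout: 20     // ...or every 20 seconds, whichever comes first
});

process.stdin.pipe(stream);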

tomyan commented 9 years ago

Hi Damian,

Thanks for the response. I'd like to find a way to deal with it that doesn't change the latency of messages being shipped when the volume is low, and doesn't crash the process if the volume is higher than expected. Ideally there would be a way to buffer the messages and try resending them with some backoff. It would also be useful to be able to configure how much data should be buffered in this way before it's dropped on the floor, plus be able to subscribe to an event for when this happens (I expect we'd want to trigger an alarm if we start falling behind or losing log data). This is starting to sound like a wish list, but I was wondering if these would be features you'd be interested in?
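
For illustration, roughly this kind of thing (purely a sketch of the idea; none of these names exist in the library):

// Hypothetical sketch: buffered sends with exponential backoff on throttling,
// a cap on how much is buffered, and a 'drop' event when the cap is exceeded.
var EventEmitter = require('events').EventEmitter;
var util = require('util');

function BufferedSender (send, opts) {
  EventEmitter.call(this);
  this.send = send;                        // function (events, cb) that calls PutLogEvents
  this.maxBuffered = opts.maxBuffered || 10000;
  this.baseDelay = opts.baseDelay || 100;  // ms
  this.queue = [];
  this.attempt = 0;
}
util.inherits(BufferedSender, EventEmitter);

BufferedSender.prototype.push = function (event) {
  if (this.queue.length >= this.maxBuffered) {
    this.emit('drop', event);              // hook an alarm on this
    return;
  }
  this.queue.push(event);
  this.flush();
};

BufferedSender.prototype.flush = function () {
  if (this.sending || this.queue.length === 0) return;
  var self = this;
  var batch = this.queue.splice(0, this.queue.length);
  this.sending = true;
  this.send(batch, function (err) {
    self.sending = false;
    if (err && err.code === 'ThrottlingException') {
      // put the batch back and retry after an exponentially growing delay
      self.queue = batch.concat(self.queue);
      var delay = self.baseDelay * Math.pow(2, self.attempt++);
      setTimeout(function () { self.flush(); }, delay);
      return;
    }
    self.attempt = 0;
    if (err) self.emit('error', err);
    self.flush();
  });
};

At low volume each event is sent immediately (small batches, low latency); under load, events accumulate while a send is in flight and go out as bigger batches, with backoff kicking in only when CloudWatch throttles.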

Thanks

Tom

dberesford commented 9 years ago

Hi Tom - sounds interesting for sure; there's probably some algorithm out there for dealing with this kind of stuff, or if not it sounds like a fun one to write. Out of curiosity, under heavy load how much log data are you guys sending to CloudWatch?

mcollina commented 9 years ago

If we override _writev, we can get automatic batching done by the Writable stream itself. Thanks @mafintosh for this trick :).

I can have a look if you want @dberesford.
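
For reference, a bare-bones sketch of the _writev idea (not the library's actual code; putLogEvents stands in for whatever calls the CloudWatch API). Node only calls _writev with the chunks that queued up while a previous write was in flight, so one API call can cover the whole backlog:

// Sketch only: batching via _writev on a Writable stream.
var Writable = require('stream').Writable;
var util = require('util');

function CloudWatchWritable (putLogEvents) {
  Writable.call(this, { objectMode: true });
  this.putLogEvents = putLogEvents;
}
util.inherits(CloudWatchWritable, Writable);

// Called with a single log event when nothing else is queued.
CloudWatchWritable.prototype._write = function (event, enc, cb) {
  this.putLogEvents([event], cb);
};

// Called with everything that buffered up while the previous send was pending,
// so a single call covers the whole batch.
CloudWatchWritable.prototype._writev = function (chunks, cb) {
  var events = chunks.map(function (c) { return c.chunk; });
  this.putLogEvents(events, cb);
};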

dberesford commented 9 years ago

@mcollina you know how much I love stream's witchcraft :-)

tomyan commented 9 years ago

@dberesford probably not that much, but losing logs when the volume is high is a DoS risk.

timkoopmans commented 9 years ago

So a few observations here...

The bulk index (in fact all events) is stored in a circular buffer at the moment, using Deque

To that end, with a constant rate it should throttle accordingly with -b, e.g.

Git/90kts » while true; do echo test; sleep 0.1; done | cloudwatchlogs -a $KEY -s $SECRET -r us-west-2 -g testing -t bulk -b 50
flushing event queue size 50
flushing event queue size 50
...

It should also throttle with a timeout using -o, e.g.

Git/90kts » while true; do echo test; sleep 0.1; done | cloudwatchlogs -a $KEY -s $SECRET -r us-west-2 -g testing -t bulk -o 5
timeout reached
flushing event queue size 49
timeout reached
flushing event queue size 47
...

If you use a combination of timeout and bulk, precedence goes to whichever occurs first (bulk index size reached or timeout fired), e.g.

Git/90kts » while true; do echo test; sleep 0.1; done | cloudwatchlogs -a $KEY -s $SECRET -r us-west-2 -g testing -t bulk -b 50 -o 3
timeout reached
flushing event queue size 0
flushing event queue size 50
timeout reached
flushing event queue size 39
timeout reached
flushing event queue size 28

So you get the best of both worlds there, for a slow trickle of logs or a higher volume. I'm not sure how you're still getting rate throttling with this combo, @tomyan?
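
In other words, the flush logic boils down to roughly this (a sketch, not the actual implementation; sendToCloudWatch stands in for the real PutLogEvents call):

// Sketch of the bulk-or-timeout flush precedence: whichever fires first wins.
var bulkIndex = 50;       // -b
var timeoutSeconds = 3;   // -o
var queue = [];

function sendToCloudWatch (batch) {
  // the real code would call PutLogEvents here
}

function flush () {
  console.log('flushing event queue size', queue.length);
  var batch = queue.splice(0, queue.length);
  if (batch.length) sendToCloudWatch(batch);
}

function enqueue (event) {
  queue.push(event);
  if (queue.length >= bulkIndex) flush();   // bulk size reached first
}

setInterval(function () {                    // timeout reached first
  console.log('timeout reached');
  flush();
}, timeoutSeconds * 1000);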

One point to note though: if it just gets slammed with a ton of events at once, I believe the circular buffer won't grow quickly enough, which also means you'll lose events, e.g.

Git/90kts » cat 500_lines.log | cloudwatchlogs -a $KEY -s $SECRET -r us-west-2 -g testing -t bulk -o 5
timeout reached
flushing event queue size 2
timeout reached
flushing event queue size 0
...

It would be nice to catch that condition somehow so you know logs are missing.

I haven't done any testing beyond this in terms of stability / memory management over time. Will hopefully get to this soon...