influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

outputs.kinesis silently drops metrics #8818

Open JeffAshton opened 3 years ago

JeffAshton commented 3 years ago

Steps to reproduce:

Use Kinesis normally. The PutRecords API by design may only be partially successful.

{
   "EncryptionType": "string",
   "FailedRecordCount": number,      // The number of unsuccessfully processed records in a PutRecords request.
   "Records": [ 
      { 
         "ErrorCode": "string",      // The error code for an individual record result. 
         "ErrorMessage": "string",
         "SequenceNumber": "string",
         "ShardId": "string"
      }
   ]
}

It's possible for the request itself to be successful, yet not actually process any records.
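In other words, an output has to inspect the response body, not just the request error. A minimal sketch of that check follows; the struct fields are simplified stand-ins mirroring the API response above, and `failedIndexes` is a hypothetical helper, not actual Telegraf or AWS SDK code:

```go
package main

import "fmt"

// Simplified stand-ins for the PutRecords response types shown above.
type PutRecordsResultEntry struct {
	ErrorCode      string // non-empty when this record failed
	SequenceNumber string
}

type PutRecordsOutput struct {
	FailedRecordCount int
	Records           []PutRecordsResultEntry
}

// failedIndexes returns the positions of records that failed and must be
// retried. The response preserves request order, so result entry i
// corresponds to request record i.
func failedIndexes(out PutRecordsOutput) []int {
	var failed []int
	for i, r := range out.Records {
		if r.ErrorCode != "" {
			failed = append(failed, i)
		}
	}
	return failed
}

func main() {
	out := PutRecordsOutput{
		FailedRecordCount: 1,
		Records: []PutRecordsResultEntry{
			{SequenceNumber: "49590338271490256608559692538361571095921575989136588898"},
			{ErrorCode: "ProvisionedThroughputExceededException"},
		},
	}
	// Only the failed subset should be resubmitted on retry.
	fmt.Println(failedIndexes(out)) // → [1]
}
```

A retry loop would resubmit just those indexes (with backoff), rather than the whole batch.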

Expected behavior:

At a minimum, communicate that metrics are being dropped. Ideally this would be captured by telegraf_internal_write.

But I'm going to work towards adding retry support.

Actual behavior:

Metrics get silently dropped.

powersj commented 2 years ago

@JeffAshton,

It looks like you completed a PR via #8817, which says it is part 1. Was there an additional PR that went in such that this is now resolved?

Thanks!

JeffAshton commented 2 years ago

> @JeffAshton,
>
> It looks like you completed a PR via #8817, which says it is part 1. Was there an additional PR that went in such that this is now resolved?
>
> Thanks!

I actually ended up just writing my own output plugin:

https://github.com/Brightspace/telegraf/tree/master/plugins/outputs/d2l_kinesis

I wanted to gzip multiple metrics into a single Kinesis record. That ended up reducing our Kinesis shard count drastically and saved a bit of money. Trying to maintain backwards compatibility with options like the partition method complicated things beyond my needs.

Writing a single metric per Kinesis record is cost prohibitive in my experience.
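The batching idea can be sketched with the standard library's `compress/gzip`: serialize a batch of metrics (e.g. line protocol), gzip them into one payload, and ship that as a single Kinesis record. This is an illustrative sketch, not the actual d2l_kinesis implementation, and `packMetrics` is a hypothetical helper:

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
)

// packMetrics gzips a batch of serialized metric lines into one payload
// suitable for a single Kinesis record (which may hold up to 1 MB).
func packMetrics(lines []string) ([]byte, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	for _, l := range lines {
		if _, err := zw.Write([]byte(l + "\n")); err != nil {
			return nil, err
		}
	}
	// Close flushes the compressor and writes the gzip footer.
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	lines := []string{
		"cpu,host=a usage_idle=98.2 1610000000000000000",
		"cpu,host=b usage_idle=97.5 1610000000000000000",
	}
	payload, err := packMetrics(lines)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(payload) > 0) // → true
}
```

One record per batch means the per-record throughput limit is hit far later, which is what makes the shard reduction below possible.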

powersj commented 2 years ago

Thanks for the update and it is interesting to hear about the costs.

In terms of this bug, and silently dropping metrics, was this at least resolved?

JeffAshton commented 2 years ago

> Thanks for the update and it is interesting to hear about the costs.

I addressed it in my output plugin.

> In terms of this bug, and silently dropping metrics, was this at least resolved?

The costs balloon because of this ingestion limit (I'm paraphrasing):

A single shard can ingest up to 1,000 records per second for writes.

https://docs.aws.amazon.com/streams/latest/dev/service-sizes-and-limits.html

So if you're trying to write 20,000 metrics a second, for example, you'll need 20 shards at a minimum (probably double in practice if you don't want metrics being dropped constantly). At $0.04 per shard-hour:

30 days × 24 hours × $0.04 × 20 shards = $576 per month

Compressing the metrics into a single record, I'm able to comfortably get away with a single shard:

30 days × 24 hours × $0.04 × 1 shard = $28.80 per month
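The arithmetic above can be checked with a small sketch; the helper names and the use of integer cents (to avoid float rounding) are my own, while the 1,000 records/sec per-shard limit and $0.04/shard-hour rate come from the AWS figures quoted earlier:

```go
package main

import (
	"fmt"
	"math"
)

// shardsNeeded returns the minimum shard count for a sustained write
// rate, given the 1,000 records/sec per-shard ingestion limit.
func shardsNeeded(recordsPerSec float64) int {
	return int(math.Ceil(recordsPerSec / 1000))
}

// monthlyCostCents computes a 30-day cost in cents at $0.04 (4 cents)
// per shard-hour, using integer math to keep the result exact.
func monthlyCostCents(shards int) int {
	return shards * 4 * 24 * 30
}

func main() {
	s := shardsNeeded(20000)
	fmt.Printf("shards=%d cost=$%.2f/month\n", s, float64(monthlyCostCents(s))/100)
	fmt.Printf("shards=1 cost=$%.2f/month\n", float64(monthlyCostCents(1))/100)
	// → shards=20 cost=$576.00/month
	// → shards=1 cost=$28.80/month
}
```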