influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

Use of include_linked_accounts causes an IndexOutOfRange error when requesting SES metrics #15422

Closed alexeiser closed 4 months ago

alexeiser commented 4 months ago

Relevant telegraf.conf

[[inputs.cloudwatch]]
  region = "us-west-2"
  period = "6h"
  delay = "1h"
  interval = "6h"
  cache_ttl = "1h"
  namespaces = ["AWS/SES"]
  include_linked_accounts = true
  ratelimit = 10

  [[inputs.cloudwatch.metrics]]
    names = ["Bounce"]
    statistic_include = ["sum"]

Logs from Telegraf

docker run -e AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY -e AWS_SESSION_TOKEN  -it  -v /tmp/test.conf:/etc/telegraf/telegraf.conf:ro --rm telegraf:1.30.3-alpine telegraf --debug --test --config /etc/telegraf/telegraf.conf
2024-05-29T20:45:37Z I! Loading config: /etc/telegraf/telegraf.conf
2024-05-29T20:45:37Z I! Starting Telegraf 1.30.3 brought to you by InfluxData the makers of InfluxDB
2024-05-29T20:45:37Z I! Available plugins: 233 inputs, 9 aggregators, 31 processors, 24 parsers, 60 outputs, 6 secret-stores
2024-05-29T20:45:37Z I! Loaded inputs: cloudwatch
2024-05-29T20:45:37Z I! Loaded aggregators: 
2024-05-29T20:45:37Z I! Loaded processors: 
2024-05-29T20:45:37Z I! Loaded secretstores: 
2024-05-29T20:45:37Z W! Outputs are not used in testing mode!
2024-05-29T20:45:37Z I! Tags enabled: host=9b5bbd346dbe
2024-05-29T20:45:37Z D! [agent] Initializing plugins
2024-05-29T20:45:37Z D! [agent] Starting service inputs
panic: runtime error: index out of range [0] with length 0
2024-05-29T13:45:37-07:00 ERR Executed command returned error: exit status 2 component=exec service=aws version=UNSET

On a non-Docker install, it provides a more detailed stack trace:
panic: runtime error: index out of range [0] with length 0

goroutine 12 [running]:
github.com/influxdata/telegraf/plugins/inputs/cloudwatch.(*CloudWatch).getDataQueries(0xc001e6a000, {0xc001a1b180, 0x1, 0xdb797c0?})
    /go/src/github.com/influxdata/telegraf/plugins/inputs/cloudwatch/cloudwatch.go:373 +0xbf8
github.com/influxdata/telegraf/plugins/inputs/cloudwatch.(*CloudWatch).Gather(0xc001e6a000, {0x8b73fa0, 0xc001a101c0})
    /go/src/github.com/influxdata/telegraf/plugins/inputs/cloudwatch/cloudwatch.go:125 +0xa7
github.com/influxdata/telegraf/agent.(*Agent).testRunInputs.func2(0xc001e18120)
    /go/src/github.com/influxdata/telegraf/agent/agent.go:516 +0x2ca
created by github.com/influxdata/telegraf/agent.(*Agent).testRunInputs in goroutine 10
    /go/src/github.com/influxdata/telegraf/agent/agent.go:485 +0xd1

System info

Telegraf 1.30.2 ubuntu 22.04. Also occurs on Telegraf 1.30.3 in docker, and 1.30.0.

Docker

No response

Steps to reproduce

  1. Create 2 AWS accounts
  2. Configure one to be a "monitoring" account - https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Unified-Cross-Account.html
  3. Attempt to fetch the metrics for SES Bounce rates on the monitoring account.

Expected behavior

Return the metrics for the monitored accounts.

Actual behavior

Stack trace / crash.

Additional info

Running without include_linked_accounts = true returns valid results when using the sub-account credentials, and returns no results (as expected) on the monitoring account.

powersj commented 4 months ago

It looks like the stack trace is coming from the following code:

if c.IncludeLinkedAccounts {
    accountID = aws.String(filtered.accounts[j])
    (*dimension)["account"] = filtered.accounts[j]
}

The panic comes from the first access to filtered.accounts[j], which sets the accountID. This would imply that there is no accountID?

I will build you a debug version tomorrow to make sure we are doing the right thing internally and then we can decide if we just skip that logic if it is out of bounds or if something else is going wrong.
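
The "skip that logic if it is out of bounds" option could look roughly like the sketch below. The `filteredMetric` type and the `accountFor` helper are hypothetical stand-ins made up for illustration; they are not the plugin's actual types or functions.

```go
package main

import "fmt"

// filteredMetric is a hypothetical stand-in for the plugin's internal type:
// a list of metric names plus an optional, parallel list of owning accounts.
type filteredMetric struct {
	metrics  []string
	accounts []string
}

// accountFor returns the owning account for metric j. It reports false
// instead of panicking when no account was recorded at that index.
func accountFor(f filteredMetric, j int) (string, bool) {
	if j < 0 || j >= len(f.accounts) {
		return "", false
	}
	return f.accounts[j], true
}

func main() {
	// One metric, but no accounts recorded: the situation from the stack trace.
	f := filteredMetric{metrics: []string{"Bounce"}}
	for j, name := range f.metrics {
		tags := map[string]string{}
		if id, ok := accountFor(f, j); ok {
			tags["account"] = id
		}
		fmt.Println(name, tags)
	}
}
```

A guard like this only avoids the crash. It does not explain why the accounts list is empty in the first place.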

alexeiser commented 4 months ago

If it helps - here are the list-metrics outputs:

❯ AWS_PROFILE=SUB_ACCOUNT aws cloudwatch list-metrics --namespace AWS/SES --region us-west-2  --metric-name Bounce
{
    "Metrics": [
        {
            "Namespace": "AWS/SES",
            "MetricName": "Bounce",
            "Dimensions": []
        }
    ]
}

❯ AWS_PROFILE=SUB_ACCOUNT aws cloudwatch list-metrics --namespace AWS/SES --region us-west-2  --metric-name Bounce --include-linked-accounts               
{
    "Metrics": [
        {
            "Namespace": "AWS/SES",
            "MetricName": "Bounce",
            "Dimensions": []
        }
    ],
    "OwningAccounts": [
        "XXXXXXX33012"
    ]
}

❯ AWS_PROFILE=MONITORING_ACCOUNT aws cloudwatch list-metrics --namespace AWS/SES --region us-west-2  --metric-name Bounce --include-linked-accounts
{
    "Metrics": [
        {
            "Namespace": "AWS/SES",
            "MetricName": "Bounce",
            "Dimensions": []
        }
    ],
    "OwningAccounts": [
        "XXXXXXX33012"
    ]
}

❯ AWS_PROFILE=MONITORING_ACCOUNT aws cloudwatch list-metrics --namespace AWS/SES --region us-west-2  --metric-name Bounce                          
{
    "Metrics": []
}

powersj commented 4 months ago

I have put up https://github.com/influxdata/telegraf/pull/15428 which will have some test artifacts attached in a comment by the telegraf-tiger in 20-30mins from this message. Can you download one of the artifacts and give it a try please?

Thanks!

alexeiser commented 4 months ago

I have put up #15428 which will have some test artifacts attached in a comment by the telegraf-tiger in 20-30mins from this message. Can you download one of the artifacts and give it a try please?

Good news - no crash (to be expected since you prevent the call).

Bad news - no metrics either.

Same config file:

[[inputs.cloudwatch]]
  region = "us-west-2"
  period = "144h"
  delay = "1h"
  interval = "144h"
  cache_ttl = "1h"
  namespaces = ["AWS/SES"]
  include_linked_accounts = true
  ratelimit = 10

  [[inputs.cloudwatch.metrics]]
    names = ["Bounce"]
    statistic_include = ["sum"]

Monitoring account:

>  docker run -e AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY -e AWS_SESSION_TOKEN  -it -v /Users/alexe/Downloads/telegraf-1.31.0/usr/bin/:/opt/telegraf-test -v /tmp/test.conf:/etc/telegraf/telegraf.conf:ro --rm telegraf:1.30.3-alpine /opt/telegraf-test/telegraf --debug --test --config /etc/telegraf/telegraf.conf

2024-05-30T19:52:28Z I! Loading config: /etc/telegraf/telegraf.conf
2024-05-30T19:52:28Z I! Starting Telegraf 1.31.0-8bda03e1 brought to you by InfluxData the makers of InfluxDB
2024-05-30T19:52:28Z I! Available plugins: 234 inputs, 9 aggregators, 32 processors, 26 parsers, 60 outputs, 6 secret-stores
2024-05-30T19:52:28Z I! Loaded inputs: cloudwatch
2024-05-30T19:52:28Z I! Loaded aggregators: 
2024-05-30T19:52:28Z I! Loaded processors: 
2024-05-30T19:52:28Z I! Loaded secretstores: 
2024-05-30T19:52:28Z W! Outputs are not used in testing mode!
2024-05-30T19:52:28Z I! Tags enabled: host=5c66327c74cd
2024-05-30T19:52:28Z D! [agent] Initializing plugins
2024-05-30T19:52:28Z D! [agent] Starting service inputs
2024-05-30T19:52:28Z D! [agent] Stopping service inputs
2024-05-30T19:52:28Z D! [agent] Input channel closed
2024-05-30T19:52:28Z D! [agent] Stopped Successfully

Sub account

docker run -e AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY -e AWS_SESSION_TOKEN  -it -v /Users/alexe/Downloads/telegraf-1.31.0/usr/bin/:/opt/telegraf-test -v /tmp/test.conf:/etc/telegraf/telegraf.conf:ro --rm telegraf:1.30.3-alpine /opt/telegraf-test/telegraf --debug --test --config /etc/telegraf/telegraf.conf

2024-05-30T19:52:43Z I! Loading config: /etc/telegraf/telegraf.conf
2024-05-30T19:52:43Z I! Starting Telegraf 1.31.0-8bda03e1 brought to you by InfluxData the makers of InfluxDB
2024-05-30T19:52:43Z I! Available plugins: 234 inputs, 9 aggregators, 32 processors, 26 parsers, 60 outputs, 6 secret-stores
2024-05-30T19:52:43Z I! Loaded inputs: cloudwatch
2024-05-30T19:52:43Z I! Loaded aggregators: 
2024-05-30T19:52:43Z I! Loaded processors: 
2024-05-30T19:52:43Z I! Loaded secretstores: 
2024-05-30T19:52:43Z W! Outputs are not used in testing mode!
2024-05-30T19:52:43Z I! Tags enabled: host=c18dea55aa60
2024-05-30T19:52:43Z D! [agent] Initializing plugins
2024-05-30T19:52:43Z D! [agent] Starting service inputs
2024-05-30T19:52:44Z D! [agent] Stopping service inputs
2024-05-30T19:52:44Z D! [agent] Input channel closed
2024-05-30T19:52:44Z D! [agent] Stopped Successfully
> cloudwatch_aws_ses,host=c18dea55aa60,region=us-west-2 bounce_sum=2 1716576720000000000

powersj commented 4 months ago

Bad news - no metrics either.

Can you help me understand what you expected to see? From the output you showed I am assuming you see the bounce metric with your subaccount, but not your monitoring account? Wouldn't that point to a permission issue?

I want this fix in our next feature release a week from Monday, so I'm going to get that reviewed and we can talk about your possibly missing metrics.

alexeiser commented 4 months ago

In the example above, I would have expected the following from the monitoring account:

cloudwatch_aws_ses,host=c18dea55aa60,region=us-west-2,account=XXXXXXX33012 bounce_sum=2 1716576720000000000

You can see from the list-metrics calls I made earlier that the monitoring account can see the sub-account's metric.

powersj commented 4 months ago

Besides the credentials, what is the difference between those two runs?

alexeiser commented 4 months ago

The monitoring test uses an AWS IAM role from the monitoring account, and the sub-account test uses a sub-account role.

alexeiser commented 4 months ago

Here is a different example that does work and is not missing metrics.

New config:

[[inputs.cloudwatch]]
  region = "us-west-2"
  period = "144h"
  delay = "1h"
  interval = "144h"
  cache_ttl = "1h"
  namespaces = ["AWS/SES"]
  include_linked_accounts = true
  ratelimit = 10

  [[inputs.cloudwatch.metrics]]
    names = ["Bounce"]
    statistic_include = ["sum"]

[[inputs.cloudwatch]]
  alias = "RDS"
  region = "us-west-2"
  period = "1h"
  delay = "1h"
  interval = "1h"
  cache_ttl = "1h"
  namespaces = ["AWS/RDS"]
  include_linked_accounts = true
  ratelimit = 10

  [[inputs.cloudwatch.metrics]]
    names = ["DatabaseConnections"]
    statistic_include = ["maximum"]

    [[inputs.cloudwatch.metrics.dimensions]]
      name = "DBInstanceIdentifier"
      value = "*"

This adds an RDS metric.

MONITORING ACCOUNT:

❯ docker run -e AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY -e AWS_SESSION_TOKEN  -it -v /Users/alexe/Downloads/telegraf-1.31.0/usr/bin/:/opt/telegraf-test -v /tmp/test.conf:/etc/telegraf/telegraf.conf:ro --rm telegraf:1.30.3-alpine /opt/telegraf-test/telegraf --debug --test --config /etc/telegraf/telegraf.conf
2024-05-30T20:51:16Z I! Loading config: /etc/telegraf/telegraf.conf
2024-05-30T20:51:16Z I! Starting Telegraf 1.31.0-8bda03e1 brought to you by InfluxData the makers of InfluxDB
2024-05-30T20:51:16Z I! Available plugins: 234 inputs, 9 aggregators, 32 processors, 26 parsers, 60 outputs, 6 secret-stores
2024-05-30T20:51:16Z I! Loaded inputs: cloudwatch (2x)
2024-05-30T20:51:16Z I! Loaded aggregators: 
2024-05-30T20:51:16Z I! Loaded processors: 
2024-05-30T20:51:16Z I! Loaded secretstores: 
2024-05-30T20:51:16Z W! Outputs are not used in testing mode!
2024-05-30T20:51:16Z I! Tags enabled: host=5f07dc0b6ce8
2024-05-30T20:51:16Z D! [agent] Initializing plugins
2024-05-30T20:51:16Z D! [agent] Starting service inputs
> cloudwatch_aws_rds,account=XXXXXXX6611,db_instance_identifier=XXXXX-multi-stage-1,host=XXXXX,region=us-west-2 database_connections_maximum=21 1717095060000000000
> cloudwatch_aws_rds,account=XXXXXXX6611,db_instance_identifier=XXXXX-multi-stage-1,host=XXXXX,region=us-west-2 database_connections_maximum=2 1717095060000000000
> cloudwatch_aws_rds,account=XXXXXXX3012,db_instance_identifier=XXXXX-testing-2,host=XXXXX,region=us-west-2 database_connections_maximum=0 1717095060000000000
> cloudwatch_aws_rds,account=XXXXXXX3012,db_instance_identifier=XXXXX-testing-1,host=XXXXX,region=us-west-2 database_connections_maximum=69 1717095060000000000
> cloudwatch_aws_rds,account=XXXXXXX6611,db_instance_identifier=XXXXX-db,host=XXXXX,region=us-west-2 database_connections_maximum=0 1717095060000000000
> cloudwatch_aws_rds,account=XXXXXXX3012,db_instance_identifier=XXXXX-synth-devmark-1,host=XXXXX,region=us-west-2 database_connections_maximum=0 1717095060000000000
> cloudwatch_aws_rds,account=XXXXXXX6611,db_instance_identifier=XXXXX-corp-multi-stage-1,host=XXXXX,region=us-west-2 database_connections_maximum=10 1717095060000000000
> cloudwatch_aws_rds,account=XXXXXXX3012,db_instance_identifier=XXXXX-synth-dev-1,host=XXXXX,region=us-west-2 database_connections_maximum=0 1717095060000000000
2024-05-30T20:51:17Z D! [agent] Stopping service inputs
2024-05-30T20:51:17Z D! [agent] Input channel closed
2024-05-30T20:51:17Z D! [agent] Stopped Successfully

From the sub account:

❯ docker run -e AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY -e AWS_SESSION_TOKEN  -it -v /Users/alexe/Downloads/telegraf-1.31.0/usr/bin/:/opt/telegraf-test -v /tmp/test.conf:/etc/telegraf/telegraf.conf:ro --rm telegraf:1.30.3-alpine /opt/telegraf-test/telegraf --debug --test --config /etc/telegraf/telegraf.conf
2024-05-30T20:51:27Z I! Loading config: /etc/telegraf/telegraf.conf
2024-05-30T20:51:27Z I! Starting Telegraf 1.31.0-8bda03e1 brought to you by InfluxData the makers of InfluxDB
2024-05-30T20:51:27Z I! Available plugins: 234 inputs, 9 aggregators, 32 processors, 26 parsers, 60 outputs, 6 secret-stores
2024-05-30T20:51:27Z I! Loaded inputs: cloudwatch (2x)
2024-05-30T20:51:27Z I! Loaded aggregators: 
2024-05-30T20:51:27Z I! Loaded processors: 
2024-05-30T20:51:27Z I! Loaded secretstores: 
2024-05-30T20:51:27Z W! Outputs are not used in testing mode!
2024-05-30T20:51:27Z I! Tags enabled: host=df664e9c4487
2024-05-30T20:51:27Z D! [agent] Initializing plugins
2024-05-30T20:51:27Z D! [agent] Starting service inputs
> cloudwatch_aws_ses,host=df664e9c4487,region=us-west-2 bounce_sum=2 1716580260000000000
> cloudwatch_aws_rds,account=XXXXXXX3012,db_instance_identifier=XXXXX-synth-devmark-1,host=XXXXX,region=us-west-2 database_connections_maximum=0 1717095060000000000
> cloudwatch_aws_rds,account=XXXXXXX3012,db_instance_identifier=XXXX-testing-1,host=XXXXX,region=us-west-2 database_connections_maximum=69 1717095060000000000
> cloudwatch_aws_rds,account=XXXXXXX3012,db_instance_identifier=XXXX-synth-dev-1,host=XXXXX,region=us-west-2 database_connections_maximum=0 1717095060000000000
> cloudwatch_aws_rds,account=XXXXXXX3012,db_instance_identifier=XXXX-testing-2,host=XXXXX,region=us-west-2 database_connections_maximum=0 1717095060000000000
2024-05-30T20:51:28Z D! [agent] Stopping service inputs
2024-05-30T20:51:28Z D! [agent] Input channel closed
2024-05-30T20:51:28Z D! [agent] Stopped Successfully

You can see how the monitoring account shows RDS servers from both itself (XXXXXXX6611), and the sub account (XXXXXXX3012).

alexeiser commented 4 months ago

There is nothing different between the two runs other than different AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and AWS_SESSION_TOKEN items.

powersj commented 4 months ago

There is nothing different between the two runs other than different AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and AWS_SESSION_TOKEN items.

That sounds like a permissions/role based issue that you need to resolve. If telegraf is fully capable of getting those metrics with one role, but not with another, doesn't that point to a permissions issue?

The next step would be to do another debug build with some debug output, but I'm also not sure what to collect, other than we probably are not seeing anything returned in the first place.

The only other issue I can think of is https://github.com/influxdata/telegraf/issues/11963 which required setting the AWS_REGION env variable, otherwise the wrong region was used, but that was for the cloudwatch output, not input.

alexeiser commented 4 months ago

I can guarantee it's not a permission issue. Cross region cloudwatch all runs within the same AWS account that the role runs in. The difference is that the GetMetricData call needs the AccountId that is returned by the list-metrics call.

In the AWS CLI examples at the top:

 "OwningAccounts": [
        "XXXXXXX33012"
    ]

So with a metric request like the following:

{
    "MetricDataQueries": [
        {
            "Id": "myRequest",
            "AccountId": "XXXXXXX33012",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/SES",
                    "MetricName": "Bounce",
                    "Dimensions": []
                },
                "Period": 86400,
                "Stat": "Average"
            },
            "Label": "myRequestLabel",
            "ReturnData": true
        }
    ],
    "StartTime": "2024-05-20T10:40:0000",
    "EndTime": "2024-05-30T14:12:0000"
}

Both AWS identities return the same output.

Sub account:

AWS_PROFILE=SUBACCOUNT aws cloudwatch get-metric-data --cli-input-json file:///tmp/jsonfile.json

{
    "MetricDataResults": [
        {
            "Id": "myRequest",
            "Label": "myRequestLabel",
            "Timestamps": [
                "2024-05-29T10:40:00+00:00"
            ],
            "Values": [
                1.0
            ],
            "StatusCode": "Complete"
        }
    ],
    "Messages": []
}

Monitoring account:

AWS_PROFILE=MONITORING aws cloudwatch get-metric-data --cli-input-json file:///tmp/jsonfile.json

{
    "MetricDataResults": [
        {
            "Id": "myRequest",
            "Label": "myRequestLabel",
            "Timestamps": [
                "2024-05-29T10:40:00+00:00"
            ],
            "Values": [
                1.0
            ],
            "StatusCode": "Complete"
        }
    ],
    "Messages": []
}
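
For reference, the request shape used above can be sketched in Go, with the per-query AccountId taken from the OwningAccounts list. This only models the JSON with local structs whose field names mirror the request shown earlier. The `buildQueries` helper, the `q0`-style IDs, the period and the `Sum` statistic are assumptions for illustration, and nothing here calls AWS or reflects how the plugin builds its queries.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Local structs mirroring the JSON keys of the request shown above.
type dimension struct {
	Name  string `json:"Name"`
	Value string `json:"Value"`
}

type metric struct {
	Namespace  string      `json:"Namespace"`
	MetricName string      `json:"MetricName"`
	Dimensions []dimension `json:"Dimensions"`
}

type metricStat struct {
	Metric metric `json:"Metric"`
	Period int    `json:"Period"`
	Stat   string `json:"Stat"`
}

type metricDataQuery struct {
	ID         string     `json:"Id"`
	AccountID  string     `json:"AccountId,omitempty"` // left out when empty
	MetricStat metricStat `json:"MetricStat"`
	ReturnData bool       `json:"ReturnData"`
}

// buildQueries (hypothetical helper) emits one query per owning account,
// or a single query without AccountId when no accounts were listed.
func buildQueries(namespace, name string, owningAccounts []string) []metricDataQuery {
	stat := metricStat{
		Metric: metric{Namespace: namespace, MetricName: name, Dimensions: []dimension{}},
		Period: 86400, // illustrative value, in seconds
		Stat:   "Sum", // illustrative statistic
	}
	if len(owningAccounts) == 0 {
		return []metricDataQuery{{ID: "q0", MetricStat: stat, ReturnData: true}}
	}
	queries := make([]metricDataQuery, 0, len(owningAccounts))
	for i, id := range owningAccounts {
		queries = append(queries, metricDataQuery{
			ID: fmt.Sprintf("q%d", i), AccountID: id, MetricStat: stat, ReturnData: true,
		})
	}
	return queries
}

func main() {
	queries := buildQueries("AWS/SES", "Bounce", []string{"XXXXXXX33012"})
	out, err := json.MarshalIndent(queries, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```
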
alexeiser commented 4 months ago

I decided to do my own investigation (build locally, add debug, etc) - and I think the source of the issue is https://github.com/influxdata/telegraf/blob/8bda03e1c0b176eab5c0070c7ab8e87c35b2febd/plugins/inputs/cloudwatch/cloudwatch.go#L247-L281

If I understand what the code is doing, it's trying to filter the list of accounts based on whether the metrics are enabled. There seem to be three cases:

  1. A metric is set as a filter with a dimension -> the accounts are added.
  2. A metric is not set -> the accounts are added.
  3. A metric is set, but there are no dimensions (like with Bounce) -> the accounts are not added to the dictionary, which causes my missing-metrics bug (as well as the index out of range error).

My gut says to just always add the accounts - but obviously there are some optimizations for not doing a get metric call if the account doesn't have that metric.
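
To make the three cases concrete, here is a simplified model of them in Go. It is a sketch of the behavior as described in this comment, not a copy of the plugin code: `listedMetric`, `metricFilter`, `collect` and the `alwaysAddAccounts` switch are all hypothetical names.

```go
package main

import "fmt"

// listedMetric is a hypothetical, simplified ListMetrics entry.
type listedMetric struct {
	name       string
	dimensions []string // dimension names only
	account    string   // owning account
}

// metricFilter is a hypothetical configured filter.
type metricFilter struct {
	name       string
	dimensions []string
}

// hasAll reports whether every wanted dimension name is present.
func hasAll(have, want []string) bool {
	set := make(map[string]bool, len(have))
	for _, d := range have {
		set[d] = true
	}
	for _, d := range want {
		if !set[d] {
			return false
		}
	}
	return true
}

// collect returns metric names and owning accounts as parallel slices.
func collect(filter *metricFilter, all []listedMetric, alwaysAddAccounts bool) (names, accounts []string) {
	for _, m := range all {
		switch {
		case filter == nil:
			// Case 2: no metric filter, accounts are added.
			names = append(names, m.name)
			accounts = append(accounts, m.account)
		case filter.name != m.name:
			continue
		case len(filter.dimensions) > 0:
			// Case 1: filter with dimensions, accounts are added.
			if hasAll(m.dimensions, filter.dimensions) {
				names = append(names, m.name)
				accounts = append(accounts, m.account)
			}
		default:
			// Case 3: filter without dimensions. The account is only
			// added when alwaysAddAccounts is set (the proposed change).
			names = append(names, m.name)
			if alwaysAddAccounts {
				accounts = append(accounts, m.account)
			}
		}
	}
	return names, accounts
}

func main() {
	all := []listedMetric{{name: "Bounce", account: "XXXXXXX33012"}}
	bounce := &metricFilter{name: "Bounce"}

	names, accounts := collect(bounce, all, false)
	fmt.Println(len(names), len(accounts)) // slices out of step: accounts[0] would panic

	names, accounts = collect(bounce, all, true)
	fmt.Println(len(names), len(accounts)) // slices line up again
}
```

In this model, always adding the account keeps the two slices the same length, which is what the indexing in getDataQueries appears to rely on.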

alexeiser commented 4 months ago

Since this was closed from the merge of #15428 - @powersj - did you want me to submit a new bug report for the account handling issue?

powersj commented 4 months ago

@alexeiser,

I really appreciate you digging into this last night. Let's re-open this issue and continue to work through your findings. I'll take a deeper look at your findings a bit later today.

powersj commented 4 months ago

My gut says to just always add the accounts - but obviously there are some optimizations for not doing a get metric call if the account doesn't have that metric.

I went back and read the PR from last year that added the linked accounts functionality. We did not have a discussion about why it was only added in two of the cases, but if I had to guess it was to support that user's particular scenario.

Given that adding the linked accounts is already opt-in and the current logic clearly misses some scenarios, always adding the accounts seems low-risk and probably the right thing to do.

What I need from you is to help test another PR. I don't quite have access to a scenario where this all takes place and I'm not 100% certain this is the correct fix, but can you take a look at #15440 and let me know if that is what you were thinking?

Thanks

alexeiser commented 4 months ago

As soon as the binaries are ready, I’ll give it a try

alexeiser commented 4 months ago

> cloudwatch_aws_ses,account=XXXXX3012,host=3376ca665404,region=us-west-2 bounce_sum=2 1716650400000000000

Yup - that version does return the metric as expected.

powersj commented 4 months ago

Thank you again for confirming! I'll get that reviewed and landed for v1.31.0, that will release on or around Monday, June 10.