Closed alexeiser closed 4 months ago
It looks like the stack trace is coming from the following code:
if c.IncludeLinkedAccounts {
accountID = aws.String(filtered.accounts[j])
(*dimension)["account"] = filtered.accounts[j]
}
The trace points at the first access to filtered.accounts[j], which sets the accountID. This would imply that there is no accountID?
I will build you a debug version tomorrow to make sure we are doing the right thing internally and then we can decide if we just skip that logic if it is out of bounds or if something else is going wrong.
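A minimal sketch of the "skip that logic if it is out of bounds" option, for discussion only. The `filteredMetric` type and the `accountFor` helper are hypothetical stand-ins, not the plugin's actual code; the only thing taken from the snippet above is the idea of an `accounts` slice indexed in step with the metrics.

```go
package main

import "fmt"

// filteredMetric is a hypothetical stand-in for the plugin's filtered metric
// set: metric names plus a slice of owning account IDs that is expected to be
// index-aligned with them.
type filteredMetric struct {
	metrics  []string
	accounts []string
}

// accountFor returns the owning account for metric j, or ok=false when the
// accounts slice has no entry at that index (the out-of-bounds case).
func accountFor(f filteredMetric, j int) (id string, ok bool) {
	if j < 0 || j >= len(f.accounts) {
		return "", false
	}
	return f.accounts[j], true
}

func main() {
	// One metric but no recorded accounts: indexing accounts[0] directly
	// would panic with an index-out-of-range error.
	f := filteredMetric{metrics: []string{"Bounce"}}
	for j, name := range f.metrics {
		if id, ok := accountFor(f, j); ok {
			fmt.Printf("%s account=%s\n", name, id)
		} else {
			fmt.Printf("%s (no account recorded, tag skipped)\n", name)
		}
	}
}
```

A guard like this only avoids the crash. It does not explain why the accounts slice would be shorter than the metrics slice in the first place.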
If it helps - here is the list metrics outputs:
❯ AWS_PROFILE=SUB_ACCOUNT aws cloudwatch list-metrics --namespace AWS/SES --region us-west-2 --metric-name Bounce
{
"Metrics": [
{
"Namespace": "AWS/SES",
"MetricName": "Bounce",
"Dimensions": []
}
]
}
❯ AWS_PROFILE=SUB_ACCOUNT aws cloudwatch list-metrics --namespace AWS/SES --region us-west-2 --metric-name Bounce --include-linked-accounts
{
"Metrics": [
{
"Namespace": "AWS/SES",
"MetricName": "Bounce",
"Dimensions": []
}
],
"OwningAccounts": [
"XXXXXXX33012"
]
}
❯ AWS_PROFILE=MONITORING_ACCOUNT aws cloudwatch list-metrics --namespace AWS/SES --region us-west-2 --metric-name Bounce --include-linked-accounts
{
"Metrics": [
{
"Namespace": "AWS/SES",
"MetricName": "Bounce",
"Dimensions": []
}
],
"OwningAccounts": [
"XXXXXXX33012"
]
}
❯ AWS_PROFILE=MONITORING_ACCOUNT aws cloudwatch list-metrics --namespace AWS/SES --region us-west-2 --metric-name Bounce
{
"Metrics": []
}
I have put up https://github.com/influxdata/telegraf/pull/15428 which will have some test artifacts attached in a comment by the telegraf-tiger in 20-30mins from this message. Can you download one of the artifacts and give it a try please?
Thanks!
I have put up #15428 which will have some test artifacts attached in a comment by the telegraf-tiger in 20-30mins from this message. Can you download one of the artifacts and give it a try please?
Good news: no crash (to be expected, since you prevent the call).
Bad news - no metrics either.
Same config file:
[[inputs.cloudwatch]]
region = "us-west-2"
period = "144h"
delay = "1h"
interval = "144h"
cache_ttl = "1h"
namespaces = ["AWS/SES"]
include_linked_accounts = true
ratelimit = 10
[[inputs.cloudwatch.metrics]]
names = ["Bounce"]
statistic_include = ["sum"]
Monitoring account:
> docker run -e AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY -e AWS_SESSION_TOKEN -it -v /Users/alexe/Downloads/telegraf-1.31.0/usr/bin/:/opt/telegraf-test -v /tmp/test.conf:/etc/telegraf/telegraf.conf:ro --rm telegraf:1.30.3-alpine /opt/telegraf-test/telegraf --debug --test --config /etc/telegraf/telegraf.conf
2024-05-30T19:52:28Z I! Loading config: /etc/telegraf/telegraf.conf
2024-05-30T19:52:28Z I! Starting Telegraf 1.31.0-8bda03e1 brought to you by InfluxData the makers of InfluxDB
2024-05-30T19:52:28Z I! Available plugins: 234 inputs, 9 aggregators, 32 processors, 26 parsers, 60 outputs, 6 secret-stores
2024-05-30T19:52:28Z I! Loaded inputs: cloudwatch
2024-05-30T19:52:28Z I! Loaded aggregators:
2024-05-30T19:52:28Z I! Loaded processors:
2024-05-30T19:52:28Z I! Loaded secretstores:
2024-05-30T19:52:28Z W! Outputs are not used in testing mode!
2024-05-30T19:52:28Z I! Tags enabled: host=5c66327c74cd
2024-05-30T19:52:28Z D! [agent] Initializing plugins
2024-05-30T19:52:28Z D! [agent] Starting service inputs
2024-05-30T19:52:28Z D! [agent] Stopping service inputs
2024-05-30T19:52:28Z D! [agent] Input channel closed
2024-05-30T19:52:28Z D! [agent] Stopped Successfully
Sub account
docker run -e AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY -e AWS_SESSION_TOKEN -it -v /Users/alexe/Downloads/telegraf-1.31.0/usr/bin/:/opt/telegraf-test -v /tmp/test.conf:/etc/telegraf/telegraf.conf:ro --rm telegraf:1.30.3-alpine /opt/telegraf-test/telegraf --debug --test --config /etc/telegraf/telegraf.conf
2024-05-30T19:52:43Z I! Loading config: /etc/telegraf/telegraf.conf
2024-05-30T19:52:43Z I! Starting Telegraf 1.31.0-8bda03e1 brought to you by InfluxData the makers of InfluxDB
2024-05-30T19:52:43Z I! Available plugins: 234 inputs, 9 aggregators, 32 processors, 26 parsers, 60 outputs, 6 secret-stores
2024-05-30T19:52:43Z I! Loaded inputs: cloudwatch
2024-05-30T19:52:43Z I! Loaded aggregators:
2024-05-30T19:52:43Z I! Loaded processors:
2024-05-30T19:52:43Z I! Loaded secretstores:
2024-05-30T19:52:43Z W! Outputs are not used in testing mode!
2024-05-30T19:52:43Z I! Tags enabled: host=c18dea55aa60
2024-05-30T19:52:43Z D! [agent] Initializing plugins
2024-05-30T19:52:43Z D! [agent] Starting service inputs
2024-05-30T19:52:44Z D! [agent] Stopping service inputs
2024-05-30T19:52:44Z D! [agent] Input channel closed
2024-05-30T19:52:44Z D! [agent] Stopped Successfully
> cloudwatch_aws_ses,host=c18dea55aa60,region=us-west-2 bounce_sum=2 1716576720000000000
Bad news - no metrics either.
Can you help me understand what you expected to see? From the output you showed I am assuming you see the bounce metric with your subaccount, but not your monitoring account? Wouldn't that point to a permission issue?
I want this fix in our next feature release a week from Monday, so I'm going to get that reviewed and we can talk about your possibly missing metrics.
In the example above - I would have expected the following: Monitoring account:
cloudwatch_aws_ses,host=c18dea55aa60,region=us-west-2,account=XXXXXXX33012 bounce_sum=2 1716576720000000000
You can see this earlier, where I made the same list-metrics calls: from the monitoring account, it can see the sub-account's metric.
Besides the credentials, what is the difference between those two runs?
The monitoring test uses an AWS IAM role from the monitoring account, and the sub-account test uses a sub-account role.
Here is a different example that does work and is not missing metrics. New config:
[[inputs.cloudwatch]]
region = "us-west-2"
period = "144h"
delay = "1h"
interval = "144h"
cache_ttl = "1h"
namespaces = ["AWS/SES"]
include_linked_accounts = true
ratelimit = 10
[[inputs.cloudwatch.metrics]]
names = ["Bounce"]
statistic_include = ["sum"]
[[inputs.cloudwatch]]
alias = "RDS"
region = "us-west-2"
period = "1h"
delay = "1h"
interval = "1h"
cache_ttl = "1h"
namespaces = ["AWS/RDS"]
include_linked_accounts = true
ratelimit = 10
[[inputs.cloudwatch.metrics]]
names = ["DatabaseConnections"]
statistic_include = ["maximum"]
[[inputs.cloudwatch.metrics.dimensions]]
name = "DBInstanceIdentifier"
value = "*"
This config adds an RDS metric.
MONITORING ACCOUNT:
❯ docker run -e AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY -e AWS_SESSION_TOKEN -it -v /Users/alexe/Downloads/telegraf-1.31.0/usr/bin/:/opt/telegraf-test -v /tmp/test.conf:/etc/telegraf/telegraf.conf:ro --rm telegraf:1.30.3-alpine /opt/telegraf-test/telegraf --debug --test --config /etc/telegraf/telegraf.conf
2024-05-30T20:51:16Z I! Loading config: /etc/telegraf/telegraf.conf
2024-05-30T20:51:16Z I! Starting Telegraf 1.31.0-8bda03e1 brought to you by InfluxData the makers of InfluxDB
2024-05-30T20:51:16Z I! Available plugins: 234 inputs, 9 aggregators, 32 processors, 26 parsers, 60 outputs, 6 secret-stores
2024-05-30T20:51:16Z I! Loaded inputs: cloudwatch (2x)
2024-05-30T20:51:16Z I! Loaded aggregators:
2024-05-30T20:51:16Z I! Loaded processors:
2024-05-30T20:51:16Z I! Loaded secretstores:
2024-05-30T20:51:16Z W! Outputs are not used in testing mode!
2024-05-30T20:51:16Z I! Tags enabled: host=5f07dc0b6ce8
2024-05-30T20:51:16Z D! [agent] Initializing plugins
2024-05-30T20:51:16Z D! [agent] Starting service inputs
> cloudwatch_aws_rds,account=XXXXXXX6611,db_instance_identifier=XXXXX-multi-stage-1,host=XXXXX,region=us-west-2 database_connections_maximum=21 1717095060000000000
> cloudwatch_aws_rds,account=XXXXXXX6611,db_instance_identifier=XXXXX-multi-stage-1,host=XXXXX,region=us-west-2 database_connections_maximum=2 1717095060000000000
> cloudwatch_aws_rds,account=XXXXXXX3012,db_instance_identifier=XXXXX-testing-2,host=XXXXX,region=us-west-2 database_connections_maximum=0 1717095060000000000
> cloudwatch_aws_rds,account=XXXXXXX3012,db_instance_identifier=XXXXX-testing-1,host=XXXXX,region=us-west-2 database_connections_maximum=69 1717095060000000000
> cloudwatch_aws_rds,account=XXXXXXX6611,db_instance_identifier=XXXXX-db,host=XXXXX,region=us-west-2 database_connections_maximum=0 1717095060000000000
> cloudwatch_aws_rds,account=XXXXXXX3012,db_instance_identifier=XXXXX-synth-devmark-1,host=XXXXX,region=us-west-2 database_connections_maximum=0 1717095060000000000
> cloudwatch_aws_rds,account=XXXXXXX6611,db_instance_identifier=XXXXX-corp-multi-stage-1,host=XXXXX,region=us-west-2 database_connections_maximum=10 1717095060000000000
> cloudwatch_aws_rds,account=XXXXXXX3012,db_instance_identifier=XXXXX-synth-dev-1,host=XXXXX,region=us-west-2 database_connections_maximum=0 1717095060000000000
2024-05-30T20:51:17Z D! [agent] Stopping service inputs
2024-05-30T20:51:17Z D! [agent] Input channel closed
2024-05-30T20:51:17Z D! [agent] Stopped Successfully
From the sub account:
❯ docker run -e AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY -e AWS_SESSION_TOKEN -it -v /Users/alexe/Downloads/telegraf-1.31.0/usr/bin/:/opt/telegraf-test -v /tmp/test.conf:/etc/telegraf/telegraf.conf:ro --rm telegraf:1.30.3-alpine /opt/telegraf-test/telegraf --debug --test --config /etc/telegraf/telegraf.conf
2024-05-30T20:51:27Z I! Loading config: /etc/telegraf/telegraf.conf
2024-05-30T20:51:27Z I! Starting Telegraf 1.31.0-8bda03e1 brought to you by InfluxData the makers of InfluxDB
2024-05-30T20:51:27Z I! Available plugins: 234 inputs, 9 aggregators, 32 processors, 26 parsers, 60 outputs, 6 secret-stores
2024-05-30T20:51:27Z I! Loaded inputs: cloudwatch (2x)
2024-05-30T20:51:27Z I! Loaded aggregators:
2024-05-30T20:51:27Z I! Loaded processors:
2024-05-30T20:51:27Z I! Loaded secretstores:
2024-05-30T20:51:27Z W! Outputs are not used in testing mode!
2024-05-30T20:51:27Z I! Tags enabled: host=df664e9c4487
2024-05-30T20:51:27Z D! [agent] Initializing plugins
2024-05-30T20:51:27Z D! [agent] Starting service inputs
> cloudwatch_aws_ses,host=df664e9c4487,region=us-west-2 bounce_sum=2 1716580260000000000
> cloudwatch_aws_rds,account=XXXXXXX3012,db_instance_identifier=XXXXX-synth-devmark-1,host=XXXXX,region=us-west-2 database_connections_maximum=0 1717095060000000000
> cloudwatch_aws_rds,account=XXXXXXX3012,db_instance_identifier=XXXX-testing-1,host=XXXXX,region=us-west-2 database_connections_maximum=69 1717095060000000000
> cloudwatch_aws_rds,account=XXXXXXX3012,db_instance_identifier=XXXX-synth-dev-1,host=XXXXX,region=us-west-2 database_connections_maximum=0 1717095060000000000
> cloudwatch_aws_rds,account=XXXXXXX3012,db_instance_identifier=XXXX-testing-2,host=XXXXX,region=us-west-2 database_connections_maximum=0 1717095060000000000
2024-05-30T20:51:28Z D! [agent] Stopping service inputs
2024-05-30T20:51:28Z D! [agent] Input channel closed
2024-05-30T20:51:28Z D! [agent] Stopped Successfully
You can see how the monitoring account shows RDS servers from both itself (XXXXXXX6611) and the sub account (XXXXXXX3012).
There is nothing different between the two runs other than different AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and AWS_SESSION_TOKEN items.
There is nothing different between the two runs other than different AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and AWS_SESSION_TOKEN items.
That sounds like a permissions/role based issue that you need to resolve. If telegraf is fully capable of getting those metrics with one role, but not with another, doesn't that point to a permissions issue?
The next step would be to do another debug build with some debug output, but I'm also not sure what to collect, other than we probably are not seeing anything returned in the first place.
The only other issue I can think of is https://github.com/influxdata/telegraf/issues/11963 which required setting the AWS_REGION env variable, otherwise the wrong region was used, but that was for the cloudwatch output, not input.
I can guarantee it's not a permission issue. Cross region cloudwatch all runs within the same AWS account that the role runs in. The difference is that the getMetricData call needs the AccountID that is returned by the list-metrics call, as shown in the aws cli examples at the top:
"OwningAccounts": [
"XXXXXXX33012"
]
So with a metric request like the following:
{
"MetricDataQueries": [
{
"Id": "myRequest",
"AccountId": "XXXXXXX33012",
"MetricStat": {
"Metric": {
"Namespace": "AWS/SES",
"MetricName": "Bounce",
"Dimensions": []
},
"Period": 86400,
"Stat": "Average"
},
"Label": "myRequestLabel",
"ReturnData": true
}
],
"StartTime": "2024-05-20T10:40:0000",
"EndTime": "2024-05-30T14:12:0000"
}
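For reference, a rough Go sketch of assembling the same request body, with AccountId filled in only when an owning account is known. The struct and function names are hypothetical and mirror only the JSON fields shown above; they are not the AWS SDK's types, and the account ID is a placeholder.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Hypothetical structs covering only the fields used in the request above.
type dimension struct {
	Name  string `json:"Name"`
	Value string `json:"Value"`
}

type metric struct {
	Namespace  string      `json:"Namespace"`
	MetricName string      `json:"MetricName"`
	Dimensions []dimension `json:"Dimensions"`
}

type metricStat struct {
	Metric metric `json:"Metric"`
	Period int    `json:"Period"`
	Stat   string `json:"Stat"`
}

type dataQuery struct {
	ID         string     `json:"Id"`
	AccountID  string     `json:"AccountId,omitempty"` // left out when empty
	MetricStat metricStat `json:"MetricStat"`
	Label      string     `json:"Label"`
	ReturnData bool       `json:"ReturnData"`
}

// buildQuery creates one query entry for a dimensionless metric. Passing an
// empty account leaves AccountId out of the serialized request.
func buildQuery(id, account, namespace, name string) dataQuery {
	return dataQuery{
		ID:        id,
		AccountID: account,
		MetricStat: metricStat{
			Metric: metric{Namespace: namespace, MetricName: name, Dimensions: []dimension{}},
			Period: 86400,
			Stat:   "Average",
		},
		Label:      id + "Label",
		ReturnData: true,
	}
}

// toJSON renders a query as indented JSON.
func toJSON(q dataQuery) (string, error) {
	b, err := json.MarshalIndent(q, "", "  ")
	return string(b), err
}

func main() {
	q := buildQuery("myRequest", "111111111111", "AWS/SES", "Bounce")
	s, err := toJSON(q)
	if err != nil {
		fmt.Println("marshal failed:", err)
		return
	}
	fmt.Println(s)
}
```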
Both AWS identities return the same output. Sub account:
AWS_PROFILE=SUBACCOUNT aws cloudwatch get-metric-data --cli-input-json file:///tmp/jsonfile.json
{
"MetricDataResults": [
{
"Id": "myRequest",
"Label": "myRequestLabel",
"Timestamps": [
"2024-05-29T10:40:00+00:00"
],
"Values": [
1.0
],
"StatusCode": "Complete"
}
],
"Messages": []
}
Monitoring account:
AWS_PROFILE=MONITORING aws cloudwatch get-metric-data --cli-input-json file:///tmp/jsonfile.json
{
"MetricDataResults": [
{
"Id": "myRequest",
"Label": "myRequestLabel",
"Timestamps": [
"2024-05-29T10:40:00+00:00"
],
"Values": [
1.0
],
"StatusCode": "Complete"
}
],
"Messages": []
}
I decided to do my own investigation (build locally, add debug, etc) - and I think the source of the issue is https://github.com/influxdata/telegraf/blob/8bda03e1c0b176eab5c0070c7ab8e87c35b2febd/plugins/inputs/cloudwatch/cloudwatch.go#L247-L281
If I understand what the code is doing, it's trying to filter the list of accounts based on whether the metrics are enabled. There seem to be three cases:
1: A metric is set as a filter with a dimension -> the accounts are added.
2: A metric is not set -> the accounts are added.
3: A metric is set, but there are no dimensions (like with Bounce) -> the accounts are not added to the dictionary, which causes my missing-metrics bug (as well as the index-out-of-bounds error).
My gut says to just always add the accounts - but obviously there are some optimizations for not doing a get metric call if the account doesn't have that metric.
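To make the "always add the accounts" idea concrete, here is a small self-contained sketch. It is not the plugin's code and not necessarily what any fix would look like: the `listed` and `filter` types, the helper names, and the account IDs are all hypothetical. The point is only that the account is appended in the same place as the metric, so the two slices cannot drift apart in any of the three cases.

```go
package main

import "fmt"

// listed is a hypothetical, simplified list-metrics entry: a metric name, its
// dimension names, and the owning account reported alongside it.
type listed struct {
	name       string
	dimensions []string
	account    string
}

// filter is a hypothetical, simplified metric filter from the config.
// Empty names means no metric filter; wantDims are required dimension names.
type filter struct {
	names    []string
	wantDims []string
}

func contains(list []string, s string) bool {
	for _, v := range list {
		if v == s {
			return true
		}
	}
	return false
}

// matches reports whether a listed metric passes the filter.
func matches(m listed, f filter) bool {
	if len(f.names) > 0 && !contains(f.names, m.name) {
		return false
	}
	for _, d := range f.wantDims {
		if !contains(m.dimensions, d) {
			return false
		}
	}
	return true
}

// selectMetrics keeps the matching metrics and always records the owning
// account next to each one, so both slices stay index-aligned.
func selectMetrics(all []listed, f filter) (metrics []listed, accounts []string) {
	for _, m := range all {
		if !matches(m, f) {
			continue
		}
		metrics = append(metrics, m)
		accounts = append(accounts, m.account)
	}
	return metrics, accounts
}

func main() {
	all := []listed{
		{name: "Bounce", account: "111111111111"},
		{name: "DatabaseConnections", dimensions: []string{"DBInstanceIdentifier"}, account: "222222222222"},
	}
	cases := map[string]filter{
		"1: metric with dimension": {names: []string{"DatabaseConnections"}, wantDims: []string{"DBInstanceIdentifier"}},
		"2: no metric filter":      {},
		"3: metric, no dimensions": {names: []string{"Bounce"}},
	}
	for label, f := range cases {
		m, a := selectMetrics(all, f)
		fmt.Printf("%s -> %d metrics, %d accounts\n", label, len(m), len(a))
	}
}
```

The cost is the one mentioned above: nothing here skips a get-metric call for an account that does not have the metric.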
Since this was closed from the merge of #15428 - @powersj - did you want me to submit a new bug report for the account handling issue?
@alexeiser,
I really appreciate you digging into this last night. Let's re-open this issue and continue to work through your findings. I'll take a deeper look at your findings a bit later today.
My gut says to just always add the accounts - but obviously there are some optimizations for not doing a get metric call if the account doesn't have that metric.
I went back and read the PR from last year that added the linked accounts functionality. We did not have a discussion about why it was only added in two of the cases, but if I had to guess it was to support that user's particular scenario.
Given that adding the linked accounts is already opt-in and it is clearly missing different scenarios, always adding the accounts seems like a low-risk change and probably the right thing to do.
What I need from you is to help test another PR. I don't quite have access to a scenario where this all takes place and I'm not 100% certain this is the correct fix, but can you take a look at #15440 and let me know if that is what you were thinking?
Thanks
As soon as the binaries are ready, I’ll give it a try
> cloudwatch_aws_ses,account=XXXXX3012,host=3376ca665404,region=us-west-2 bounce_sum=2 1716650400000000000
Yup - that version does return the metric as expected.
Thank you again for confirming! I'll get that reviewed and landed for v1.31.0, which will release on or around Monday, June 10.
Relevant telegraf.conf
Logs from Telegraf
System info
Telegraf 1.30.2 on Ubuntu 22.04. Also occurs on Telegraf 1.30.3 in Docker, and on 1.30.0.
Docker
No response
Steps to reproduce
Expected behavior
Return the metrics for the monitored accounts.
Actual behavior
Stack trace / crash.
Additional info
Execution without include_linked_accounts=true returns valid results when using the sub-account credentials, and returns no results (as expected) on the monitoring account.