influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

cloudwatch output plugin not working with ec2 instance profiles. #3474

Closed sbalagopal closed 6 years ago

sbalagopal commented 6 years ago

Bug report

The CloudWatch output plugin cannot connect to CloudWatch from an EC2 machine whose instance profile allows all operations on the CloudWatch service. The GetSessionToken call fails when it is made with the EC2 instance's session credentials. According to AWS, this is not allowed: "AccessDenied: Cannot call GetSessionToken with session credentials".

Relevant cloudwatch.go:

./telegraf --config telegraf.conf --debug
2017-11-14T07:59:08Z D! Attempting connection to output: cloudwatch
2017-11-14T07:59:08Z E! cloudwatch: Cannot use credentials to connect to AWS : AccessDenied: Cannot call GetSessionToken with session credentials status code: 403, request id: b169b099-c911-11e7-b9c4-bd9935259c39
2017-11-14T07:59:08Z E! Failed to connect to output cloudwatch, retrying in 15s, error was 'AccessDenied: Cannot call GetSessionToken with session credentials status code: 403, request id: b169b099-c911-11e7-b9c4-bd9935259c39'
2017-11-14T07:59:23Z E! cloudwatch: Cannot use credentials to connect to AWS : AccessDenied: Cannot call GetSessionToken with session credentials status code: 403, request id: ba5c7d38-c911-11e7-b9c4-bd9935259c39
2017-11-14T07:59:23Z E! AccessDenied: Cannot call GetSessionToken with session credentials status code: 403, request id: ba5c7d38-c911-11e7-b9c4-bd9935259c39

System info:

Telegraf v1.5.0~136c15b (git: master 136c15b)
CentOS Linux release 7.3.1611 (Core)

(The instance profile has all access allowed on cloudwatch service)

Steps to reproduce:

  1. Bring up EC2 instance with instance profile that allows all cloudwatch operations.
  2. Build and configure telegraf to write metrics (mem, cpu, disk, etc.) to cloudwatch.
  3. Make sure the AWS tokens are not available as environment variables or in any configuration file for telegraf, then run telegraf. It should fail when the plugin makes the GetSessionToken call.

Expected behavior:

Telegraf should connect to cloudwatch and write metrics to the configured namespace.

Actual behavior:

Telegraf exits with an error message that says "Cannot use credentials to connect to AWS".

Additional info:

The actual connection to cloudwatch does work with an instance profile on EC2 servers. Only the validity check that uses "GetSessionToken" fails, and that failure causes the connection attempt as a whole to be reported as failed. If I deliberately bypass the error check and continue, it works as expected and the metrics are posted to cloudwatch. The check should rely on a call that works with an instance profile on EC2 servers.

The previous version of telegraf, 1.4.4, works fine with instance profiles because the "ListMetrics" call does work with an instance profile.

arohter commented 6 years ago

I believe use of http://docs.aws.amazon.com/STS/latest/APIReference/API_GetSessionToken.html is incompatible with IAM Instance Profile roles, so we can't use sts.GetSessionToken as a validation test when falling through to instance profile creds.

danielnelson commented 6 years ago

I assume this was caused by #3335.

@adamchainz Do you have any thoughts on how to solve?

Perhaps we should just remove this check and allow credential issues to be reported when we call PutMetricData.

adamchainz commented 6 years ago

I made a mistake. It's https://docs.aws.amazon.com/STS/latest/APIReference/API_GetCallerIdentity.html that should be called to check that credentials are valid; it doesn't require any permissions afaik. I'll make a PR adding the check back in with this endpoint.
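A minimal sketch of the difference between the two checks, assuming the behavior described in this thread. It is not the plugin's implementation: `identityAPI`, `fakeSTS`, and `validateCredentials` are hypothetical names, the ARN is a placeholder, and the fake client only simulates the reported STS behavior so the example runs without AWS access. In the AWS SDK for Go, the real counterpart of the check is the `GetCallerIdentity` method on the `sts` client.

```go
package main

import (
	"errors"
	"fmt"
)

// identityAPI is a hypothetical, narrowed view of an STS client.
type identityAPI interface {
	GetSessionToken() error
	GetCallerIdentity() (arn string, err error)
}

// fakeSTS simulates the behavior reported in this issue. It is not
// a real AWS client.
type fakeSTS struct {
	sessionCredentials bool   // true for instance-profile (temporary) credentials
	arn                string // empty means the credentials are invalid
}

// GetSessionToken is rejected for session credentials, matching the
// error message quoted in the bug report.
func (f fakeSTS) GetSessionToken() error {
	if f.sessionCredentials {
		return errors.New("AccessDenied: Cannot call GetSessionToken with session credentials")
	}
	return nil
}

// GetCallerIdentity succeeds for any valid credentials in this simulation.
func (f fakeSTS) GetCallerIdentity() (string, error) {
	if f.arn == "" {
		return "", errors.New("simulated: credentials rejected")
	}
	return f.arn, nil
}

// validateCredentials checks credentials with GetCallerIdentity only.
func validateCredentials(api identityAPI) error {
	if _, err := api.GetCallerIdentity(); err != nil {
		return fmt.Errorf("cannot use credentials to connect to AWS: %w", err)
	}
	return nil
}

func main() {
	// Placeholder ARN for an instance-profile role session.
	profile := fakeSTS{
		sessionCredentials: true,
		arn:                "arn:aws:sts::123456789012:assumed-role/example-role/i-0example",
	}

	fmt.Println("GetSessionToken check:", profile.GetSessionToken())
	fmt.Println("GetCallerIdentity check:", validateCredentials(profile))
}
```

With the simulated instance-profile credentials, the first check returns the AccessDenied error from the report and the second returns no error, which is the outcome the proposed PR is aiming for.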

danielnelson commented 6 years ago

Thanks @adamchainz, yeah, let's try to use this and we can do some testing to confirm.