prometheus-community / yet-another-cloudwatch-exporter

Prometheus exporter for AWS CloudWatch - Discovers services through AWS tags, gets CloudWatch metrics data and provides them as Prometheus metrics with AWS tags as labels
Apache License 2.0

[BUG] large amount of SQS queues causes missing metrics #489

Open OliverKlette85 opened 2 years ago

OliverKlette85 commented 2 years ago

Is there an existing issue for this?

Current Behavior

We have around 9.5k SQS queues in the eu-west-1 region of one of our prod accounts, but the YACE exporter only provides metrics for around 5k of them.


I already tried to run several YACE instances in parallel, but that didn't improve the situation. I also requested AWS quota increases for GetMetricData (1,000 requests per second) and ListMetrics (100 requests per second), and according to AWS monitoring we are far from reaching those limits.

In the YACE debug log I couldn't find any entries that would explain the missing metrics.

Expected Behavior

The exporter should provide metrics for all SQS queues (this worked with the official CloudWatch exporter).

Steps To Reproduce

config:

```yaml
extraArgs:
  scraping-interval: 120
  debug: true

config: |-
  discovery:
    exportedTagsOnMetrics:
      sqs:
        - dh_app
        - dh_country
        - dh_env
        - dh_platform
        - dh_region
        - dh_squad
        - dh_tribe
    jobs:
    - type: sqs
      regions:
        - eu-west-1
      delay: 600
      period: 120
      length: 120
      awsDimensions:
        - QueueName
      metrics:
        - name: ApproximateAgeOfOldestMessage
          statistics:
            - Average
```

Anything else?

No response

thomaspeitz commented 2 years ago

Wow, that's a huge amount of SQS queues 🎉

We need more debugging logs to find the error here. Will try to add the debugging in the next seven days.

Alternatively, provide me a separate AWS account with 6k SQS queues already created so I can debug this:

Cross account sharing via: arn:aws:iam::838758336246:user/debug-yace-489

Don't forget to add the permissions for this user as well:

"tag:GetResources",
"cloudwatch:GetMetricData",
"cloudwatch:GetMetricStatistics",
"cloudwatch:ListMetrics"

Will debug it in the next 7 days.

OliverKlette85 commented 2 years ago

Hi Thomas, thanks for your quick reaction.

I created a role with the desired permissions and a trust policy for your user in our stg account:

arn:aws:iam::487596255802:role/yace_debug

This account has actually over 12 k SQS queues in the eu-west-1 region :)

Please ping if you need anything from my side.

thomaspeitz commented 2 years ago
{"level":"info","msg":"Couldn't describe resources for region eu-west-1: AccessDeniedException: User: arn:aws:sts::487596255802:assumed-role/yace_debug/1638822531663485000 is not authorized to perform: tag:GetResources because no identity-based policy allows the tag:GetResources action\n\tstatus code: 400, request id: f6e26e95-0a6d-4632-b2a6-052425ceeeff\n","time":"2021-12-06T21:28:52+01:00"}

Could you double-check on your side that everything is configured correctly? It seems I can successfully switch into the role but am missing the permissions:

Don't forget to add the permissions for this user as well:

"tag:GetResources",
"cloudwatch:GetMetricData",
"cloudwatch:GetMetricStatistics",
"cloudwatch:ListMetrics"
OliverKlette85 commented 2 years ago

Yeah, I set these rights. Maybe the issue was caused by the condition I set at the policy level. I moved it to the trust relationship. Could you please test again?
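The trust relationship described here might look roughly as follows. This is a hedged sketch assuming the debug user ARN quoted earlier in the thread; the actual condition that was moved is not shown in the thread, so it is omitted:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::838758336246:user/debug-yace-489"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
```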

thomaspeitz commented 2 years ago

Okay, I see the issue. Now I have something to debug with. I had two small suspicions in mind, but neither turned out to be the issue.

Need to dig deeper.

Greetings, Thomas :)

OliverKlette85 commented 2 years ago

Thanks for the efforts! Please let me know if you need something from my side.

Greetings from Berlin :)

thomaspeitz commented 2 years ago

Greetings back from Aachen (currently).

FYI: I was not able to put much time into it (and have not found anything yet) and will only work on this again at the end of this week due to private matters. It would be nice if you keep the role active so I can debug it further. I still don't understand what happens there; I was expecting pagination bugs, but that does not seem to be the problem.

OliverKlette85 commented 2 years ago

Thanks for the update. I will keep the role active.

OliverKlette85 commented 2 years ago

Hi @thomaspeitz did you manage to have another look?

endyrocket commented 2 years ago

Hi! We're facing the same issue: we have around 350 queues and some of them are entirely ignored. Reverting to the old CloudWatch exporter fixes the issue. We're using version v0.28.0-alpha.

eusokolov commented 2 years ago

Hi, we are having the same issue, happy to see that it was reported already :)

thomaspeitz commented 2 years ago

Sorry, I was on vacation. Back again.

@OliverKlette85 if you still have the IAM role configured, I will take another look at this topic this week.

OliverKlette85 commented 2 years ago

Yes it is still active.

thomaspeitz commented 2 years ago

That's worth gold to know: "Reverting to the old cloudwatch exporter fixes the issue. We're using version: v0.28.0-alpha" @endyrocket - thanks! That makes it easier to debug.

Awesome, thank you @OliverKlette85, I will probably work on that Thursday/Friday.

Currently my active work on the project is cut to 2 hours a week, since the project generates no revenue. I will try my best to fix this; it is (at least) at the top of the list of things to get fixed.

mmanjos commented 2 years ago

I've been experiencing the same issue and I think I found a data point. All of my missing SQS queues don't have any tags applied to them. As soon as I applied one tag (anything), a few minutes later the metrics would start showing up in YACE.

I think the issue is that this API call to resourcegroupstaggingapi/get-resources used to return all SQS-type resources, but now AWS is only returning those that have been tagged.

{"ResourceTypeFilters":["sqs"],"ResourcesPerPage":100}

Maybe there's another way to get the list of resources to query? Or just tag all of your SQS queues with something arbitrary.
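The observed behavior (untagged resources silently missing from the GetResources response, so YACE never enumerates them) can be illustrated with a small sketch. The data and the filtering function are hypothetical stand-ins for illustration, not YACE code:

```python
# Hypothetical illustration of the behavior described above:
# resourcegroupstaggingapi:GetResources appears to return only
# resources that carry at least one tag, so untagged SQS queues
# are never discovered.

def discoverable(resources):
    """Keep only resources with at least one tag, mimicking the
    observed GetResources response for ResourceTypeFilters=["sqs"]."""
    return [r for r in resources if r.get("Tags")]

queues = [
    {"ResourceARN": "arn:aws:sqs:eu-west-1:123456789012:tagged-queue",
     "Tags": [{"Key": "dh_app", "Value": "demo"}]},
    {"ResourceARN": "arn:aws:sqs:eu-west-1:123456789012:untagged-queue",
     "Tags": []},
]

visible = [r["ResourceARN"] for r in discoverable(queues)]
print(visible)  # only the tagged queue survives the filter
```

This would also explain why adding any arbitrary tag makes the missing queues reappear in the metrics a few minutes later.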

nickyfoster commented 1 year ago

@mmanjos I can confirm that adding tags to the SQS queues solves this issue.

I had the same problem with my queues not being visible in the exported metrics. It turned out those queues had zero tags on them. After adding an arbitrary tag I was able to query the metrics.