lucian-ioan / public-notes

[AWS] [API Gateway] Figure out how to fetch metrics via metricbeat #1

Open lucian-ioan opened 1 year ago

lucian-ioan commented 1 year ago

API Gateway metrics are sometimes not fetched when running ./metricbeat run -v from beats/x-pack/metricbeat. Metricbeat itself starts and runs fine, but I am getting no events.

Possible issues:

1) No metrics can be fetched for any service -> ruled out: the same config with AWS/EC2 as the namespace and the appropriate metric names works. AWS/ApiGateway is the correct namespace, and the metric names used are also correct.

After fetching EC2 metrics for ~1 minute:

"metricbeat":{"aws":{"cloudwatch":{"events":42,"failures":0,"success":42}}}

After fetching API Gateway metrics for ~15 minutes:

"metricbeat":{"aws":{"cloudwatch":{"events":0,"failures":0,"success":0}}}

2) Too few metrics are being generated by the API Gateway?

3) ???

Config in beats/x-pack/metricbeat/modules.d/aws.yml:

- module: aws
  period: 1m
  metricsets:
    - cloudwatch
  access_key_id: REDACTED
  secret_access_key: REDACTED
  metrics:
  - namespace: AWS/ApiGateway
    statistic: ["Average"]
    name:
    - 4XXError
    - 5XXError
    - CacheHitCount
    - CacheMissCount
    - Count
    - IntegrationLatency
    - Latency

Metrics via CloudWatch:

[Screenshot: api_gateway_cloudwatch]

Metrics via AWS CLI command:

aws cloudwatch get-metric-statistics \
    --namespace AWS/ApiGateway \
    --metric-name Count \
    --start-time "$(date -u +"%Y-%m-%dT00:00:00Z")" \
    --end-time "$(date -u +"%Y-%m-%dT23:59:59Z")" \
    --period 300 \
    --statistics Average

[Screenshot: apigateway_CLI]

zmoog commented 1 year ago

First, check if the metrics are included in the time window the CloudWatch metricset uses:

https://github.com/elastic/beats/blob/f65554fdfb439f71f7107524c4c050767f2e9bb7/x-pack/metricbeat/module/aws/utils.go#L32-L37
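
To make the window check concrete, here is a rough sketch that prints the start and end timestamps of one collection cycle. It assumes the window ends at "now minus latency" and starts one period earlier, which is my reading of the linked helper and not a guarantee of its exact behavior. The function name `window` is hypothetical, and the `date -d` syntax is GNU-specific.

```shell
# Hypothetical helper: prints "<start> <end>" in UTC for one collection cycle.
# Arguments (all in seconds): now (epoch), period, latency.
# Assumption: end = now - latency, start = end - period.
window() {
  end=$(( $1 - $3 ))
  start=$(( end - $2 ))
  # GNU date syntax; on macOS/BSD use `date -u -r "$start"` instead.
  printf '%s %s\n' \
    "$(date -u -d "@$start" +%Y-%m-%dT%H:%M:%SZ)" \
    "$(date -u -d "@$end" +%Y-%m-%dT%H:%M:%SZ)"
}

# Window for the config above: period 1m, no latency configured.
window "$(date +%s)" 60 0

# The two timestamps can then be passed to the CLI command from the issue:
#   aws cloudwatch get-metric-statistics --namespace AWS/ApiGateway \
#     --metric-name Count --start-time <start> --end-time <end> \
#     --period 60 --statistics Average
```

If the CLI returns no datapoints for such a narrow, recent window but does return them for the full day, the metrics are probably arriving after the window Metricbeat queries.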

zmoog commented 1 year ago

Second, if the API Gateway service or CloudWatch takes time to publish the metrics, then Metricbeat may miss them.

We can leverage the latency configuration option to check and address this issue.

You could start by setting latency to a high value, such as five minutes.

Five minutes is very high, so if Metricbeat starts collecting metrics reliably with that setting, latency is the likely cause. If so, find the threshold by decreasing the latency value until you stop receiving metrics reliably.
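
As a sketch of the latency experiment described above, the config from the issue could be extended like this. It assumes `latency` is set at the module level next to `period`, as described in the Metricbeat AWS module documentation. The `5m` value is only a starting point for the test, not a recommended setting.

```yaml
- module: aws
  period: 1m
  # Assumption: shifts the query window back by 5 minutes so that
  # late-arriving CloudWatch datapoints fall inside it.
  latency: 5m
  metricsets:
    - cloudwatch
  access_key_id: REDACTED
  secret_access_key: REDACTED
  metrics:
  - namespace: AWS/ApiGateway
    statistic: ["Average"]
    name:
    - Count
    - Latency
```

Once events appear, lower `latency` step by step (for example 4m, 3m, 2m) to find the smallest value that still collects metrics reliably.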