prometheus-community / ecs_exporter

Prometheus exporter for Amazon Elastic Container Service (ECS)
Apache License 2.0

Release 0.2.0 - Multiple Issues #40

Closed: jseiser closed this issue 1 year ago

jseiser commented 2 years ago

@SuperQ @roidelapluie @sysadmind

I think the latest release broke most things.

Steps to reproduce:

  1. Deploy 0.2.0 into ECS Fargate: image: "quay.io/prometheuscommunity/ecs-exporter:v0.2.0"
  2. Curl the endpoint. curl 10.1.110.149:9779/metrics > metrics.txt

What we are seeing:

  1. Memory metrics all return 0 now.
# HELP ecs_memory_bytes Memory usage in bytes.
# TYPE ecs_memory_bytes gauge
ecs_memory_bytes{container="console"} 0
ecs_memory_bytes{container="log_router"} 0
ecs_memory_bytes{container="nginx"} 0
ecs_memory_bytes{container="prom_exporter"} 0
# HELP ecs_memory_cache_usage Memory cache usage in bytes.
# TYPE ecs_memory_cache_usage gauge
ecs_memory_cache_usage{container="console"} 0
ecs_memory_cache_usage{container="log_router"} 0
ecs_memory_cache_usage{container="nginx"} 0
ecs_memory_cache_usage{container="prom_exporter"} 0
# HELP ecs_memory_limit_bytes Memory limit in bytes.
# TYPE ecs_memory_limit_bytes gauge
ecs_memory_limit_bytes{container="console"} 0
ecs_memory_limit_bytes{container="log_router"} 0
ecs_memory_limit_bytes{container="nginx"} 0
ecs_memory_limit_bytes{container="prom_exporter"} 0
# HELP ecs_memory_max_bytes Maximum memory usage in bytes.
# TYPE ecs_memory_max_bytes gauge
ecs_memory_max_bytes{container="console"} 0
ecs_memory_max_bytes{container="log_router"} 0
ecs_memory_max_bytes{container="nginx"} 0
ecs_memory_max_bytes{container="prom_exporter"} 0
  2. The CPU metrics do not even show up.
[automation@ip-10-1-118-213 ecs-service-discovery]$ grep cpu metrics.txt
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 0.14
jseiser commented 2 years ago

Same issue with the main image.

jseiser commented 2 years ago

The only log generated:

{
    "container_id": "50ab69a6813f4ceca8b38402f74d86ae-33489729",
    "container_name": "prom_exporter",
    "deployment_id": "JLS",
    "ecs_cluster": "Foundation_DEV_JLS",
    "ecs_task_arn": "stuff here",
    "ecs_task_definition": "Foundation-DEV-JLS-Console:27",
    "environment": "DEV",
    "layer": "console",
    "log": "2022/08/08 19:56:34 Starting server at \":9779\"",
    "source": "stderr"
}
SuperQ commented 2 years ago

Hmm. It would help to have a debug log and an example ECS API request so we can try and reproduce it.

jseiser commented 2 years ago

Hmm. It would help to have a debug log and an example ECS API request so we can try and reproduce it.

@SuperQ

If you can tell me how to generate what you want, I can get it.

jseiser commented 2 years ago

@SuperQ

While waiting for further instruction, I reverted to the older v0.1.1 image and can confirm the metrics return; the CPU ones are just incorrect, as we already know.

[automation@ip-10-1-118-213 ecs-service-discovery]$ curl 10.1.110.54:9779/metrics > metrics.txt
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 11530    0 11530    0     0  2251k      0 --:--:-- --:--:-- --:--:-- 2251k
[automation@ip-10-1-118-213 ecs-service-discovery]$ grep cpu metrics.txt
# HELP ecs_cpu_seconds_total Total CPU usage in seconds.
# TYPE ecs_cpu_seconds_total counter
ecs_cpu_seconds_total{container="console",cpu="0"} 5.834025195e+07
ecs_cpu_seconds_total{container="console",cpu="1"} 5.421806216e+07
ecs_cpu_seconds_total{container="log_router",cpu="0"} 6.18522212e+06
ecs_cpu_seconds_total{container="log_router",cpu="1"} 5.15301504e+06
ecs_cpu_seconds_total{container="nginx",cpu="0"} 7.73859699e+06
ecs_cpu_seconds_total{container="nginx",cpu="1"} 1.002344517e+07
ecs_cpu_seconds_total{container="prom_exporter",cpu="0"} 2.51195986e+06
ecs_cpu_seconds_total{container="prom_exporter",cpu="1"} 4.74037571e+06
[automation@ip-10-1-118-213 ecs-service-discovery]$ grep ecs_mem metrics.txt
# HELP ecs_memory_bytes Memory usage in bytes.
# TYPE ecs_memory_bytes gauge
ecs_memory_bytes{container="console"} 1.63786752e+08
ecs_memory_bytes{container="log_router"} 6.3922176e+07
ecs_memory_bytes{container="nginx"} 2.8196864e+07
ecs_memory_bytes{container="prom_exporter"} 2.6288128e+07
SuperQ commented 2 years ago

There should be an env var, ECS_CONTAINER_METADATA_URI_V4

curl -o stats.json "${ECS_CONTAINER_METADATA_URI_V4}/task/stats"
curl -o task.json "${ECS_CONTAINER_METADATA_URI_V4}/task"
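
For illustration, the same two documents could also be fetched programmatically; a rough Go sketch using only the standard library (not the exporter's own code):

package main

import (
    "fmt"
    "io"
    "log"
    "net/http"
    "os"
)

// fetch GETs a URL and returns the response body.
func fetch(url string) ([]byte, error) {
    resp, err := http.Get(url)
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()
    return io.ReadAll(resp.Body)
}

func main() {
    // The ECS agent injects this variable into every container in the task.
    base := os.Getenv("ECS_CONTAINER_METADATA_URI_V4")
    if base == "" {
        log.Fatal("ECS_CONTAINER_METADATA_URI_V4 is not set; not running inside an ECS task?")
    }
    for _, path := range []string{"/task", "/task/stats"} {
        body, err := fetch(base + path)
        if err != nil {
            log.Fatalf("GET %s%s: %v", base, path, err)
        }
        fmt.Printf("== %s ==\n%s\n", path, body)
    }
}
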
jseiser commented 2 years ago

@SuperQ

stats.txt task.txt

metrics.txt

This is with the new 0.2.0 image running. Note that I had to scrub out some data, like account numbers. I also made them .txt files since GitHub won't allow you to upload .json. I also uploaded a metrics file.

SuperQ commented 2 years ago

The important part is to have real json values from the API so we can write tests.

Scrubbing account data is fine.

jseiser commented 2 years ago

The important part is to have real json values from the API so we can write tests.

The JSON provided in those text files is from the calls you requested.

jseiser commented 2 years ago

@SuperQ

Let me know if there is anything I can provide for you on this. I have environments in place that I can use to test images, but I don't know Go well enough to provide much technical assistance.

Thank you.

jseiser commented 2 years ago

Sorry to be a bother. I understand this exporter doesn't appear to get much use, but is there any chance you can provide documentation on how to test this, so I can try to poke at it?

All of our ECS tasks now have broken memory and CPU metrics, where before we just had broken CPU metrics.

Thank you.

jseiser commented 2 years ago

Just wanted to check in and see if there was anything we can do to provide assistance.

Thanks,

SuperQ commented 2 years ago

Hey, I have been on vacation, so haven't had time to look into this. For now, running the old version and dividing the CPU results by 10000000 should work.

jseiser commented 2 years ago

@SuperQ

Thanks for the update, and I hope your vacation went well.

I will revert to v0.1.1 and see if we can get the CPU stuff working.

I would like to offer again that if we could get a bit more info on the building/testing process, we could attempt to provide fixes from our side. We have no Go-specific devs, but we definitely have devs.

Thanks

SuperQ commented 2 years ago

If you can figure out where the parsing is going wrong, that'd be helpful.

I think the fixtures you provided should be sufficient to test with, I just haven't had a chance to make them into a unit test.

jseiser commented 2 years ago

@SuperQ

I guess I'm asking more about how you do the testing and building. There is no documentation on it that I can find. We have a stable of Python devs, and I have a bit of Golang experience, so I imagine we could potentially cobble together some eyes on this.

SuperQ commented 2 years ago

Everything is done via our CI pipelines.

The testing in this exporter is pretty minimal, since it doesn't have access to a real AWS API.

We can add a simulated HTTP response to the unit tests. That should prevent future issues.
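
For illustration only, a minimal sketch of such a simulated response using Go's standard net/http/httptest (not the exporter's actual test code; the testdata/stats.json path is a hypothetical fixture name):

package ecsexporter_test

import (
    "io"
    "net/http"
    "net/http/httptest"
    "testing"
)

func TestTaskStatsFixture(t *testing.T) {
    // Serve a canned /task/stats response captured from a real Fargate task,
    // e.g. the stats fixture attached earlier in this issue.
    srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        if r.URL.Path == "/task/stats" {
            http.ServeFile(w, r, "testdata/stats.json") // hypothetical fixture path
            return
        }
        http.NotFound(w, r)
    }))
    defer srv.Close()

    // Point the code under test at srv.URL instead of ECS_CONTAINER_METADATA_URI_V4
    // and assert on the decoded CPU/memory values. Here we only check that the
    // simulated endpoint answers with a non-empty payload.
    resp, err := http.Get(srv.URL + "/task/stats")
    if err != nil {
        t.Fatalf("request failed: %v", err)
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusOK {
        t.Fatalf("unexpected status: %s", resp.Status)
    }
    body, err := io.ReadAll(resp.Body)
    if err != nil {
        t.Fatalf("reading body failed: %v", err)
    }
    if len(body) == 0 {
        t.Fatal("expected a non-empty stats payload")
    }
}
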

jseiser commented 2 years ago

@SuperQ

Any information on how we can build/test this locally would help. Otherwise we are really just stuck here waiting and bugging you, and I know you are busy. We have devs, just not Golang-specific devs.

jseiser commented 2 years ago

Hey, I have been on vacation, so haven't had time to look into this. For now, running the old version and dividing the CPU results by 10000000 should work.

I'm not sure what is correct here.

For example, the cAdvisor query looks like:

sum by (name) (irate(container_cpu_user_seconds_total{instance=~"$node",job=~"$job",image!="", name=~"$name"}[$__rate_interval]) / scalar(machine_cpu_cores{instance=~"$node",job=~"$job"}) ) * 100 

The Python Prometheus exporter query looks like:

rate(ecs_cpu_seconds_total{ecs_task_id="a94a0debf2c14362a6b4f22108d1ef12"}[30s])

I'm not sure which model would work here, or which number should be divided.

kvendingoldo commented 1 year ago

Do we have any news here?

jseiser commented 1 year ago

@SuperQ

Just wanted to check in on this one and see if there was anything additional we could provide?

SuperQ commented 1 year ago

Sorry, I've not had time to look into this recently. Someone needs to write some unit tests with data provided above so we can reproduce the issue.

jseiser commented 1 year ago

@SuperQ

I just wanted to touch base on this one again; it's still not working properly.

You posted above about working around the issue, but I wasn't really sure how you were saying to do it. That was on Aug 24, 2022. Any chance you can clarify what you meant?

As for writing the additional tests, if someone could provide more context, I could probably get one of our devs to take a stab at it. Outside of knowing it runs in your CI pipelines, I didn't see much more than a large Makefile, and it wasn't clear to me where to start investigating.

Also, I'm not sure if this is volunteer work for you or not. If it is, is there anything we can do privately to push this one over the line?

Thanks,

rkul commented 1 year ago

I assume this can be easily fixed by merging my PR: https://github.com/prometheus-community/ecs_exporter/pull/49

SuperQ commented 1 year ago

Ok, fix is released, please give it a try.

rkul commented 1 year ago

I've tested it on several of our ECS clusters, and it looks good. I see a few other issues: CPU metrics require normalisation, and the memory limit returns 1^64 for Fargate tasks where the container memory limit isn't set explicitly. But the initial issue is definitely solved.

[Screenshots: Grafana CPU and memory graphs, 2023-01-26 19:37 and 19:40]

*There are tasks on the common graph that were terminated due to scale-down.

jseiser commented 1 year ago

Any chance you can post the working queries? We're about to deploy this to our ECS/Fargate environments.

Thanks


rkul commented 1 year ago

I did a bit of research and figured out that AWS updates the task metadata with Docker stats every 10s, so rate() is okay for that if you set an interval like 1m. Also, CPU utilization is in percent of a single core, so if you allocated something like 0.25 vCPU for a task/container, then 25% means full utilization.

sum by (ecs_task_arn)(rate(ecs_cpu_seconds_total{ecs_task_name="$app"}[$interval])) * 100
sum by (container,ecs_task_arn)(rate(ecs_cpu_seconds_total{ecs_task_arn="$task"}[$interval])) * 100
sum by (ecs_task_arn)(ecs_memory_bytes{ecs_task_name="$app"})/1024/1024
sum by (container,ecs_task_arn)(ecs_memory_bytes{ecs_task_arn="$task"})/1024/1024
SuperQ commented 1 year ago

FYI, you want to use $__rate_interval in Grafana.

jseiser commented 1 year ago

@rkul

Do your graphs actually reflect what's shown in CloudWatch? Using your queries, we show things like 1-2% CPU usage, while CloudWatch will show 70%.