Same issue with the main image.
The only log generated:
{
"container_id": "50ab69a6813f4ceca8b38402f74d86ae-33489729",
"container_name": "prom_exporter",
"deployment_id": "JLS",
"ecs_cluster": "Foundation_DEV_JLS",
"ecs_task_arn": "stuff here",
"ecs_task_definition": "Foundation-DEV-JLS-Console:27",
"environment": "DEV",
"layer": "console",
"log": "2022/08/08 19:56:34 Starting server at \":9779\"",
"source": "stderr"
}
Hmm. It would help to have a debug log and an example ECS API request so we can try and reproduce it.
@SuperQ
If you can tell me how to generate what you want, I can get it.
@SuperQ
While waiting for further instructions, I reverted to the older v0.1.1 image and can confirm the metrics return; the CPU values are just incorrect, as we know.
[automation@ip-10-1-118-213 ecs-service-discovery]$ curl 10.1.110.54:9779/metrics > metrics.txt
[automation@ip-10-1-118-213 ecs-service-discovery]$ grep cpu metrics.txt
# HELP ecs_cpu_seconds_total Total CPU usage in seconds.
# TYPE ecs_cpu_seconds_total counter
ecs_cpu_seconds_total{container="console",cpu="0"} 5.834025195e+07
ecs_cpu_seconds_total{container="console",cpu="1"} 5.421806216e+07
ecs_cpu_seconds_total{container="log_router",cpu="0"} 6.18522212e+06
ecs_cpu_seconds_total{container="log_router",cpu="1"} 5.15301504e+06
ecs_cpu_seconds_total{container="nginx",cpu="0"} 7.73859699e+06
ecs_cpu_seconds_total{container="nginx",cpu="1"} 1.002344517e+07
ecs_cpu_seconds_total{container="prom_exporter",cpu="0"} 2.51195986e+06
ecs_cpu_seconds_total{container="prom_exporter",cpu="1"} 4.74037571e+06
[automation@ip-10-1-118-213 ecs-service-discovery]$ grep ecs_mem metrics.txt
# HELP ecs_memory_bytes Memory usage in bytes.
# TYPE ecs_memory_bytes gauge
ecs_memory_bytes{container="console"} 1.63786752e+08
ecs_memory_bytes{container="log_router"} 6.3922176e+07
ecs_memory_bytes{container="nginx"} 2.8196864e+07
ecs_memory_bytes{container="prom_exporter"} 2.6288128e+07
There should be an env var, ECS_CONTAINER_METADATA_URI_V4:
curl -o stats.json "${ECS_CONTAINER_METADATA_URI_V4}/task/stats"
curl -o task.json "${ECS_CONTAINER_METADATA_URI_V4}/task"
@SuperQ
This is with the NEW v0.2.0 image running. Note that I had to scrub out some data, like account numbers. Also, I made the files .txt, since GitHub won't allow you to upload .json. I also uploaded a metrics file.
Scrubbing account data is fine.
The important part is to have real JSON values from the API so we can write tests.
The JSON provided in those text files is from the calls you requested.
@SuperQ
Let me know if there is anything I can provide for you on this. I have environments in place I can use to test images with, but I don't know Go well enough to provide much technical assistance.
Thank you.
Sorry to be a bother; I understand this exporter doesn't appear to get much use, but is there any chance you can provide documentation on how to test this, so I can maybe try to poke at it?
All of our ECS tasks now have broken memory and CPU metrics, where before we just had broken CPU metrics.
Thank you.
Just wanted to check in and see if there was anything we can do to provide assistance.
Thanks,
Hey, I have been on vacation, so haven't had time to look into this. For now, running the old version and dividing the CPU results by 10000000 should work.
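As a hedged PromQL sketch of that workaround (assuming, per the comment above, that the old version's counter values are inflated by a factor of 10000000):
# divide the v0.1.1 counter values down to approximate real CPU seconds
rate(ecs_cpu_seconds_total[1m]) / 10000000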
@SuperQ
Thanks for the update, and I hope your vacation went well.
I will revert to v0.1.1 and see if we can get the CPU stuff working.
I would like to again offer that if we could get a bit more info on the building/testing, we could attempt to provide fixes from our side. We have no Go-specific devs, but we definitely have devs.
Thanks
If you can figure out where the parsing is going wrong, that'd be helpful.
I think the fixtures you provided should be sufficient to test with, I just haven't had a chance to make them into a unit test.
@SuperQ
I guess I'm asking more about how you do the testing and building. There is no documentation on it that I can find. We have a stable of Python devs, and I have a bit of Golang experience, so I imagine we could potentially cobble together some eyes on this.
Everything is done via our CI pipelines.
The testing in this exporter is pretty minimal, since it doesn't have access to a real AWS API.
We can add a simulated HTTP response to the unit tests. That should prevent future issues.
@SuperQ
Any information on how we can build/test this locally would help. Otherwise we are really just stuck here waiting on and bugging you, and I know you are busy. We have devs, just not Golang-specific devs.
Regarding the suggestion above to run the old version and divide the CPU results by 10000000:
I'm not sure what is correct here.
For example, a cAdvisor-based query looks like:
sum by (name) (irate(container_cpu_user_seconds_total{instance=~"$node",job=~"$job",image!="", name=~"$name"}[$__rate_interval]) / scalar(machine_cpu_cores{instance=~"$node",job=~"$job"}) ) * 100
while a Python Prometheus exporter query looks like:
rate(ecs_cpu_seconds_total{ecs_task_id="a94a0debf2c14362a6b4f22108d1ef12"}[30s])
I'm not sure which model would work here, or which number should be divided.
Do we have any news here?
@SuperQ
Just wanted to check in on this one and see if there was anything additional we could provide?
Sorry, I've not had time to look into this recently. Someone needs to write some unit tests with data provided above so we can reproduce the issue.
@SuperQ
I just wanted to touch base on this one again; it's still not working properly.
You posted above about working around the issue, but I wasn't really sure how you were saying to do it. That was on Aug 24, 2022. Any chance you can clarify what you meant?
As for writing the additional tests, if someone could provide more context, I could probably get one of our devs to take a stab at it. Outside of knowing it runs in your CI pipelines, I didn't see much more than a large Makefile, and it wasn't clear to me where to start investigating.
Also, I'm not sure if this is volunteer work for you or not; if it is, is there anything we can do privately to push this one over the line?
Thanks,
This can easily be fixed, I assume, by merging my PR: https://github.com/prometheus-community/ecs_exporter/pull/49
Ok, fix is released, please give it a try.
I've tested it on several of our ECS clusters, and it looks good. I see a few other issues: CPU metrics require normalisation, and the memory limit returns 2^64 for Fargate tasks where the container memory limit isn't set explicitly. But the initial issue is definitely solved.
Screenshot 2023-01-26 at 19 37 06: https://user-images.githubusercontent.com/5493637/214884104-0371d321-2f9e-4983-8803-3fefc9cea481.png
Screenshot 2023-01-26 at 19 40 09: https://user-images.githubusercontent.com/5493637/214884139-cb866566-b577-452d-8975-6c23772bd812.png
*There are tasks on the common graph that were terminated due to scale-down.
Any chance you can post the working queries? About to deploy this to our ECS/Fargate environments.
Thanks
I did a bit of research and figured out that AWS updates the Task Metadata with Docker stats every 10s, so rate is okay for that if you set an interval like 1m. Also, CPU utilization is in percent of one single core, so if you allocated something like .25 vCPU for a task/container, then 25% means full utilization (a normalized variant is sketched after the queries below).
sum by (ecs_task_arn)(rate(ecs_cpu_seconds_total{ecs_task_name="$app"}[$interval])) * 100
sum by (container,ecs_task_arn)(rate(ecs_cpu_seconds_total{ecs_task_arn="$task"}[$interval])) * 100
sum by (ecs_task_arn)(ecs_memory_bytes{ecs_task_name="$app"})/1024/1024
sum by (container,ecs_task_arn)(ecs_memory_bytes{ecs_task_arn="$task"})/1024/1024
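A hedged sketch of that normalization (the 0.25 divisor is an assumed .25 vCPU task allocation, not something these metrics report; substitute your own task's CPU limit):
# assumed: task allocated .25 vCPU, so divide the per-core percentage by 0.25
# to express utilization as a percentage of the task's allocation
sum by (ecs_task_arn)(rate(ecs_cpu_seconds_total{ecs_task_name="$app"}[$interval])) * 100 / 0.25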
FYI, you want to use $__rate_interval in Grafana.
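For example, the per-app CPU query above would become (same query, with Grafana's variable swapped in for the interval):
sum by (ecs_task_arn)(rate(ecs_cpu_seconds_total{ecs_task_name="$app"}[$__rate_interval])) * 100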
@rkul
Do your graphs actually reflect what's shown in CloudWatch? Using your queries, we show things like 1-2% CPU usage, while CloudWatch will show 70%.
@SuperQ @roidelapluie @sysadmind
I think the latest release broke most things.
Steps to reproduce:
image: "quay.io/prometheuscommunity/ecs-exporter:v0.2.0"
curl 10.1.110.149:9779/metrics > metrics.txt
What we are seeing: