TNorden opened this pull request 9 years ago
Looks generally good to me, but I don't have personal experience with Mesos. Is the source (not sure what exactly that field is) unique enough between all frameworks, executors, and tasks to make this map assignment not have any unwanted collisions? https://github.com/evertrue/mesos_exporter/blob/master/main.go#L377
/cc @tommyulfsparre @antonlindstrom
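To make the concern concrete, here is a minimal, hypothetical Go sketch (made-up names and data, not the exporter's actual code) of what happens when a map is keyed by a source value that is not unique:

package main

import "fmt"

// Keying a map by the executor "source" field: if two entries report the same
// source, the second assignment silently replaces the first.
func main() {
	taskBySource := map[string]string{}

	// Pretend these pairs came from /state.json.
	executors := []struct{ source, taskID string }{
		{"foo-deploy-1", "foo-task-1"},
		{"foo-deploy-1", "foo-task-2"}, // duplicate source overwrites the entry above
	}

	for _, e := range executors {
		taskBySource[e.source] = e.taskID
	}

	fmt.Println(len(taskBySource)) // prints 1, not 2; one task id was dropped
}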
@TNorden what are you setting source to in Singularity? Do you have an example of what it looks like?
I somewhat fear that this might break stats reporting for tasks spawned using Aurora.
@StephanErb Singularity takes care of setting the source; it basically manages our deploys to Mesos. Here's an example of what I see when I hit the two endpoints used to gather metrics from the slaves:
/state.json
{
  "frameworks": [
    {
      "executors": [
        {
          "source": "foo-2015.10.08T20:07:46UTC-1444334890919-1-NEW_DEPLOY-1444334888553",
          "tasks": [
            {
              "id": "foo-2015.10.08T20:07:46UTC-1444334890934-1-mesos_agent_1d_2.priv"
            }
          ]
        },
        {
          "source": "bar-1444159820-1444159823224-1-mesos_agent_1d_2.priv",
          "tasks": [
            {
              "id": "bar-1444159820-1444159823224-1-mesos_agent_1d_2.priv"
            }
          ]
        }
      ]
    }
  ]
}
/monitor/statistics.json
[
  {
    "source": "foo-2015.10.08T20:07:46UTC-1444334890919-1-NEW_DEPLOY-1444334888553"
  },
  {
    "source": "bar-1444159820-1444159823224-1-mesos_agent_1d_2.priv"
  }
]
You can see from the two examples that sometimes source and id are the same, and that the source from both endpoints always seems to line up. I'm fairly new to Mesos, so this might only be an issue with Singularity, but my gut tells me two fields with the same name should have the same value.
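As a rough sketch of how these two payloads could be decoded (hypothetical struct names, not the exporter's real types), the field names below come straight from the JSON above, with source being the field the two endpoints share:

package sketch

import "encoding/json"

// Hypothetical shapes of /state.json and /monitor/statistics.json as shown above.
type State struct {
	Frameworks []struct {
		Executors []struct {
			Source string `json:"source"`
			Tasks  []struct {
				ID string `json:"id"`
			} `json:"tasks"`
		} `json:"executors"`
	} `json:"frameworks"`
}

type ExecutorStatistics struct {
	Source string `json:"source"`
}

// parse decodes both payloads so the source fields can be compared.
func parse(stateBody, statsBody []byte) (State, []ExecutorStatistics, error) {
	var state State
	var stats []ExecutorStatistics
	if err := json.Unmarshal(stateBody, &state); err != nil {
		return state, nil, err
	}
	if err := json.Unmarshal(statsBody, &stats); err != nil {
		return state, nil, err
	}
	return state, stats, nil
}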
Just checked with Aurora:
/state.json
source: "www-data.prod.hello.0",
tasks: [{
id: "1444421358359-www-data-prod-hello-0-592803ec-47c5-4dc9-9e03-8a3d10092a0b",
}]
/monitor/statistics.json
[{
executor_id: "thermos-1444421358359-www-data-prod-hello-0-592803ec-47c5-4dc9-9e03-8a3d10092a0b",
source: "www-data.prod.hello.0"
}]
That looks similar to your example and would not be broken by your patch (so my gut feeling was wrong here). Still, I believe there might be some other problem here.
/monitor/statistics.json lists resources per executor and uses the source as a back-reference to the data in /state.json. As apparent in the latter, a single Mesos executor can run multiple tasks. I believe that if you want your task id to line up with the statistics' source, you have to fix that directly within Singularity.
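To make that relationship concrete, a small hypothetical sketch (made-up types and data, not the exporter's code): each statistics entry resolves via source to exactly one executor, and that executor may own several task ids:

package main

import "fmt"

// Simplified, hypothetical shapes of the two slave endpoints.
type Task struct{ ID string }

type Executor struct {
	Source string
	Tasks  []Task // an executor can run more than one task
}

type Statistics struct {
	Source     string
	CPUSeconds float64
}

func main() {
	// Executors from /state.json, indexed by source.
	executors := map[string]Executor{
		"www-data.prod.hello.0": {
			Source: "www-data.prod.hello.0",
			Tasks:  []Task{{ID: "task-a"}, {ID: "task-b"}},
		},
	}

	// /monitor/statistics.json entries are per executor, not per task, so a
	// statistics entry can only be traced back to the executor's task list.
	stats := []Statistics{{Source: "www-data.prod.hello.0", CPUSeconds: 12.5}}

	for _, s := range stats {
		e := executors[s.Source]
		fmt.Printf("executor %q runs %d task(s), cpu_seconds=%.1f\n",
			e.Source, len(e.Tasks), s.CPUSeconds)
	}
}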
I'm confused – this is clearly an overwrite. Either we are dropping something or don't need the inner loop to begin with. I'm not familiar with Mesos at all. It would be great if someone could clarify what's happening here.
For Aurora and Singularity each executor just runs one task, so the pull request kinda works. However it would not work for frameworks which start multiple tasks per executor.
I'll try to summarize the issue: a slave can have n executors, and an executor can have n tasks. This structure is exposed via /state.json. /monitor/statistics.json exposes resource statistics per executor (and not per task!). The root cause (of why scraping is currently broken) is that the code on master assumes that resource metrics are exposed per task, which is wrong.
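As a hedged sketch of what per-executor exposure could look like with the Prometheus Go client (metric name, labels, and port are made up for illustration; this is not the exporter's current code):

package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Hypothetical metric exposed at the level Mesos actually reports statistics:
// per executor, not per task. Name and labels are illustrative only.
var cpuSystemSeconds = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "mesos_executor_cpu_system_seconds_total",
		Help: "System CPU seconds from /monitor/statistics.json, per executor.",
	},
	[]string{"executor_id", "framework_id"},
)

func main() {
	prometheus.MustRegister(cpuSystemSeconds)

	// In a real scrape loop these values would come from the slave endpoints.
	cpuSystemSeconds.WithLabelValues("executor-1", "framework-1").Set(42.0)

	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}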
@StephanErb Thanks for the explanation!
Does that mean that the exporter's metrics structure is fundamentally broken, then, if resource metrics like mesos_task_cpu_system_seconds_total cannot actually be reported per task?
I have never touched Mesos, so I'm relying on other people to decide on the right solution here. I also know that @discordianfish was interested in replacing this Mesos exporter entirely with his own Mesos exporter: https://groups.google.com/forum/#!searchin/prometheus-developers/mesos$20johannes/prometheus-developers/8jVQlQIKo-M/gcN_P-S6AwAJ
All the Mesos people here have to come together and decide what's the best way forward :)
My 2 cents
If metrics are provided by Mesos per executor, it makes sense to expose them the same way through Prometheus. But I think they should be labeled by the executorId and not this source, which I have never really seen used anywhere.
Even if it's a bit of a lie, I still think the exporter should report the taskId(s) with the metrics, and in most situations it would just work. Repeating the metric for each task of the executor, with a different task label, would make it work for both. The only caveat with that: if you actually have >1 tasks somewhere and you do a sum or avg, you need to do it by executor and not by task.
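A small, hypothetical sketch of that idea (illustrative names only): the executor-level value is repeated once per task, so any aggregation has to group by executor_id to avoid double counting:

package main

import "github.com/prometheus/client_golang/prometheus"

// The per-executor value from /monitor/statistics.json is emitted once for
// every task the executor runs, each time with a different task_id label.
// With more than one task per executor, sums and averages must group by
// executor_id, otherwise the executor's value is counted several times.
var taskCPU = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "mesos_task_cpu_system_seconds_total",
		Help: "System CPU seconds of the owning executor, repeated per task.",
	},
	[]string{"executor_id", "task_id"},
)

func recordPerTask(executorID string, taskIDs []string, value float64) {
	for _, taskID := range taskIDs {
		taskCPU.WithLabelValues(executorID, taskID).Set(value)
	}
}

func main() {
	prometheus.MustRegister(taskCPU)
	// One executor running two tasks: the same executor-level value appears twice.
	recordPerTask("executor-1", []string{"task-a", "task-b"}, 12.5)
}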
The only caveat with that: if you actually have >1 tasks somewhere and you do a sum or avg, you need to do it by executor and not by task.
This would make it incorrect to repeat the metric for each task as a sum isn't safe in all cases.
So yeah, I think that https://github.com/mesosphere/mesos_exporter provides better metrics. I might have missed something, and unfortunately I'm not using Mesos anymore. That, plus the concern of breaking existing users, is why I haven't replaced this repo yet. I'll open an issue and ping all the people here again so we can discuss it there.
We are running Mesos 0.23.0 with HubSpot's Singularity framework. For some reason there were a handful of executors where the task id did not line up with the statistics' source field; this fixed the problem for us.