TNorden opened this pull request 9 years ago
Looks generally good to me, but I don't have personal experience with Mesos. Is the source (not sure what exactly that field is) unique enough between all frameworks, executors, and tasks to make this map assignment not have any unwanted collisions? https://github.com/evertrue/mesos_exporter/blob/master/main.go#L377
/cc @tommyulfsparre @antonlindstrom
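To make the concern concrete, here is a minimal, hypothetical Go sketch (made-up names and data, not the exporter's actual code) of what happens when a map is keyed by a source value that is not unique:

package main

import "fmt"

// Keying a map by the executor "source" field: if two entries report the same
// source, the second assignment silently replaces the first.
func main() {
	taskBySource := map[string]string{}

	// Pretend these pairs came from /state.json.
	executors := []struct{ source, taskID string }{
		{"foo-deploy-1", "foo-task-1"},
		{"foo-deploy-1", "foo-task-2"}, // duplicate source overwrites the entry above
	}

	for _, e := range executors {
		taskBySource[e.source] = e.taskID
	}

	fmt.Println(len(taskBySource)) // prints 1, not 2; one task id was dropped
}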
@TNorden what are you setting source to in Singularity? Do you have an example of what it looks like?
I somewhat fear that this might break stats reporting for tasks spawned using Aurora.
@StephanErb Singularity takes care of setting the source; it basically manages our deploys to Mesos. Here's an example of what I see when I hit the two endpoints used to gather metrics from the slaves:
/state.json
{
  "frameworks": [
    {
      "executors": [
        {
          "source": "foo-2015.10.08T20:07:46UTC-1444334890919-1-NEW_DEPLOY-1444334888553",
          "tasks": [
            {
              "id": "foo-2015.10.08T20:07:46UTC-1444334890934-1-mesos_agent_1d_2.priv"
            }
          ]
        },
        {
          "source": "bar-1444159820-1444159823224-1-mesos_agent_1d_2.priv",
          "tasks": [
            {
              "id": "bar-1444159820-1444159823224-1-mesos_agent_1d_2.priv"
            }
          ]
        }
      ]
    }
  ]
}
/monitor/statistics.json
[
  {
    "source": "foo-2015.10.08T20:07:46UTC-1444334890919-1-NEW_DEPLOY-1444334888553"
  },
  {
    "source": "bar-1444159820-1444159823224-1-mesos_agent_1d_2.priv"
  }
]
You can see from the two examples that sometimes source and id are the same, and that the source from both endpoints always seems to line up. I'm fairly new to Mesos, so this might only be an issue with Singularity, but my gut tells me two fields with the same name should have the same value.
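As a rough sketch of how these two payloads could be decoded (hypothetical struct names, not the exporter's real types), the field names below come straight from the JSON above, with source being the field the two endpoints share:

package sketch

import "encoding/json"

// Hypothetical shapes of /state.json and /monitor/statistics.json as shown above.
type State struct {
	Frameworks []struct {
		Executors []struct {
			Source string `json:"source"`
			Tasks  []struct {
				ID string `json:"id"`
			} `json:"tasks"`
		} `json:"executors"`
	} `json:"frameworks"`
}

type ExecutorStatistics struct {
	Source string `json:"source"`
}

// parse decodes both payloads so the source fields can be compared.
func parse(stateBody, statsBody []byte) (State, []ExecutorStatistics, error) {
	var state State
	var stats []ExecutorStatistics
	if err := json.Unmarshal(stateBody, &state); err != nil {
		return state, nil, err
	}
	if err := json.Unmarshal(statsBody, &stats); err != nil {
		return state, nil, err
	}
	return state, stats, nil
}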
Just checked with Aurora:
/state.json
source: "www-data.prod.hello.0",
tasks: [{
id: "1444421358359-www-data-prod-hello-0-592803ec-47c5-4dc9-9e03-8a3d10092a0b",
}]
/monitor/statistics.json
[{
executor_id: "thermos-1444421358359-www-data-prod-hello-0-592803ec-47c5-4dc9-9e03-8a3d10092a0b",
source: "www-data.prod.hello.0"
}]
That looks similar to your example and would not be broken by your patch (so my gut feeling was wrong here). Still, I believe there might be some other problem here.
/monitor/statistics.json lists resources per executor and uses the source as a back-reference to the data in /state.json. As apparent in the latter, a single Mesos executor can run multiple tasks. I believe that if you want your task id to line up with the statistics' source, you have to fix that directly within Singularity.
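To make that relationship concrete, a small hypothetical sketch (made-up types and data, not the exporter's code): each statistics entry resolves via source to exactly one executor, and that executor may own several task ids:

package main

import "fmt"

// Simplified, hypothetical shapes of the two slave endpoints.
type Task struct{ ID string }

type Executor struct {
	Source string
	Tasks  []Task // an executor can run more than one task
}

type Statistics struct {
	Source     string
	CPUSeconds float64
}

func main() {
	// Executors from /state.json, indexed by source.
	executors := map[string]Executor{
		"www-data.prod.hello.0": {
			Source: "www-data.prod.hello.0",
			Tasks:  []Task{{ID: "task-a"}, {ID: "task-b"}},
		},
	}

	// /monitor/statistics.json entries are per executor, not per task, so a
	// statistics entry can only be traced back to the executor's task list.
	stats := []Statistics{{Source: "www-data.prod.hello.0", CPUSeconds: 12.5}}

	for _, s := range stats {
		e := executors[s.Source]
		fmt.Printf("executor %q runs %d task(s), cpu_seconds=%.1f\n",
			e.Source, len(e.Tasks), s.CPUSeconds)
	}
}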
I'm confused – this is clearly an overwrite. Either we are dropping something or don't need the inner loop to begin with. I'm not familiar with Mesos at all. It would be great if someone could clarify what's happening here.
For Aurora and Singularity each executor just runs one task, so the pull request kinda works. However it would not work for frameworks which start multiple tasks per executor.
I'll try to summarize the issue: a slave can have n executors, and an executor can have n tasks. This structure is exposed via /state.json. /monitor/statistics.json exposes resource statistics per executor (and not per task!). The root cause (of why scraping is currently broken) is that the code on master assumes that resource metrics are exposed per task, which is wrong.
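As a hedged sketch of what per-executor exposure could look like with the Prometheus Go client (metric name, labels, and port are made up for illustration; this is not the exporter's current code):

package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Hypothetical metric exposed at the level Mesos actually reports statistics:
// per executor, not per task. Name and labels are illustrative only.
var cpuSystemSeconds = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "mesos_executor_cpu_system_seconds_total",
		Help: "System CPU seconds from /monitor/statistics.json, per executor.",
	},
	[]string{"executor_id", "framework_id"},
)

func main() {
	prometheus.MustRegister(cpuSystemSeconds)

	// In a real scrape loop these values would come from the slave endpoints.
	cpuSystemSeconds.WithLabelValues("executor-1", "framework-1").Set(42.0)

	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}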
@StephanErb Thanks for the explanation!
Does that mean that the exporter's metrics structure is fundamentally broken, then, if resource metrics like mesos_task_cpu_system_seconds_total cannot actually be reported per task?
I have never touched Mesos, so I'm relying on other people to decide on the right solution here. I also know that @discordianfish was interested in replacing this Mesos exporter entirely with his own Mesos exporter: https://groups.google.com/forum/#!searchin/prometheus-developers/mesos$20johannes/prometheus-developers/8jVQlQIKo-M/gcN_P-S6AwAJ
All the Mesos people here have to come together and decide what's the best way forward :)
My 2 cents
If metrics are provided by Mesos per executor, it makes sense to expose them the same way through Prometheus. But I think they should be labeled by the executorId and not this source, which I have never really seen used anywhere.
Even if it's a bit of a lie, I still think the exporter should report the taskId(s) with the metrics, and in most situations it would just work. Repeating the metric for each task of the executor, with a different task label, would make it work for both. The only caveat with that: if you actually have >1 tasks somewhere and you do a sum or avg, you need to do it by executor and not by task.
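A small, hypothetical sketch of that idea (illustrative names only): the executor-level value is repeated once per task, so any aggregation has to group by executor_id to avoid double counting:

package main

import "github.com/prometheus/client_golang/prometheus"

// The per-executor value from /monitor/statistics.json is emitted once for
// every task the executor runs, each time with a different task_id label.
// With more than one task per executor, sums and averages must group by
// executor_id, otherwise the executor's value is counted several times.
var taskCPU = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "mesos_task_cpu_system_seconds_total",
		Help: "System CPU seconds of the owning executor, repeated per task.",
	},
	[]string{"executor_id", "task_id"},
)

func recordPerTask(executorID string, taskIDs []string, value float64) {
	for _, taskID := range taskIDs {
		taskCPU.WithLabelValues(executorID, taskID).Set(value)
	}
}

func main() {
	prometheus.MustRegister(taskCPU)
	// One executor running two tasks: the same executor-level value appears twice.
	recordPerTask("executor-1", []string{"task-a", "task-b"}, 12.5)
}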
The only caveat with that: if you actually have >1 tasks somewhere and you do a sum or avg, you need to do it by executor and not by task.
This would make it incorrect to repeat the metric for each task as a sum isn't safe in all cases.
So yeah, I think that https://github.com/mesosphere/mesos_exporter provides better metrics. I might have missed something, and unfortunately I'm not using Mesos anymore. That, plus the concern of breaking existing users, is why I haven't replaced this repo yet. I'll open an issue and ping all the people here again so we can discuss it there.
We are running Mesos 0.23.0 with HubSpot's Singularity framework. For some reason there were a handful of executors where the task id did not line up with the statistics' source field; this fixed the problem for us.