hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Render custom stats in dashboard #18128

Open mr-karan opened 1 year ago

mr-karan commented 1 year ago

Hey

Proposing a feature request that I understand may not be very generic, but I think it would be quite useful to have:

stats {
    endpoint = "http://$NOMAD_ADDR:$NOMAD_PORT/stats"
    interval = "15s"
}

The job -> group -> task spec can take a stats stanza, which is used to collect custom data from the given endpoint. The data can be rendered as a table on the task page (the spec can be decided later; maybe a simple flat JSON object of key -> value pairs).
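To make the "flat JSON rendered as a table" idea concrete, here is a minimal sketch in Python. Nothing in it is Nomad code: the function name `render_stats_table`, the example keys, and the rule of rejecting nested values are all assumptions made for illustration.

```python
import json


def render_stats_table(payload: str) -> str:
    """Render a flat JSON object as a two-column text table.

    Hypothetical helper: only a top-level object with scalar values is
    accepted, which is one possible reading of "flat JSON key -> value".
    """
    stats = json.loads(payload)
    if not isinstance(stats, dict):
        raise ValueError("expected a JSON object at the top level")
    rows = []
    for key, value in stats.items():
        if isinstance(value, (dict, list)):
            raise ValueError(f"value for {key!r} is nested, not flat")
        # Strings are shown as-is; other scalars use their JSON spelling.
        rows.append((key, value if isinstance(value, str) else json.dumps(value)))
    width = max([len("Key")] + [len(key) for key, _ in rows])
    lines = [f"{'Key'.ljust(width)}  Value", f"{'-' * width}  -----"]
    lines += [f"{key.ljust(width)}  {value}" for key, value in rows]
    return "\n".join(lines)
```

For example, `render_stats_table('{"queue_depth": 12, "version": "1.4.2"}')` would produce a header row followed by one row per key, in the order the keys appear in the payload.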

It'll be useful where job admins currently need to quickly exec into a task and curl the /stats, /metrics, or /healthz endpoints of their application to see the stats. This is not a replacement for metrics (for which Prometheus does an excellent job). It is just a useful feature to provide visibility into the current state of the application.

Happy to discuss more if this looks feasible enough.

tgross commented 1 year ago

Hi @mr-karan! So one point of clarification is that Nomad's state store isn't suitable for storing metrics long-term. So are you suggesting this as a source of ephemeral metrics data that the UI could hit?

In theory this is similar to the Nomad native service health checks, except with being able to render the displayed data more nicely in the UI. That would bypass a lot of tricky design issues here, like whether the stats.endpoint path needs to be a full URI or not, whether it can work if not exposed on the public internet, etc.

mr-karan commented 1 year ago

> that Nomad's state store isn't suitable for storing metrics long-term. So are you suggesting this as a source of ephemeral metrics data that the UI could hit?

Yep. Raft isn't ideal for storing time-series metrics. What I was suggesting is something similar to how we show CPU/Memory graphs on the task page. It doesn't store any data; it only starts showing the metrics from the time the user loads the page.
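A rough sketch of that "nothing stored, only samples since page load" behaviour, written in Python purely for illustration (the real UI would do this client-side). The class name `EphemeralStats`, the injected `fetch` callable, and the `max_samples` bound are all hypothetical.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Tuple


class EphemeralStats:
    """Keep a bounded, in-memory window of samples; nothing is persisted."""

    def __init__(self, fetch: Callable[[], Dict[str, object]], max_samples: int = 60):
        self._fetch = fetch  # stand-in for a GET against the stats endpoint
        # deque(maxlen=...) silently drops the oldest sample once full.
        self._samples: Deque[Tuple[float, Dict[str, object]]] = deque(maxlen=max_samples)

    def poll(self, now: float) -> None:
        """Take one sample; the caller decides the interval (e.g. every 15s)."""
        self._samples.append((now, self._fetch()))

    def series(self, key: str) -> List[Tuple[float, object]]:
        """Return (timestamp, value) pairs for one key across the window."""
        return [(ts, sample[key]) for ts, sample in self._samples if key in sample]
```

The window starts empty when the object is created, which mirrors a graph that begins when the page is opened, and discarding the object discards all data.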

> In theory this is similar to the Nomad native service health checks except with being able to render the displayed data more nicely in the UI

Yes, this is what I had in mind. Given that there's already an ability to query the health checks using HTTP, we can query an optional /stats endpoint and render the data as a table in the UI. The schema of the /stats endpoint can be as simple as flat JSON key-value pairs.
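On the application side, such an endpoint could be as small as the following sketch, which uses only the Python standard library. The stat names (`uptime_seconds`, `queue_depth`, `version`) and the helper names `collect_stats` and `make_server` are invented for the example; the only thing taken from the proposal is the /stats path and the flat JSON shape.

```python
import json
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

START = time.monotonic()


def collect_stats() -> dict:
    """Return a flat dict of scalars; the keys here are illustrative only."""
    return {
        "uptime_seconds": int(time.monotonic() - START),
        "queue_depth": 0,
        "version": "0.0.0-example",
    }


class StatsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/stats":
            self.send_error(404)
            return
        body = json.dumps(collect_stats()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Silence per-request logging for this sketch.
        pass


def make_server(host: str = "127.0.0.1", port: int = 0) -> HTTPServer:
    """Build the server; port 0 asks the OS for any free port."""
    return HTTPServer((host, port), StatsHandler)
```

Calling `make_server(port=8080).serve_forever()` would serve the endpoint on a fixed port; in a Nomad task the port would more likely come from the environment than be hard-coded.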

What do you think?

tgross commented 1 year ago

Because of the way HTTP traffic gets turned into RPCs for forwarding around the cluster, it looks like implementation of something like this might be:

My major concern with this design is that we're asking application developers to add a Nomad-specific endpoint to their applications, rather than letting them use something more industry-standard. I can't think of any other case where we've asked users to do this, and it likely wouldn't get used at all for off-the-shelf software. So that makes me question the ROI here.

I feel like there are other potential options here:

I'm going to mark this for roadmapping and further discussion. Anything we'd want to do here will need some interdisciplinary work between @mikenomitch for sussing out what we want to support, @juliezzhou for a sensible UX, @philrenaud for all the UI work, and the rest of engineering to do all the plumbing.