maplibre / martin

Blazing fast and lightweight PostGIS, MBtiles and PMtiles tile server, tile generation, and mbtiles tooling.
https://martin.maplibre.org
Apache License 2.0

feat: OpenMetrics / Prometheus / metrics endpoint (monitoring) #773

Open lefuturiste opened 1 year ago

lefuturiste commented 1 year ago

Hello, I would be interested in having a monitoring endpoint on Martin. Of course, I'm ready to work on it.

The need is to have some sort of monitoring endpoint to retrieve some metrics. I would be interested in:

I'm thinking this can be behind a feature flag in order to give the user a choice.

What is your take on this?

Implementation details:

We have the choice between two Prometheus instrumentation libraries.

The https://github.com/tikv/rust-prometheus library is compatible with the Actix Web instrumentation library https://github.com/nlopes/actix-web-prom

nyurik commented 1 year ago

Thanks, I love the idea! Moreover, these metrics should also be shown in the root web UI, and possibly even in a Ratatui CLI UI (via a CLI flag). Obviously, all of these should not be in a single PR. I think it should be enabled by default (this way only those who want to use Martin as a library would disable it), and there could be a CLI flag like --no-metrics to disable it at runtime?
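To make the "enabled by default, disabled at runtime" idea concrete, here is a minimal sketch using only the standard library. The --no-metrics flag is the hypothetical one floated above, not an existing Martin option, and the metrics_enabled helper is a name made up for illustration:

```rust
/// Decide whether the metrics endpoint should be served.
///
/// `--no-metrics` is a hypothetical flag taken from the discussion above;
/// it is not an existing Martin option.
pub fn metrics_enabled<I, S>(args: I) -> bool
where
    I: IntoIterator<Item = S>,
    S: AsRef<str>,
{
    // Metrics are on by default; any occurrence of the flag turns them off.
    !args.into_iter().any(|arg| arg.as_ref() == "--no-metrics")
}
```

A caller could pass std::env::args() to this function at startup and skip registering the metrics route when it returns false. A real implementation would more likely hang the flag off the existing argument parser rather than scanning raw arguments.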

lefuturiste commented 1 year ago

Cool, I'm starting a POC with the nlopes/actix-web-prom lib. We will probably also add metrics to monitor the performance of the PostgreSQL requests.

nyurik commented 1 year ago

@lefuturiste you may also be interested in the charming crate to produce some nice graphs. This way we can have an endpoint like /_/graph/pie_by_ret_code.svg or /_/graph/pie_by_source.svg to produce a pie chart SVG image of which HTTP code was returned to the user or how many requests were made to each source. Eventually we could have some admin interface (something that can be easily disabled by an nginx proxy or a CLI flag) that shows some stats.

lefuturiste commented 1 year ago

I will prioritize the instrumentation of the software.

Your ideas sound cool, but I personally don't think that including software to visualize the data is a good idea. I think that we should keep these kinds of features separate and let users choose how they want to view the data.

At least for my use case, I don't need it, since I will be using Grafana to visualize the data and analyze how Martin behaves in production.

nyurik commented 1 year ago

Fair point. Technically, the charming crate uses the Apache ECharts JavaScript library under the hood, so as long as Martin can provide some statistics via an API of sorts, JavaScript could do all of that. Martin can bake in various JavaScript libs during the build step if needed.

lefuturiste commented 1 year ago

Okay, now for the main issue, which is getting the metrics out: I have a problem. By default, the nlopes/actix-web-prom crate uses the route's path pattern as the endpoint label in order to reduce label cardinality, so it gives something like this (and after reading the actix-web-prom code, I found no way to configure that):

martin_http_requests_total{endpoint="/{source_ids}/{z}/{x}/{y}",method="GET",status="200"} 10

But for my needs, I still want a separate time series for each {source_ids} value. For that, I will need to change the implementation of the actix-web-prom crate.
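To illustrate the difference, here is a rough, standard-library-only sketch of a request counter whose endpoint label either collapses every path segment into the route pattern or keeps the source segment. It does not use actix-web-prom or reflect its internals; endpoint_label, RequestCounter and the keep_source switch are all hypothetical names, and the "/roads/{z}/{x}/{y}" label shape is just one possible choice:

```rust
use std::collections::BTreeMap;

/// Label set identifying one time series: (endpoint, method, status).
type Labels = (String, String, u16);

/// Turn a tile request path into an endpoint label.
/// The z/x/y segments are always replaced by placeholders; the first
/// segment (the source ids) is kept only when `keep_source` is true.
/// Paths that do not have four segments are grouped under "other".
pub fn endpoint_label(path: &str, keep_source: bool) -> String {
    let segments: Vec<&str> = path.trim_matches('/').split('/').collect();
    match segments.as_slice() {
        [source, _z, _x, _y] if keep_source => format!("/{}/{{z}}/{{x}}/{{y}}", source),
        [_, _, _, _] => "/{source_ids}/{z}/{x}/{y}".to_string(),
        _ => "other".to_string(),
    }
}

/// In-memory request counter keyed by label set (illustration only).
#[derive(Default)]
pub struct RequestCounter {
    counts: BTreeMap<Labels, u64>,
}

impl RequestCounter {
    /// Record one finished request.
    pub fn observe(&mut self, path: &str, method: &str, status: u16, keep_source: bool) {
        let key = (endpoint_label(path, keep_source), method.to_string(), status);
        *self.counts.entry(key).or_insert(0) += 1;
    }

    /// Render the counters as Prometheus text exposition lines.
    pub fn render(&self) -> String {
        let mut out = String::new();
        for ((endpoint, method, status), count) in &self.counts {
            out.push_str(&format!(
                "martin_http_requests_total{{endpoint=\"{}\",method=\"{}\",status=\"{}\"}} {}\n",
                endpoint, method, status, count
            ));
        }
        out
    }
}
```

With keep_source set to false, requests to /roads/3/1/2 and /parks/3/1/2 land in the same series, as in the output above. With keep_source set to true, they produce one series per source, so the number of series grows with the number of distinct source combinations requested rather than with every z/x/y value.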

I imagine some sort of attribute macro on the route. I'm currently trying to find the best way for a dev to tell actix-web-prom: "Hey, for this route I need to keep the cardinality for the {source_ids} param".

I will open an issue upstream and maybe open an MR. But depending on how long it takes to get merged, we may have to use a workaround, go without this feature, or use our own version of the crate for more flexibility.