Open dtp263 opened 2 years ago
It's definitely possible to integrate Prometheus into Sidecar, as the Helm setup is already part of the repo. I am not totally sure what the workload for the integration would look like, though - @ArshamTeymouri any thoughts on this?
Hi Tarik!
That's a great idea to have the metrics exposed by the application. For this, we need a new endpoint on the application, say `/metrics`, that exposes application details in a Prometheus-compatible way. Please check the Prometheus docs for more details about how to integrate those metrics into your application.
Besides, two additional endpoints should also be introduced, `/readiness` and `/liveness`. Those endpoints will expose the general health of the application. For example, if the app is crashing, `/liveness` should return a non-200 HTTP response, and if the application is working fine but is just waiting for its backing database to sync, `/readiness` should return non-200. This way Kubernetes knows whether it needs to restart the pod (application) because of a crash, or wait until the application becomes ready to serve requests.
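A minimal sketch of what those three endpoints could look like with Express and the `prom-client` library (this is only an illustration, not Sidecar's actual implementation; the `isAlive`/`isReady` flags stand in for whatever real health checks the application performs):

```ts
import express from 'express';
import { collectDefaultMetrics, register } from 'prom-client';

// Collect the default Node.js process metrics (CPU, memory, event loop, ...).
collectDefaultMetrics();

const app = express();

// Placeholder flags; a real implementation would derive these from actual
// application state (e.g. whether the backing database/node is synced).
let isAlive = true;
let isReady = false;

// Prometheus scrapes this endpoint and parses the text exposition format.
app.get('/metrics', async (_req, res) => {
  res.set('Content-Type', register.contentType);
  res.send(await register.metrics());
});

// Liveness: a non-200 response tells Kubernetes the pod has crashed and should be restarted.
app.get('/liveness', (_req, res) => {
  res.status(isAlive ? 200 : 503).send(isAlive ? 'OK' : 'NOT OK');
});

// Readiness: a non-200 response tells Kubernetes to keep the pod but withhold traffic,
// e.g. while the application is still waiting for its backing database to sync.
app.get('/readiness', (_req, res) => {
  res.status(isReady ? 200 : 503).send(isReady ? 'READY' : 'NOT READY');
});

app.listen(8080);
```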
We have already done the same thing for one of the internal tools, Gitspiegel. You can reference its corresponding issue and PR for more details: https://github.com/paritytech/gitspiegel/issues/84
Once those endpoints are in place, I can proceed with configuring Prometheus to scrape the metrics and store them in its database. Later, you would be able to create your own dashboards on Grafana and visualize your application performance with fancy panels.
Feel free to reach me on Matrix if you need anything.
@ArshamTeymouri I love the idea of adding the `readiness` and `liveness` endpoints, but I wonder if it would make more sense to do that in a separate effort.
Implementing and exposing the basic metrics at `/metrics` seems like a nicely scoped, objective block of work, whereas determining `liveness`/`readiness` could be a touch more subjective, which opens up the possibility of one blocking the other.
@ArshamTeymouri I was looking into the issue and I was wondering whether it is recommended to run the `/metrics` endpoint in a separate Express instance (different from the one the API is running on) and on a different port.
Maybe this would be a good idea so that the metrics are not exposed to the internet? I am thinking of cases like the public instances of Sidecar, or if another team uses Sidecar and also has public instances; maybe they would not want their metrics to be exposed.
Another advantage might be that if Sidecar has issues on its own server instance/port, the `/metrics` endpoint (if it is on a different port) will still be running, right? So the fact that Sidecar is experiencing issues will be reflected in the corresponding metrics.
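A minimal sketch of that idea, assuming both servers live in the same Node.js process (which keeps the metrics registry shared); the ports and the example route are illustrative, not what Sidecar actually uses:

```ts
import express from 'express';
import { collectDefaultMetrics, register } from 'prom-client';

collectDefaultMetrics();

// Public API server, the one exposed to the internet.
const api = express();
api.get('/blocks/head', (_req, res) => {
  res.json({ number: '0x0' }); // illustrative payload only
});
api.listen(8080);

// Separate Express instance for metrics on its own port. This port is simply
// not exposed publicly (e.g. not routed through the public service/ingress),
// so public Sidecar instances do not leak their metrics to the internet.
const metricsApp = express();
metricsApp.get('/metrics', async (_req, res) => {
  res.set('Content-Type', register.contentType);
  res.send(await register.metrics());
});
metricsApp.listen(9090);
```

Note that this only helps with problems scoped to the API server itself; if the whole process goes down, both ports stop responding, which is where the liveness probe comes in.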
@Imod7 sorry for the late reply, I was on a long vacation. I'm not sure it's possible to run the metrics endpoint in a different instance of the API; I don't think it would be possible to gather metrics from another process via a sidecar. Most of the applications I know expose their metrics on the same API/port, and there are always ways for sysadmins/devops to secure the `/metrics` endpoint with ACLs at the webserver level, so no worries there. Still, I'd suggest checking enterprise applications and their implementations, as this topic sounds more developer-side than infrastructure-side.
Current Status of this Issue
- `metrics` endpoint is available in a dedicated server & port [ status : DONE with PR #1232 ]
- `liveness` endpoint [ status : PENDING ] DONE with [PR #1465]

I integrated metrics support into our substrate node helm-chart (https://github.com/paritytech/helm-charts/pull/284). However, I don't see a lot of useful metrics exposed beyond the default nodejs cpu/mem. For example, it would be useful to have a histogram of response times for each HTTP endpoint.
@PierreBesson thank you so much for the feedback! 💯 🙏 Yes, indeed, it does not have a lot of useful metrics right now, since my goal for the first iteration was to build the base on which we could later add any metrics we think are useful. Also, I think it needs proper research on which metrics would be useful in a REST API, and I have not done this (yet!), so your suggestion is a huge help. Please share if you have any other ideas/suggestions/recommendations/links on what I should check. For now, I will add the histogram metric to the pending list (my comment above).
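For reference, a per-route response-time histogram could be sketched roughly as below with `prom-client` and an Express middleware; the metric name, label set and bucket boundaries are placeholders, not a decided design:

```ts
import express, { NextFunction, Request, Response } from 'express';
import { Histogram, register } from 'prom-client';

// Histogram of HTTP response times, labelled by method, route and status code.
// Name and buckets are illustrative only.
const httpRequestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5],
});

// Middleware that starts a timer per request and records it when the response finishes.
function responseTimeMetrics(req: Request, res: Response, next: NextFunction): void {
  const endTimer = httpRequestDuration.startTimer();
  res.on('finish', () => {
    // req.route is only populated once a route has matched; fall back to the raw path.
    const route = req.route?.path ?? req.path;
    endTimer({ method: req.method, route, status_code: String(res.statusCode) });
  });
  next();
}

const app = express();
app.use(responseTimeMetrics);

app.get('/metrics', async (_req, res) => {
  res.set('Content-Type', register.contentType);
  res.send(await register.metrics());
});

app.listen(8080);
```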
what is the status on this one? @Imod7
The status is reflected in this comment, which shows the `metrics` endpoint (with which we get the possibility to add any metrics we want) and one error counter metric. Also, this comment shows that further research can be done to come up with other metrics (beyond the two suggested above) that might be useful.
This PR integrated HTTP metrics for all routes: [PR #1465]
Proposed Change or Idea
Add Prometheus metrics for monitoring the health of a Sidecar instance.
Steps:
First Metrics
Additional Information
In order to provide a clean, consistent service, you need to be able to easily monitor the health and status of each service.