Expose database query / application metrics on internal `/metrics` endpoint

brahman81 commented 6 years ago

To assist during debugging and capacity planning, it would prove useful to expose database metrics on the /metrics Horizon endpoint.

- Time spent performing the various database calls (offers, transactions, assets, accounts, etc.)
- Total requests per second to the core database
- Total requests per second to the horizon database

Ideally we would namespace the metrics to distinguish Horizon vs Core database queries.

MonsieurNicolas commented 6 years ago

this should probably not be internet facing

brahman81 commented 6 years ago

Thanks @MonsieurNicolas, it makes sense to somehow restrict access to these extra database metrics.

Perhaps a user could restrict access to the /metrics endpoint before enabling these new DB stats via a config option? These are nice metrics to graph, especially when debugging or doing capacity planning...

brahman81 commented 5 years ago

Having a second HTTP listener started on an alternate port (8001) would be ideal imo. Most users can easily restrict access to it, and it would be a big ops win to be able to extract these types of metrics from Horizon.

I have dreams of Horizon metrics being in Prometheus, Grafana, etc.

brahman81 commented 5 years ago

Is instrumenting the application with https://github.com/prometheus/client_golang an option? It would avoid the need for an external exporter and allow Prometheus to scrape Horizon directly.

bartekn commented 4 years ago

Just added a couple PRs connected to this:

https://github.com/stellar/go/pull/2261 services/horizon: Move /metrics to internal server
https://github.com/stellar/go/pull/2265 services/horizon/actions: Add Prometheus text exposition format in /metrics
https://github.com/stellar/go/pull/2260 services/horizon: Add new ingestion system metrics to /metrics

When all are merged I'll deploy it to the staging server and we can try to integrate it with our Prometheus server.

bartekn commented 4 years ago

All PRs above are merged. The DB metrics require a small refactor of the support/db package, so I'm moving this to 1.1.0. cc @ire-and-curses.

bartekn commented 4 years ago

> Is instrumenting the application with https://github.com/prometheus/client_golang an option ?

It's done in https://github.com/stellar/go/pull/2846. It should help with adding more metrics soon.

@stellar/horizon-committers if you have ideas regarding new metrics please add them as a comment here. Here's my list:

- Duration of the processing time for each ingestion processor. Per change/transaction breakdown.
- Counter for each tx/op error type returned by txsub.
- Duration of the order book graph state update per ledger.
- LedgerEntryChangeCache compression ratio stats.

When it comes to SQL query stats, I'm wondering if we should do it. First, the majority of endpoints send a single SQL query to get results, so we can easily track this using HTTP stats. Second, we often modify the SQL query string for the same query type. An obvious example is the insert batch builders. We'd need to name each query and probably have a second param describing the number of rows being added.
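To make the second point concrete, the sketch below shows why the SQL string is a poor metric key for batch inserts, and what "name each query plus a row-count param" could look like. Everything here is hypothetical (`buildInsert`, `rowBucket`, `seriesKey`, the bucket boundaries, and the table name); it is one possible shape, not a proposal for the actual support/db API.

```go
package main

import (
	"fmt"
	"strings"
)

// buildInsert is a stand-in for a batch builder: the SQL text changes with
// the number of rows, so the raw string cannot identify the query type.
func buildInsert(table string, rows int) string {
	placeholders := make([]string, rows)
	for i := range placeholders {
		placeholders[i] = "(?)"
	}
	return "INSERT INTO " + table + " (id) VALUES " + strings.Join(placeholders, ", ")
}

// rowBucket maps a row count to a coarse label so that every distinct batch
// size does not create its own series. The boundaries are arbitrary.
func rowBucket(rows int) string {
	switch {
	case rows <= 1:
		return "1"
	case rows <= 10:
		return "2-10"
	case rows <= 100:
		return "11-100"
	default:
		return "101+"
	}
}

// seriesKey identifies a series by an explicit query name plus the row
// bucket, never by the SQL text.
func seriesKey(name string, rows int) string {
	return fmt.Sprintf("%s{rows=%q}", name, rowBucket(rows))
}

func main() {
	counts := map[string]int{}
	for _, rows := range []int{3, 7, 250} {
		sql := buildInsert("example_table", rows)
		_ = sql // a real caller would execute the statement here
		counts[seriesKey("insert_example", rows)]++
	}
	// The 3-row and 7-row inserts have different SQL but share one series.
	fmt.Println(counts)
}
```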

2opremio commented 4 years ago

If it's not already there, how about ingestion throughput (ledgers/time) and captive core stats (CPU and memory consumption of captive core)? Also, the reingestion status (how many workers, what ledger ranges are being reingested, and the progress in each of them).

bartekn commented 4 years ago

> If it's not already there. How about ingestion throughput (ledgers/time) and captive core stats (CPU and memory consumption of captive core). Also, the reingestion status (how many workers, what ledger ranges are being reingested, what's the progress in each of them).

I think you're talking about reingestion, right?

We already have a summary for processed ledgers (that includes a counter) but throughput in the online mode will be stable at 1 ledger per 5 seconds on average. I don't think we have Captive Core CPU and memory stats available via Go so it should be done at OS level. For reingestion stats (# of workers, throughput - makes sense here, progress per worker, etc.) 👍.

bartekn commented 4 years ago

Added one more metric here: https://github.com/stellar/go/pull/2921. Closing this, let's open a separate issue for each metric when it's really needed.

stellar / go

Expose database query / application metrics on internal `/metrics` endpoint #620