Closed brahman81 closed 4 years ago
This should probably not be internet-facing.
Thanks @MonsieurNicolas, it makes sense to somehow restrict access to these extra database metrics.
A user could potentially restrict access to the /metrics endpoint before enabling these new db stats via a config option? These are nice metrics to graph, especially when debugging or looking at capacity planning...
Having a second HTTP listener started on an alternate port (8001) would be ideal imo: access can easily be restricted by most users, and it would be a big ops win to be able to extract these types of metrics from Horizon.
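To make the second-listener idea concrete, here is a minimal sketch using only the Go standard library. It is not Horizon's actual code: the `/ledgers` route, the `placeholderMetrics` handler, and all function names are assumptions made for illustration. The point is only that `/metrics` is registered on the internal mux and nowhere else.

```go
package main

import (
	"fmt"
	"net/http"
)

// newPublicMux builds the handler for the internet-facing port.
// It deliberately does not register /metrics.
func newPublicMux() *http.ServeMux {
	mux := http.NewServeMux()
	// "/ledgers" stands in for the real public API routes (assumption).
	mux.HandleFunc("/ledgers", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "public API response")
	})
	return mux
}

// newAdminMux builds the handler for the internal port, the only place
// where /metrics is registered.
func newAdminMux(metrics http.Handler) *http.ServeMux {
	mux := http.NewServeMux()
	mux.Handle("/metrics", metrics)
	return mux
}

// placeholderMetrics is a stand-in for a real metrics handler (hypothetical).
func placeholderMetrics() http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "example_metric 1")
	})
}

// hasRoute reports whether mux has a handler registered for path.
func hasRoute(mux *http.ServeMux, path string) bool {
	req, err := http.NewRequest(http.MethodGet, path, nil)
	if err != nil {
		return false
	}
	_, pattern := mux.Handler(req)
	return pattern != ""
}

// serve starts both listeners and blocks until either one fails.
func serve(publicAddr, adminAddr string) error {
	errc := make(chan error, 2)
	go func() { errc <- http.ListenAndServe(publicAddr, newPublicMux()) }()
	go func() { errc <- http.ListenAndServe(adminAddr, newAdminMux(placeholderMetrics())) }()
	return <-errc
}
```

Calling `serve(":8000", "127.0.0.1:8001")` from `main` would keep the metrics port on loopback (the public address here is an assumption); operators could equally bind it to a private interface and firewall it.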
I have dreams of Horizon metrics being in Prometheus, Grafana, etc.
Is instrumenting the application with https://github.com/prometheus/client_golang an option? It would avoid the need for an external exporter and allow Prometheus to scrape Horizon directly.
Just added a couple PRs connected to this:
When all are merged I'll deploy it to the staging server and we can try to integrate it with our Prometheus server.
All PRs above are merged. DB metrics require a small refactor of the support/db package, so I'm moving this to 1.1.0. cc @ire-and-curses.
Is instrumenting the application with https://github.com/prometheus/client_golang an option?
It's done in https://github.com/stellar/go/pull/2846. It should help with adding more metrics soon.
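@stellar/horizon-committers if you have ideas regarding new metrics please add them as a comment here. Here's my list:
- Duration of the processing time for each ingestion processor. Per change/transaction breakdown.
- Counter for each tx/op error type returned by txsub.
- Duration of the order book graph state update per ledger.
- LedgerEntryChangeCache (https://godoc.org/github.com/stellar/go/exp/ingest/io#LedgerEntryChangeCache) compression ratio stats.
When it comes to SQL query stats, I'm wondering if we should do it. First, the majority of endpoints send a single SQL query to get results, so we can easily track this using HTTP stats. Second, we often modify the SQL query string for the same query type. An obvious example is the insert batch builders. We'd need to name each query and probably have a second param giving the number of rows being added.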
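As a rough illustration of the "counter for each error type returned by txsub" idea, the sketch below keeps one count per error label and renders it in the Prometheus text exposition format. It uses only the standard library so it stays self-contained; with client_golang the same thing would be a labeled counter. The `ErrorCounter` type and the metric name in the usage note are hypothetical, and label escaping via `%q` is a simplification that is fine for plain ASCII labels.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// ErrorCounter counts events by error type. It is a stand-in for a labeled
// Prometheus counter (hypothetical type, not part of Horizon).
type ErrorCounter struct {
	mu     sync.Mutex
	counts map[string]uint64
}

// NewErrorCounter returns an empty counter.
func NewErrorCounter() *ErrorCounter {
	return &ErrorCounter{counts: map[string]uint64{}}
}

// Inc adds one to the count for errType.
func (c *ErrorCounter) Inc(errType string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[errType]++
}

// Render returns the counts in Prometheus text format, one line per error
// type, sorted by label so the output is stable.
func (c *ErrorCounter) Render(name string) string {
	c.mu.Lock()
	defer c.mu.Unlock()
	keys := make([]string, 0, len(c.counts))
	for k := range c.counts {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	var b strings.Builder
	fmt.Fprintf(&b, "# TYPE %s counter\n", name)
	for _, k := range keys {
		fmt.Fprintf(&b, "%s{error=%q} %d\n", name, k, c.counts[k])
	}
	return b.String()
}
```

For example, after `Inc("tx_bad_seq")` twice and `Inc("tx_failed")` once, `Render("horizon_txsub_errors_total")` yields one `# TYPE` line followed by one sample line per error type.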
If it's not already there, how about ingestion throughput (ledgers/time) and captive core stats (CPU and memory consumption of captive core)? Also, the reingestion status (how many workers, what ledger ranges are being reingested, what's the progress in each of them).
On Fri, Jul 24, 2020, 16:05 Bartek Nowotarski notifications@github.com wrote:
Is instrumenting the application with https://github.com/prometheus/client_golang an option?
It's done in #2846 (https://github.com/stellar/go/pull/2846). It should help adding more metrics soon.
@stellar/horizon-committers (https://github.com/orgs/stellar/teams/horizon-committers) if you have ideas regarding new metrics please add them as a comment here. Here's my list:
- Duration of the processing time for each ingestion processor. Per change/transaction breakdown.
- Counter for each tx/op error type returned by txsub.
- Duration of the order book graph state update per ledger.
- LedgerEntryChangeCache (https://godoc.org/github.com/stellar/go/exp/ingest/io#LedgerEntryChangeCache) compression ratio stats.
When it comes to SQL queries stats, I'm wondering if we should do it. First, majority of endpoints send a single SQL query to get results so we can easily track this using HTTP stats. Second, often we modify SQL query string for the same query type. Obvious example is inserts batch builders. We'd need to name each query and probably have a second param explaining the number of rows being added.
If it's not already there, how about ingestion throughput (ledgers/time) and captive core stats (CPU and memory consumption of captive core)? Also, the reingestion status (how many workers, what ledger ranges are being reingested, what's the progress in each of them).
I think you're talking about reingestion, right?
We already have a summary for processed ledgers (that includes a counter), but throughput in online mode will be stable at 1 ledger per 5 seconds on average. I don't think we have Captive Core CPU and memory stats available via Go, so that should be done at the OS level. For reingestion stats (# of workers, throughput, which makes sense here, progress per worker, etc.) 👍.
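The reingestion stats mentioned above (number of workers, per-worker progress, throughput) could be derived from very little state. The following is a sketch under assumed names: `WorkerProgress` and `Throughput` are hypothetical and do not correspond to existing Horizon types.

```go
package main

// WorkerProgress describes one reingestion worker (hypothetical type).
type WorkerProgress struct {
	From, To uint32 // inclusive ledger range assigned to the worker
	Done     uint32 // ledgers already processed within that range
}

// Fraction returns the completed share of the worker's range, in [0, 1]
// as long as Done does not exceed the range size.
func (w WorkerProgress) Fraction() float64 {
	if w.To < w.From {
		return 0 // empty or invalid range
	}
	total := uint64(w.To) - uint64(w.From) + 1
	return float64(w.Done) / float64(total)
}

// Throughput returns ledgers per second across all workers, given the
// elapsed wall-clock time in seconds.
func Throughput(workers []WorkerProgress, elapsedSeconds float64) float64 {
	if elapsedSeconds <= 0 {
		return 0
	}
	var done uint64
	for _, w := range workers {
		done += uint64(w.Done)
	}
	return float64(done) / elapsedSeconds
}
```

The worker count is just `len(workers)`, and each `Fraction()` could be exported as a gauge labeled by ledger range.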
Added one more metric here: https://github.com/stellar/go/pull/2921. Closing this; let's open a separate issue for each metric when it's really needed.
To assist during debugging and capacity planning, it would prove useful to expose database metrics on the /metrics Horizon endpoint.
Ideally we would namespace the metrics to distinguish Horizon vs Core database queries.
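One possible shape for namespaced database metrics, sketched with the standard library only: Go's `database/sql` exposes connection-pool statistics via `sql.DBStats` (returned by `(*sql.DB).Stats()`), and a namespace prefix can separate the Horizon database from the Core one. The function name, the metric names, and the sample values in `exampleStats` are assumptions for illustration, not what Horizon actually exports.

```go
package main

import (
	"database/sql"
	"fmt"
	"strings"
	"time"
)

// renderDBStats formats a subset of sql.DBStats as Prometheus-style lines,
// prefixing every metric with namespace (e.g. "horizon" or "core").
func renderDBStats(namespace string, s sql.DBStats) string {
	var b strings.Builder
	fmt.Fprintf(&b, "%s_db_max_open_connections %d\n", namespace, s.MaxOpenConnections)
	fmt.Fprintf(&b, "%s_db_open_connections %d\n", namespace, s.OpenConnections)
	fmt.Fprintf(&b, "%s_db_in_use %d\n", namespace, s.InUse)
	fmt.Fprintf(&b, "%s_db_idle %d\n", namespace, s.Idle)
	fmt.Fprintf(&b, "%s_db_wait_count %d\n", namespace, s.WaitCount)
	fmt.Fprintf(&b, "%s_db_wait_duration_seconds %g\n", namespace, s.WaitDuration.Seconds())
	return b.String()
}

// exampleStats returns made-up pool statistics; in a real service these
// would come from db.Stats() on an open *sql.DB.
func exampleStats() sql.DBStats {
	return sql.DBStats{
		MaxOpenConnections: 20,
		OpenConnections:    5,
		InUse:              3,
		Idle:               2,
		WaitCount:          7,
		WaitDuration:       1500 * time.Millisecond,
	}
}
```

Rendering the same stats once with `"horizon"` and once with `"core"` gives two disjoint metric families, so dashboards can graph the two databases separately.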