stellar / stellar-core

Reference implementation for the peer-to-peer agent that manages the Stellar network.
https://www.stellar.org
Other
3.12k stars 968 forks source link

Bogus database related metrics reported in `metrics` endpoint #2222

Open MonsieurNicolas opened 5 years ago

MonsieurNicolas commented 5 years ago

v11.3.0 Linux, running Postgresql

For some reason, many database related metrics show up with values of "0".

Glancing at the code, this doesn't seem to make sense as those metrics should only be queried/updated when running with sqlite (this is accessing the metrics endpoint directly).

Here is the list:

"database.insert.trustline"
"database.insert.trust"
"database.insert.state"
"database.insert.peer"
"database.insert.offer"
"database.insert.ledger-header-history"
"database.insert.data"
"database.insert.ban"
"database.insert.account"
"database.delete.txhistory"
"database.delete.txfeehistory"
"database.delete.trust"
"database.delete.state"
"database.delete.scpquorums"
"database.delete.quoruminfo"
"database.delete.ledger-header-history"
"database.delete.ledger-header"
"database.delete.ban"
"app.post-on-main-thread-with-delay.delay"
"database.select.publishqueue"
"database.select.trustline"
"database.select.txfeehistory"
"database.update.account"
"database.update.ban"
"database.update.data"
"database.update.ledger-header"
"database.update.ledger-header-history"
"database.update.offer"
"database.update.publishqueue"
"database.update.quoruminfo"
"database.update.scphistory"
"database.update.scpquorums"
"database.update.trust"
"database.update.trustline"
"database.update.txfeehistory"
"database.update.txhistory"
MonsieurNicolas commented 4 years ago

I think we should remove database related metrics that are that granular: the right way to track down performance issues with SQL is using SQL profiling, not with client side code (there are many tools that generate reports from SQL trace logs, for example https://pgbadger.darold.net/ )

graydon commented 4 years ago

These are zero valued because we moved to mostly upserts with the bulk query work last year.

I would like to keep these metrics. End-to-end latency on a database query from the client's perspective is important and we spent a lot of time tuning it, will likely have to go back to tuning it again someday. We bought enough performance room not to look at it in recent months, but that's not permanent.

MonsieurNicolas commented 4 years ago

Yeah for this, I suspect what we want is not metrics though but high quality, low overhead, trace logging (a problem we have for other subsystems).

MonsieurNicolas commented 2 years ago

Another problem with those bulk sql metrics is that we don't know how many entries are involved, so samples are quite misleading. For example, recently we have some ledgers that load many claimable balances, so the select_claimablebalance metrics shot up. I think what we probably want instead is to get an idea of how long we spend querying/updating tables on a per ledger basis

jonjove commented 2 years ago

@MonsieurNicolas that has been mentioned a number of times and there are probably one (or more) issues tracking it, so if it's fixed please be sure to close out those issues.

jonjove commented 2 years ago

I actually can't find an issue referencing that, which is shocking because we've known about it for years....