Open tac0turtle opened 2 years ago
Having metrics is a good idea generally, but I don't see a benefit of adding an arbitrary metrics endpoint to the storage interface. Any implementation of the interface can already expose whatever metrics it wants, and can set them up when the instance is constructed.
Generic performance metrics could maybe help characterize load (e.g., how many reads, how many writes, how much time spent in each), but won't give much insight about what's happening inside the database, without profiling the implementation. Backends have very different behaviours based on contention and type of load due to locking, caching policies, log structure, query structure, and so on.
For networks having live performance issues, a better place to start (and easier!) is to capture CPU and memory profiles from nodes under load. Besides being easier to do, profile data have the advantage of giving information about where the load comes from in the application, not just aggregate counts of high-level interface calls. Better still, capturing CPU and memory profiles is not blocked by SDK releases, so they can start immediately.
Another point to keep in mind is that we plan to move away from tm-db in either the current or next Tendermint Core release cycle. See TM RFC 001 for additional context. For that reason, I advise not building too much new much mechanism around tm-db. (Adding metrics to existing implementations is fine if it helps, though for them to do much good they'd still need to survive a version bump).
for sure, for goleveldb right now we have no insight into compaction info or some other parts. It is hard to gauge if we change something if it helps or not. We could do things based off of cpu and memory, was thinking it would be simpler to see the databases exported stats instead.
As for moving away from this repo, for tendermint its not a concern but the cosmos-sdk uses this as well where more of these problems are popping up. In the cosmos-sdk this repo will be used as the old store will be marked as deprecated and not removed for at least another year.
Ill do this quickly in a fork, it should be fairly easy.
for sure, for goleveldb right now we have no insight into compaction info or some other parts. It is hard to gauge if we change something if it helps or not. We could do things based off of cpu and memory, was thinking it would be simpler to see the databases exported stats instead.
Sure. And you don't need to do anything to the storage interface to support that. Just add whatever method you want to the implementation, and type assert for the presence of that method wherever you're setting up the exporter.
Right now on live networks we are running into large issues with databases but have no insight into what's going on in the db.
we should add a metrics method to the db interface and begin adding in prometheus metrics that apps and tendermint could export if they decide to. The cosmos-sdk will make the change to adopt this right away.