Open beroy opened 1 year ago
@bkmartinjr, thanks a lot for the great comments. We had a design review last week and discussed some of the related issues but did not cover all the questions here. I try to address them here: 1, 2) The main goals of this are a) figuring the hot spots (perf/mem) in our code stack (either python or R) and try to reduce the bottleneck b) the ability of detect regressions as a part of CI. The generic profiler only allows detection in (b) and not necessarily diagnosis as you mentioned. Diagnosis could be a lot more complicated than what software stack frame graphs can provide. 3) That's a great point. I did not think about it but thinking more, I really think that can be very useful if I can do that as custom profiler! That means we still have the general architecture but we also have a TileDB profiler that gets attached to the app and collects the TileDB stats. Not sure how possible it is to access TileDB API from outside a process. In the worst case we can add an access point or just a flag for doing so.
Finally, this repo suggestion seems to come from an agreement between TileDB and CZI for sharing RFCs. @maniarathi can comment here.
Hi @beroy,
design review last week
Did this review include the TileDB team? Most of the code is in their DB stack (not in SOMA or Census code), and that is also where most of the complexity and performance sensitive code is (e.g, github.com/TileDB/TileDB and github.com/TileDB/TileDB-Py). The core TileDB team has extensive experience with performance work, and I presume have a tool stack already in place. I just want to make sure we are filling a gap that exists, and that we benefit from the existing knowledge and tools to build upon.
TileDB stats
AFAIK, you can't capture them from outside the process, but enabling in-process collection has very little overhead and they are designed to be captured in the normal course of using the DB. Given that you must run the system with your own driver process, I suggest just capturing them as part of that driver process. The total size of stats summary is tiny and can be written to logs as part of running the core tests.
There are some near-term caveats to this due to our double-instantiation of the core DB, @gspowley and others can talk you through their plans to centralize everything (tactically,just capture stats fro both instances, or perhaps even ignore the second if you are focused on read-only perf). The stats API in SOMA is already exposed via tiledbsoma.tiledbsoma_stats_*
, and there are equivalent API in the tiledb
package for the second DB instance.
@gspowley and others can talk you through their plans to centralize everything
@bkmartinjr the current POC is @nguyenv
@bkmartinjr. TileDB people were not in the review but I discuss their profiling tools with @gspowley. While their tool capture some of the needed stuff the main focus is on distributed nodes. Also my goal is to capture the breakdown across python/R and C++. Your suggestion about collecting DB stats makes total sense (directly log them from the driver). BTW, here by core DB do you mean TileDB? Also, I'm not familiar with the double-instantiation plan. Will check with them
@beroy - I'm using "core DB" and "embedded TileDB" as synonyms.
We end up with two separate instantiations, with their own private memory/context/etc due to the way we are bootstrapping the SOMA C++ layer. As noted by @johnkerl, @nguyenv is POC on this (apologies for earlier misdirect). Using Python as an example (similar issue in R):
I only point this out as the separate core DB instances have their own stats data.
two non–content-related notes from me (these apply to both open RFCs right now so I am pasting this note in both places):
@beroy @johnkerl As we now have a functioning profiler, can we merge the RFC and close this out?
@beroy - one thing that would help me review this is some additional context and background information on the requirements and definition of success. Questions top of mind:
Lastly (minor point) - we should probably discuss if this belongs in the TileDB-SOMA repo, or as a peer repo (motivated by a goal of keeping each repo simpler/focused - i.e., the old monorepo debate :-)