single-cell-data / SOMA

A flexible and extensible API for annotated 2D matrix data stored in multiple underlying formats.
MIT License
69 stars 9 forks source link

SOMA Profiler RFCs #162

Open beroy opened 1 year ago

bkmartinjr commented 1 year ago

@beroy - one thing that would help me review this is some additional context and background information on the requirements and definition of success. Questions top of mind:

  1. Is the primary use case automating a smoke test to detect regressions as part of CI? Or is the goal to help diagnose/debug regressions once they are detected? Or both?
  2. If your goal is a diagnostic tool: my experience is that regressions are often I/O or schema related, or are highly obscured by concurrency issues. I am wondering if a focus on CPU/memory profiling is going to be useful in practice (vs. for example, the built-in TileDB query statistics). Put another way, if your goal is a diagnostic tool, is this the biggest bang for the buck? (very curious to hear what @Shelnutt2 and @gspowley think as I'm sure they have internal tooling that has covered some of this ground).
  3. If your goal is detection/smoke testing for regressions, and you think a historical log of past runs will be useful, do you want to capture core DB statistics?

Lastly (minor point) - we should probably discuss if this belongs in the TileDB-SOMA repo, or as a peer repo (motivated by a goal of keeping each repo simpler/focused - i.e., the old monorepo debate :-)

beroy commented 1 year ago

@bkmartinjr, thanks a lot for the great comments. We had a design review last week and discussed some of the related issues but did not cover all the questions here. I try to address them here: 1, 2) The main goals of this are a) figuring the hot spots (perf/mem) in our code stack (either python or R) and try to reduce the bottleneck b) the ability of detect regressions as a part of CI. The generic profiler only allows detection in (b) and not necessarily diagnosis as you mentioned. Diagnosis could be a lot more complicated than what software stack frame graphs can provide. 3) That's a great point. I did not think about it but thinking more, I really think that can be very useful if I can do that as custom profiler! That means we still have the general architecture but we also have a TileDB profiler that gets attached to the app and collects the TileDB stats. Not sure how possible it is to access TileDB API from outside a process. In the worst case we can add an access point or just a flag for doing so.

Finally, this repo suggestion seems to come from an agreement between TileDB and CZI for sharing RFCs. @maniarathi can comment here.

bkmartinjr commented 1 year ago

Hi @beroy,

design review last week

Did this review include the TileDB team? Most of the code is in their DB stack (not in SOMA or Census code), and that is also where most of the complexity and performance sensitive code is (e.g, github.com/TileDB/TileDB and github.com/TileDB/TileDB-Py). The core TileDB team has extensive experience with performance work, and I presume have a tool stack already in place. I just want to make sure we are filling a gap that exists, and that we benefit from the existing knowledge and tools to build upon.

TileDB stats

AFAIK, you can't capture them from outside the process, but enabling in-process collection has very little overhead and they are designed to be captured in the normal course of using the DB. Given that you must run the system with your own driver process, I suggest just capturing them as part of that driver process. The total size of stats summary is tiny and can be written to logs as part of running the core tests.

There are some near-term caveats to this due to our double-instantiation of the core DB, @gspowley and others can talk you through their plans to centralize everything (tactically,just capture stats fro both instances, or perhaps even ignore the second if you are focused on read-only perf). The stats API in SOMA is already exposed via tiledbsoma.tiledbsoma_stats_*, and there are equivalent API in the tiledb package for the second DB instance.

johnkerl commented 1 year ago

@gspowley and others can talk you through their plans to centralize everything

@bkmartinjr the current POC is @nguyenv

beroy commented 1 year ago

@bkmartinjr. TileDB people were not in the review but I discuss their profiling tools with @gspowley. While their tool capture some of the needed stuff the main focus is on distributed nodes. Also my goal is to capture the breakdown across python/R and C++. Your suggestion about collecting DB stats makes total sense (directly log them from the driver). BTW, here by core DB do you mean TileDB? Also, I'm not familiar with the double-instantiation plan. Will check with them

bkmartinjr commented 1 year ago

@beroy - I'm using "core DB" and "embedded TileDB" as synonyms.

We end up with two separate instantiations, with their own private memory/context/etc due to the way we are bootstrapping the SOMA C++ layer. As noted by @johnkerl, @nguyenv is POC on this (apologies for earlier misdirect). Using Python as an example (similar issue in R):

I only point this out as the separate core DB instances have their own stats data.

thetorpedodog commented 1 year ago

two non–content-related notes from me (these apply to both open RFCs right now so I am pasting this note in both places):

  1. Rebase so that you get the pre-commit format verification thingy
  2. (Optional) Consider formatting to one sentence per line. I did this in 3b5891c678e9d08f4c8817cb6426f82c02675cf0, but I didn’t formally write it down, so it’s not a hard Rule but it is a format I have found useful for editing.
atolopko-czi commented 11 months ago

@beroy @johnkerl As we now have a functioning profiler, can we merge the RFC and close this out?