nadeemlab / SPT

Spatial profiling toolbox for spatial characterization of tumor immune microenvironment in multiplex images (https://oncopathtk.org)

Make ondemand services parallelizable #326

Closed jimmymathews closed 2 months ago

jimmymathews commented 3 months ago

The ondemand services are not easily scaled or parallelized. Currently they load a version of all source data at initialization time.

  1. Deprecate the initialization-time loads for the ondemand services.
  2. Implement per-sample loads (at the cost of greater backend database bandwidth usage). This can probably substantially re-use the recently implemented binary cell data puller.
  3. Convert the custom TCP server to an HTTP service (perhaps also FastAPI; see the sketch after this list):
    • to simplify the call signatures and argument handling, and
    • to delegate multiple or concurrent request handling to FastAPI.
  4. Alter the TCP client to dispatch separate jobs or requests for each sample.
  5. Ensure that the service containers emit workload-indicating signals that can trigger appropriate autoscaling.
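A minimal sketch of what item 3 might look like, assuming FastAPI; the endpoint path, the parameter names, and the `compute_metric_for_sample` helper are hypothetical placeholders for illustration, not existing code in the repository:

```python
from fastapi import FastAPI

app = FastAPI()

def compute_metric_for_sample(study: str, sample: str, feature_class: str) -> float:
    # Placeholder for the actual per-sample metric computation.
    return 0.0

@app.get("/metric/{study}/{sample}")
async def metric_for_sample(study: str, sample: str, feature_class: str) -> dict:
    # FastAPI takes over argument parsing, validation, and concurrent request
    # handling, replacing the custom TCP call-signature logic.
    value = compute_metric_for_sample(study, sample, feature_class)
    return {"study": study, "sample": sample, "feature_class": feature_class, "value": value}
```

With per-sample endpoints along these lines, the client in item 4 would simply fan out one request per sample.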
jimmymathews commented 2 months ago

Ultimately this called for a more complete reorganization than suggested above. What is implemented in branch issue326 accomplishes (1) and (2), and an improvement over (3), (4), and (5).

Deprecation. The TCP server for computing ondemand metrics, together with its client, has been converted entirely into a computation worker with no direct IPC communication (TCP, HTTP, etc.). The cell data endpoint which relied on a dedicated service container has finally been deprecated (as promised in a previous issue). The pending_feature_computation table has also been deprecated.

Overall new behavior. Instead, a number of worker instances listen for a postgres notification announcing new jobs in a queue, and pick up jobs as they become available. These jobs are per-sample and write their results back to the database. So all the metrics (including counts) are computed entirely on demand, at the expense of database bandwidth for pulling the compressed binary cell data for each sample every time a metric is requested. Startup time for the application is correspondingly short, just a few seconds, compared to the former 25 minutes or so (previously devoted to building a local in-RAM cache to bypass the database during normal runtime).
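A minimal sketch of this kind of worker loop, assuming psycopg2, a LISTEN/NOTIFY wake-up, and a job table drained with SKIP LOCKED; the table, channel, and column names and the `compute_metric_for_sample` stub are hypothetical, not the actual SPT identifiers:

```python
import select
import psycopg2

def claim_and_run_one_job(cursor) -> bool:
    """Atomically claim one queued per-sample job, compute it, and store the result."""
    cursor.execute(
        """
        DELETE FROM ondemand_jobs
        WHERE id = (
            SELECT id FROM ondemand_jobs
            ORDER BY id
            FOR UPDATE SKIP LOCKED
            LIMIT 1
        )
        RETURNING sample_identifier, feature_specification;
        """
    )
    row = cursor.fetchone()
    if row is None:
        return False
    sample, specification = row
    value = compute_metric_for_sample(sample, specification)
    cursor.execute(
        "INSERT INTO feature_values (sample_identifier, feature_specification, value) "
        "VALUES (%s, %s, %s)",
        (sample, specification, value),
    )
    return True

def compute_metric_for_sample(sample: str, specification: str) -> float:
    # Stand-in for pulling the compressed binary cell data and computing the metric.
    return 0.0

def main() -> None:
    connection = psycopg2.connect("dbname=spt")  # hypothetical connection string
    connection.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)
    cursor = connection.cursor()
    cursor.execute("LISTEN new_ondemand_job;")  # hypothetical channel name
    while True:
        while claim_and_run_one_job(cursor):
            pass  # drain the queue before going back to sleep
        if select.select([connection], [], [], 60) == ([], [], []):
            continue  # timed out with no notification; re-check the queue anyway
        connection.poll()
        connection.notifies.clear()  # notifications only wake the worker; jobs come from the table
```

In a pattern like this, the notification only wakes a worker; the job claim itself goes through the table with SKIP LOCKED, so several workers can run concurrently without picking up the same sample twice.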

A few specific new behaviors. This work is not only a refactor; there are a few new behaviors:

Infrastructure changes to support this. Now a cluster of several low-resource VMs can perform in parallel what was previously performed serially on high-resource machines. In the case of large samples, the parallelization speed-up this entails is perhaps 100 times or more (for comparison, I was never able to complete the computations under the former regime). In the case of a large number of smaller samples, e.g. the case with ~500 samples, there is possibly a slowdown in the current deployment configuration, since the parallelization is not yet pushed to its capacity (only about 20 jobs run at once), and there is significant per-sample overhead now, including lots of database connections.

Tests. The test suite passes, so the capacity to compute metrics is substantially preserved.

Changed values. However, in the full real-dataset reanalysis scripts, a large number of summary statistics (about a quarter) differ by a few percentage points compared to the previous implementation, despite no changes to the underlying per-metric-type implementations (proximity ball trees, squidpy, etc.). I determined that many of these discrepancies are due to edge cases where former null values have become 0, skewing the subsequent statistics. But I can't be sure that every case is of this kind. It is possible that the new implementation, which starts every metric computation from the same slimmed-down sample data payload, fixes some bug in the former implementation. Such bugs could have been due to the circuitous way that whole-study data was loaded in various cryptic dict objects. In any case, the discrepancies don't seem to be too severe. Only a handful are in the 30% range or so.
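A hypothetical illustration, with made-up numbers, of how the null-versus-zero edge case described above can skew a downstream summary statistic:

```python
# One sample where the metric is undefined (null) in the former implementation.
per_sample_values = [0.42, 0.38, None]

# Former behavior: the null sample is excluded from the summary statistic.
defined = [v for v in per_sample_values if v is not None]
mean_dropping_null = sum(defined) / len(defined)        # 0.40

# New behavior in the edge case: the null becomes 0 and enters the average.
coerced = [v if v is not None else 0.0 for v in per_sample_values]
mean_with_zero = sum(coerced) / len(coerced)            # about 0.267
```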

Robust. In a large number of test cases, and some sporadic concurrent usage scenarios, the system has not had any crashes or indefinite hangs. Overall the deployed system no longer exhibits any of the bugs that were introduced at the beginning of this branch's work:

Wrap-up. What remains for issue326 is to fully deprecate some of the attempted features that turned out to be unnecessary (like the object references packed into the postgres notify payloads), and to clean up the debug-time alterations.

jimmymathews commented 2 months ago

A few sporadic improvements were also made: