As predicted, application performance on datasets with a large number of samples has degraded under the new architecture (parallelized, stateless workers).
Since the binary payloads representing each sample's data were greatly reduced as part of that work, it should now be feasible for the worker pods to cache the intermediate data, reducing internal bandwidth usage, database connections, etc.
Two levels of caching are possible:
1. Cache in memory, in each pod, the payloads coming from the database for a given sample, provided the payload is one of the relatively small ones. This cache can use a liberal LRU eviction policy, evicting once it holds roughly 100 MB.
2. Additionally cache, in memory in each pod, for each requested metric type, the metric-specific data structure that is built just before computation. (A minimal sketch of both levels follows this list.)
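As a rough illustration of the byte-budget LRU policy described above, here is a minimal sketch, assuming the workers are Python. The class name, the placeholder size function for the level-2 cache, and the ~1 MB-per-structure estimate are all assumptions for illustration, not the actual implementation.

```python
from collections import OrderedDict
from typing import Callable, Generic, Hashable, Optional, TypeVar

V = TypeVar("V")

class ByteBudgetLRU(Generic[V]):
    """In-memory cache that evicts least-recently-used entries once
    the summed entry sizes exceed a byte budget."""

    def __init__(self, max_bytes: int, size_of: Callable[[V], int]):
        self.max_bytes = max_bytes
        self.size_of = size_of
        self._bytes = 0
        self._entries: "OrderedDict[Hashable, V]" = OrderedDict()

    def get(self, key: Hashable) -> Optional[V]:
        value = self._entries.get(key)
        if value is not None:
            self._entries.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key: Hashable, value: V) -> None:
        if key in self._entries:
            self._bytes -= self.size_of(self._entries.pop(key))
        self._entries[key] = value
        self._bytes += self.size_of(value)
        while self._bytes > self.max_bytes and self._entries:
            _, evicted = self._entries.popitem(last=False)  # drop LRU entry
            self._bytes -= self.size_of(evicted)

    def __contains__(self, key: Hashable) -> bool:
        return key in self._entries

# Level 1: raw database payloads keyed by sample id, ~100 MB budget.
payload_cache: ByteBudgetLRU[bytes] = ByteBudgetLRU(100 * 1024 * 1024, len)

# Level 2: metric-specific structures keyed by (sample_id, metric_type).
# The size function is a placeholder (assume ~1 MB per structure); the
# real structures would need their own cost estimate.
metric_cache: ByteBudgetLRU[object] = ByteBudgetLRU(
    100 * 1024 * 1024, size_of=lambda _s: 1_000_000
)
```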
In either case, a pod that has cached some sample (case 1) or some sample preloaded for a metric (case 2) should preferentially take the queue jobs it can perform faster than other pods thanks to that caching, i.e. jobs for its cached samples.
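How a pod acts on that preference depends on the job queue's API, which isn't specified here. Continuing the sketch above, the following shows one way a pod could rank the jobs it is eligible for: the `Job` record and `pick_job` function are hypothetical, and actually claiming the chosen job would still need to be an atomic queue operation.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

@dataclass
class Job:
    sample_id: str
    metric_type: str

def pick_job(claimable: Iterable[Job]) -> Optional[Job]:
    """Prefer jobs this pod already holds warm data for: a level-2 hit
    (metric structure ready) beats a level-1 hit (raw payload cached),
    which beats an arbitrary job."""
    jobs = list(claimable)
    for job in jobs:
        if (job.sample_id, job.metric_type) in metric_cache:  # case 2 hit
            return job
    for job in jobs:
        if job.sample_id in payload_cache:  # case 1 hit
            return job
    return jobs[0] if jobs else None  # no affinity; take any job
```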
I think this strategy can be adopted instead of the previously considered one, namely restoring the monolithic, per-metric-type, data-preloaded containers just for the small datasets. It is better to have a single design than to try to support too many different computation pipelines.