vivarium-collective / vivarium-core

Core Interface and Engine for Vivarium
https://vivarium-core.readthedocs.io/
Apache License 2.0
23 stars 2 forks source link

Parallel queries #219

Closed thalassemia closed 1 year ago

thalassemia commented 2 years ago

A single MongoDB query is unable to fully saturate I/O throughput on Google Compute Engine. This PR creates helper methods to intelligently split large queries into smaller, approximately equal chunks that can be run in parallel on separate OS processes. The number of chunks can be manually tuned to reach I/O saturation for maximum performance.

This was made possible by a new compound index: {experiment_id: 1, 'data.time': 1, _id: 1}. This index fully covers the aggregation step in get_data_chunks and represents a marked improvement over using any single-key index for queries that specify ranges or values for experiment_id or data.time. Best of all, compound indices maintain the ability to cover queries on prefixes.

As a single data point, an aggregation that previously took about 2 hours on GCE finishes in just 3 minutes when properly parallelized.

I also fixed a careless typo in setup.py. After this PR is taken care of, I think we are ready for a release.


By creating this pull request, I agree to the Contributor License Agreement, which is available in CLA.md at the top level of this repository.