lazear / sage

Proteomics search & quantification so fast that it feels like magic
https://sage-docs.vercel.app
MIT License
210 stars 39 forks source link

Implement heapselect algorithm #80

Closed lazear closed 1 year ago

lazear commented 1 year ago

Sage spends a significant portion of its runtime sorting (peptides, fragment ions, preliminary scoring candidates, etc.).

After preliminary scoring of all potential candidates (how many fragments matched?), Sage selects the top k (~50 or more, depending on settings) candidates for hyperscore calculation. Currently, the entire list of candidates (100s to 100,000s) is completely sorted, and then the best k elements are selected.

This PR merges in an implementation of the heapselect algorithm. A size-bounded min heap is constructed in-place to select the top k elements. This speeds up candidate selection, and can improve total runtime by 10-15% depending on the search settings.

quickcheck is used to perform property-based testing of the heapselect algorithm, and one of the integration tests has been changed to use quickcheck as well, to increase testing surface area. In addition, crate internals have been reworked a bit, to provide initial support for compiling for wasm (#77) - there is still some awkwardness here that warrants more clean up. Further restructuring/refactoring should enable support for additional file formats as well.

Of note: