Speed up ir_dist - Githubissues

grst commented 3 years ago

Parasail is highly optimized - but it is invoked for every sequence individually from Python which creates a bottleneck. We could solve this issue by integrating with external tools that have been developed for scalable comparison of immune-cell repertoires (both for levenshtein and alignment distance).

Possible methods to consider

[ ] MMSeqs2: Originally designed for metagenomics, this is a very fast multiple sequence alignment tool. See some of the discussion below.
[ ] compAIRR ( https://www.biorxiv.org/content/10.1101/2021.10.30.466600v2.full)
[ ] clusTCR (https://svalkiers.github.io/clusTCR/)
[ ] tcrdist3
... (there's probably many more)

Ideally, we could provide wrappers for several of them.

grst commented 3 years ago

MMSeqs2 is likely faster when ran on all sequences simultaneously and should produce the same results when ran with mmseqs align: https://github.com/soedinglab/mmseqs2/wiki

When used with mmseqs2 prefilter it's even faster but only yields approximate results.

Calculating distances with mmseqs2 seems feasible. The alignment is ~30x faster compared to parasail. There will be some overhead for transforming the alignment scores into distances and for reading in the results, but we will likely end up somewhere >10x speedup, which sounds worthwile.

20k x 30k in ~30 sec (alignment only)

TODOs

[ ] split up alignment metric into alignment_parasail and alignment_mmseqs2; alignment will be deprecated and defaults to alignment_parasail.
[ ] If all pairs are included, the result file can be huge. It can be pre-filtered based on sequence identity and alignment length. Figure out sensible values.
[ ] The alignment scores need to be converted into distances by subtracting them from the max achievable alignment score. Maybe possible to do that already on the command line with awk and the like? If I need to read in the result tables into python this will eat up a lot of runtime.
[ ] I need the self-alignment score of each sequence. There is an option for that in mmseqs align, but only for the query sequences. Possible solution: Align both query and target database with itself, but require sequence identity of 100%. Should be very fast.
[ ] Compare results to parasail

grst commented 2 years ago

tcrdist3 might also be significantly faster than what scirpy is doing now. See https://github.com/scverse/scirpy/issues/286#issuecomment-1169430002

scverse / scirpy

Speed up ir_dist #304