scverse / scirpy

A scanpy extension to analyse single-cell TCR and BCR data.
https://scirpy.scverse.org/en/latest/
BSD 3-Clause "New" or "Revised" License

Speed up ir_dist #304

Closed grst closed 10 months ago

grst commented 3 years ago

Parasail is highly optimized, but it is invoked from Python for every sequence individually, which creates a bottleneck. We could solve this by integrating external tools that were developed for scalable comparison of immune-cell repertoires (for both Levenshtein and alignment distance).
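To make the bottleneck concrete, here is a minimal sketch of the call pattern described above: one Python-level call per sequence pair. The pure-Python `levenshtein` function is only a stand-in for the optimized per-pair routine (it is neither parasail nor scirpy's implementation), and `pairwise_distances` is a hypothetical helper name.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(
                min(
                    previous[j] + 1,  # deletion
                    current[j - 1] + 1,  # insertion
                    previous[j - 1] + (char_a != char_b),  # substitution
                )
            )
        previous = current
    return previous[-1]


def pairwise_distances(sequences):
    """Call the distance function once per unordered pair of unique sequences."""
    unique = sorted(set(sequences))
    return {(a, b): levenshtein(a, b) for a, b in combinations(unique, 2)}
```

The number of calls grows quadratically with the number of unique sequences, so even a very fast per-pair routine ends up dominated by per-call overhead. A batch tool that receives all sequences in one invocation avoids that overhead.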

Possible methods to consider

Ideally, we could provide wrappers for several of them.

grst commented 3 years ago

MMseqs2 is likely faster when run on all sequences simultaneously and should produce the same results when run with mmseqs align: https://github.com/soedinglab/mmseqs2/wiki

When used with the MMseqs2 prefilter, it is even faster, but it only yields approximate results.


Calculating distances with MMseqs2 seems feasible. The alignment is ~30x faster than with parasail. There will be some overhead for transforming the alignment scores into distances and for reading in the results, but we will likely end up with a >10x overall speedup, which sounds worthwhile.
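The score-to-distance step could look roughly like the sketch below. It assumes the convention d(a, b) = min(S(a, a), S(b, b)) - S(a, b), where S(a, a) and S(b, b) are self-alignment scores. That convention, the function name `scores_to_distances`, the dict-based input layout, and the example scores are all assumptions for illustration; none of them is specified in this thread.

```python
def scores_to_distances(scores):
    """Turn pairwise alignment scores into distances.

    `scores` maps (query, target) tuples to alignment scores and must
    contain the self-alignment score (s, s) of every sequence involved.
    Assumed convention: d(a, b) = min(S(a, a), S(b, b)) - S(a, b).
    """
    distances = {}
    for (a, b), score in scores.items():
        if a == b:
            continue  # self-alignments only serve as the reference score
        max_score = min(scores[(a, a)], scores[(b, b)])
        distances[(a, b)] = max_score - score
    return distances


# Made-up scores, purely to show the shape of the input.
example_scores = {
    ("CASSL", "CASSL"): 28,
    ("CASSF", "CASSF"): 30,
    ("CASSL", "CASSF"): 22,
}
print(scores_to_distances(example_scores))
```

Pairs that are absent from the input (for example, because a prefilter dropped them) simply do not appear in the result, which fits a sparse distance representation.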

Benchmark: 20k x 30k sequences in ~30 sec (alignment only)
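As a quick back-of-the-envelope check of the figures quoted above, the snippet below works out the implied throughput. It assumes that all 20k x 30k pairs are compared and that the ~30 sec refers to the MMseqs2 alignment step; the parasail time is only what the ~30x figure implies, not a separate measurement.

```python
n_query, n_target = 20_000, 30_000
mmseqs_seconds = 30  # "~30 sec (alignment only)"
speedup = 30  # "~30x faster" than parasail

n_pairs = n_query * n_target
pairs_per_second = n_pairs / mmseqs_seconds
parasail_seconds = mmseqs_seconds * speedup  # implied, not measured

print(f"{n_pairs:,} pairs, about {pairs_per_second:,.0f} pairs/s")
print(f"implied parasail time: about {parasail_seconds / 60:.0f} min")
```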

TODOs

grst commented 2 years ago

tcrdist3 might also be significantly faster than what scirpy is doing now. See https://github.com/scverse/scirpy/issues/286#issuecomment-1169430002