Closed grst closed 10 months ago
MMseqs2 is likely faster when run on all sequences simultaneously and should produce the same results when run with `mmseqs align`: https://github.com/soedinglab/mmseqs2/wiki
When used with `mmseqs2 prefilter`, it's even faster but only yields approximate results.
Calculating distances with mmseqs2 seems feasible. The alignment is ~30x faster than with parasail. There will be some overhead for transforming the alignment scores into distances and for reading in the results, but we will likely end up with a >10x speedup, which sounds worthwhile.
20k x 30k in ~30 sec (alignment only)
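For illustration, the score-to-distance step mentioned above could look roughly like the sketch below. It assumes the distance is defined as the maximum achievable score for a pair (taken here as the smaller of the two self-alignment scores) minus the actual alignment score. The function name `scores_to_distances`, the `(query_id, target_id, score)` hit layout, and the `cutoff` parameter are hypothetical and not part of scirpy or MMseqs2.

```python
def scores_to_distances(hits, self_scores_query, self_scores_target, cutoff=None):
    """Convert sparse alignment hits into distances.

    Assumed definition (not confirmed by this thread):
        d(q, t) = min(S_qq, S_tt) - S_qt
    where S_qq and S_tt are the self-alignment scores.

    hits: iterable of (query_id, target_id, score) tuples.
    self_scores_query / self_scores_target: dicts mapping id -> self-alignment score.
    cutoff: if given, pairs with a distance above it are dropped.
    """
    distances = {}
    for query_id, target_id, score in hits:
        # Best score this pair could reach, limited by the weaker self-alignment.
        max_score = min(self_scores_query[query_id], self_scores_target[target_id])
        dist = max_score - score
        if cutoff is None or dist <= cutoff:
            distances[(query_id, target_id)] = dist
    return distances


# Toy input with made-up scores, purely for demonstration.
hits = [("q1", "t1", 40), ("q1", "t2", 25), ("q2", "t2", 38)]
self_q = {"q1": 40, "q2": 45}
self_t = {"t1": 40, "t2": 38}
result = scores_to_distances(hits, self_q, self_t, cutoff=10)
```

Note that this needs self-alignment scores for both the query and the target sequences, so both sets have to be scored against themselves before the conversion.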
TODOs
Split the `alignment` metric into `alignment_parasail` and `alignment_mmseqs2`; `alignment` will be deprecated and default to `alignment_parasail`.
`mmseqs align`, but only for the query sequences. Possible solution: align both the query and the target database with itself, but require a sequence identity of 100%. This should be very fast.
tcrdist3 might also be significantly faster than what scirpy is doing now. See https://github.com/scverse/scirpy/issues/286#issuecomment-1169430002
Parasail is highly optimized, but it is invoked for every sequence individually from Python, which creates a bottleneck. We could solve this issue by integrating with external tools that have been developed for scalable comparison of immune-cell repertoires (both for Levenshtein and alignment distance).
Possible methods to consider
Ideally, we could provide wrappers for several of them.