immcantation / presto

pRESTO is part of the Immcantation analysis framework for Adaptive Immune Receptor Repertoire sequencing (AIRR-seq). pRESTO is a bioinformatics toolkit for processing high-throughput lymphocyte receptor sequencing data.
https://presto.readthedocs.io
GNU Affero General Public License v3.0

PairSeq is slow on large files #1

Closed. ssnn-airr closed this issue 9 years ago

ssnn-airr commented 10 years ago

Original report by Jason Vander Heiden (Bitbucket: javh, GitHub: javh).


The index pairing algorithm in IgCore::indexSeqPairs does not scale well. It needs profiling and improvement, possibly by replacing the set intersection with a hash table strategy.
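
For illustration, a hash table strategy could look roughly like the sketch below. This is not the IgCore code: `coord_key` and `pair_by_hash` are hypothetical names, and the header layout (an identifier followed by a read-specific suffix after whitespace or `/`) is an assumption.

```python
def coord_key(header):
    """Hypothetical key function: keep only the part of the header shared by both mates.

    Assumes the shared identifier comes first and any read-specific suffix
    follows after whitespace or a '/' character.
    """
    return header.split()[0].split("/")[0]


def pair_by_hash(headers_1, headers_2, key_func=coord_key):
    """Pair headers from two read files using a dict lookup.

    Builds one hash table over the first collection, then makes a single pass
    over the second, so the work is roughly linear in the total number of reads.
    """
    # Map shared key -> original header from the first file
    table = {key_func(h): h for h in headers_1}

    pairs = []
    for h2 in headers_2:
        h1 = table.get(key_func(h2))
        if h1 is not None:
            pairs.append((h1, h2))
    return pairs
```

Pairs come back in the order of the second collection, so no separate sort or intersection step is needed to match them up.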

ssnn-airr commented 9 years ago

Original comment by Jason Vander Heiden (Bitbucket: javh, GitHub: javh).


This is probably due to file I/O, in particular how Biopython SeqIO.index() accesses specific positions in the file. May need to implement an alternative to SeqIO.index() using the linecache library. Synchronizing the ordering in both files without loading all the sequences into memory is the primary obstacle.

ssnn-airr commented 9 years ago

Original comment by Jason Vander Heiden (Bitbucket: javh, GitHub: javh).


Modified behavior to index one file and iterate over the other. Also removed the indexSeqPairs() step in favor of passing a key_function to SeqIO.index(). Unpaired files are no longer output, but it is much faster.

Also removed indexSeqPairs() step from AssemblePairs and SplitSeq-samplepairs.
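
The index-one, iterate-the-other approach can be sketched with the standard library alone. This is only an illustration, not the pRESTO implementation: a plain byte-offset dict stands in for SeqIO.index() with a key_function, and `read_key`, `index_fastq`, `read_record`, and `iter_pairs` are hypothetical names. It assumes strict 4-line FASTQ records and headers whose shared identifier precedes any whitespace or `/`.

```python
def read_key(header_line):
    """Hypothetical key function: shared identifier from an '@ID suffix' header."""
    return header_line[1:].split()[0].split("/")[0]


def index_fastq(path, key_func=read_key):
    """Map key -> byte offset of each record start.

    Assumes every record is exactly 4 lines. Only offsets are held in memory,
    not the sequences themselves.
    """
    index = {}
    with open(path, "rb") as handle:
        while True:
            offset = handle.tell()
            header = handle.readline()
            if not header.strip():
                break  # end of file or trailing blank line
            index[key_func(header.decode().rstrip("\r\n"))] = offset
            for _ in range(3):
                handle.readline()  # skip sequence, '+', and quality lines
    return index


def read_record(handle, offset):
    """Seek to a stored offset and read one 4-line record."""
    handle.seek(offset)
    return [handle.readline().decode().rstrip("\r\n") for _ in range(4)]


def iter_pairs(path_1, path_2, key_func=read_key):
    """Index file 2, stream file 1, and yield matched records in file 1 order.

    Reads without a mate are silently skipped, which mirrors the trade-off of
    no longer producing unpaired output.
    """
    index_2 = index_fastq(path_2, key_func)
    with open(path_1, "rb") as h1, open(path_2, "rb") as h2:
        while True:
            rec_1 = [h1.readline().decode().rstrip("\r\n") for _ in range(4)]
            if not rec_1[0]:
                break
            offset = index_2.get(key_func(rec_1[0]))
            if offset is not None:
                yield rec_1, read_record(h2, offset)
```

Because the second file is accessed by offset, the two files do not need to be in the same order, and only the streamed record plus its mate are in memory at any time.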