Closed: ssnn-airr closed this issue 9 years ago
Original comment by Jason Vander Heiden (Bitbucket: javh, GitHub: javh).
This is probably due to file I/O, in particular how Biopython SeqIO.index() accesses specific positions in the file. May need to implement an alternative to SeqIO.index() using the linecache library. Synchronizing the ordering in both files without loading all the sequences into memory is the primary obstacle.
Original comment by Jason Vander Heiden (Bitbucket: javh, GitHub: javh).
Modified behavior to index one file and iterate over the other. Also removed the indexSeqPairs() step in favor of passing a key_function to SeqIO.index(). Unpaired files are no longer output, but pairing is much faster.
Also removed indexSeqPairs() step from AssemblePairs and SplitSeq-samplepairs.
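For readers following along, here is a minimal standard-library sketch of the "index one file and iterate the other" idea described above. It is not the project's code and does not use Biopython: `index_fastq()` is a hypothetical stand-in for what SeqIO.index() with a key_function provides (a key-to-file-offset lookup), and `pair_key()` is an assumed key function that strips a trailing /1 or /2 from the read name. It assumes plain four-line FASTQ records.

```python
def pair_key(header):
    """Hypothetical key function: drop '@', the description, and a trailing /1 or /2."""
    name = header[1:].split()[0]
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def index_fastq(path, key_function=pair_key):
    """Map each record key to the byte offset of its header line (4-line FASTQ assumed)."""
    offsets = {}
    with open(path, "rb") as handle:
        while True:
            start = handle.tell()
            header = handle.readline()
            if not header:
                break
            offsets[key_function(header.decode().rstrip())] = start
            for _ in range(3):  # skip the sequence, '+', and quality lines
                handle.readline()
    return offsets


def read_record(handle, offset):
    """Read one 4-line FASTQ record starting at the given byte offset."""
    handle.seek(offset)
    return [handle.readline().decode().rstrip() for _ in range(4)]


def iter_pairs(path_1, path_2, key_function=pair_key):
    """Stream file 1 in order and fetch each mate through the index of file 2."""
    index_2 = index_fastq(path_2, key_function)
    with open(path_1, "rb") as handle_1, open(path_2, "rb") as handle_2:
        while True:
            record_1 = [handle_1.readline().decode().rstrip() for _ in range(4)]
            if not record_1[0]:
                break
            offset = index_2.get(key_function(record_1[0]))
            if offset is not None:  # reads without a mate are skipped, not written out
                yield record_1, read_record(handle_2, offset)
```

Only the key-to-offset table for the second file is held in memory; sequences are read on demand. Output order follows the first file, so no separate synchronization pass is needed, and reads without a mate simply fall through, which matches the note that unpaired files are no longer produced.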
Original report by Jason Vander Heiden (Bitbucket: javh, GitHub: javh).
The index pairing algorithm in IgCore::indexSeqPairs does not scale well. It needs profiling and improvement, possibly by implementing a hash table strategy instead of set intersection.
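To illustrate the two strategies named in the report, here is a rough sketch. It is only an assumption about what "set intersection" and "hash table" pairing might look like; it is not the actual indexSeqPairs code. Both function names are hypothetical, and read IDs are assumed to be unique within each input.

```python
def pair_by_set_intersection(ids_1, ids_2):
    """Build a full table for both inputs, then intersect the key sets."""
    table_1 = {key: i for i, key in enumerate(ids_1)}
    table_2 = {key: i for i, key in enumerate(ids_2)}
    shared = set(table_1) & set(table_2)
    # Sort by position in the first input to restore a stable order.
    return [(table_1[k], table_2[k]) for k in sorted(shared, key=table_1.get)]


def pair_by_hash_probe(ids_1, ids_2):
    """Build a table for the second input only and probe it while streaming the first."""
    table_2 = {key: i for i, key in enumerate(ids_2)}
    for i, key in enumerate(ids_1):
        j = table_2.get(key)
        if j is not None:
            yield i, j
```

The probing version keeps one table instead of two plus a set, avoids the sort, and yields pairs lazily in the order of the first input. Whether that accounts for the slowdown reported here would still need profiling.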