qmarcou / IGoR

IGoR is a C++ software designed to infer V(D)J recombination related processes from sequencing data. Find full documentation at:
https://qmarcou.github.io/IGoR/
GNU General Public License v3.0

Issue: sample size of more than 100000 sequences #40

Open kgrigaityte opened 5 years ago

kgrigaityte commented 5 years ago

Hello,

I'm trying to run IGoR on my T cell receptor beta chain sequences, and everything works great until the sample size exceeds 100,000 sequences.

I'm getting the following error when using the -evaluate command:

[IGoR] ERROR: Exception caught while reading J alignments before inference/evaluation. Make sure alignments were carried previously using "-align --J" or "-align --all" with similar path parameters (working directory, batchname, ...)

I have run -align --all, just like I did for all my other samples, and the J_alignments file was generated in the aligns folder and looks fine. I tried splitting the sample into 4 files and running each one separately, which worked perfectly, so it shouldn't be a problem with the sequences. I only get that error when I use the whole file.
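For reference, the splitting workaround described above can be sketched roughly as follows. This is only an illustration under assumptions: it assumes the input is a plain-text file with exactly one sequence per line (a FASTA or CSV input would need record-aware or header-aware splitting instead), and the function name `split_sequences`, the `chunk_size` default, and the `_partN` file naming are all made up for this example rather than anything IGoR provides.

```python
from itertools import islice
from pathlib import Path


def split_sequences(src, out_dir, chunk_size=50_000):
    """Split a one-sequence-per-line file into files of at most chunk_size lines.

    Hypothetical helper, not part of IGoR. Returns the chunk paths in order.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    src = Path(src)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    chunks = []
    with src.open() as handle:
        while True:
            # Read at most chunk_size lines, so only one chunk is in memory at a time.
            lines = list(islice(handle, chunk_size))
            if not lines:
                break
            # Assumed naming scheme: <stem>_part<k><suffix>
            path = out_dir / f"{src.stem}_part{len(chunks) + 1}{src.suffix}"
            path.write_text("".join(lines))
            chunks.append(path)
    return chunks
```

Each resulting chunk would then be aligned and evaluated as its own independent run, which is what worked for the four-way split.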

Do you have any advice on how to get around this, or are there limitations on file size?

Thanks, Kristina

qmarcou commented 5 years ago

Hello @kgrigaityte, for now IGoR loads all alignments into memory and keeps them there; I guess this strategy becomes problematic when running over large alignment files. There should be a second line in the error message giving the error type. Could you please paste the complete error message (or just edit your post to include it)? There is a tradeoff between browsing a large alignment file on the fly for every sequence (which uses virtually no memory but requires parsing the complete file for each sequence) and storing every alignment in memory (which uses a lot of memory but parses the alignment file only once). To reduce memory usage, there are two paths you could exploit: