Issue: sample size of more than 100000 sequences

qmarcou / IGoR

IGoR is a C++ software designed to infer V(D)J recombination related processes from sequencing data. Find full documentation at:

GNU General Public License v3.0


Hello,

I'm trying to run IGoR on my T cell receptor beta chain sequences, and everything works great until my sample size goes above 100,000 sequences.

I'm getting the following error when using the -evaluate command:

[IGoR] ERROR: Exception caught while reading J alignments before inference/evaluation. Make sure alignments were carried previously using "-align --J" or "-align --all" with similar path parameters (working directory, batchname, ...)

I have run -align --all, just like I did for all my other samples, and the J_alignments file was generated in the aligns folder and looks fine. I tried splitting the sample into 4 files and processing each one separately, which worked perfectly, so it shouldn't be a problem with the sequences. It is only when I use the whole file that I get that error.

Do you have any advice on how to get around this, or are there limitations on file size?

Thanks, Kristina

Hello @kgrigaityte,

For now IGoR loads all alignments into memory and keeps them there, and I guess this strategy becomes problematic when running over large alignment files. You should have a second line in the error message giving you the error type. Could you please paste the complete error message (or just edit your post to include it)?

There is a tradeoff between browsing a large alignment file on the fly for every sequence (uses virtually no memory, but requires parsing the complete file for each sequence) and storing every alignment in memory (uses a lot of memory, but parses the alignment file only once). To reduce memory usage, there are two paths you could exploit:

- Filter alignments more drastically when aligning or when reading alignments, by playing with the alignment score thresholds or relative score thresholds (although now that I think about it, I am not sure I have created a command line option for the latter yet).
- Try shortening your gene names (if you're using the complete IMGT name, the string will take up a lot of memory compared to a shorter name). This may sound silly, but it may be a real problem for large sequence sets.

I'm a bit busy at the moment, but I'll try to spend some time finding a better tradeoff in terms of input reading for large datasets once I get some time. Hope this helps!
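For anyone wanting to try the two suggestions outside of IGoR itself, here is a minimal Python sketch. It is not part of IGoR and makes several assumptions: that the alignment file is a semicolon-delimited text file with a header row containing a `score` column, and that the genomic template FASTA uses pipe-delimited IMGT-style headers with the gene name in the second field. The function names `filter_alignments` and `shorten_fasta_headers` are hypothetical, so please check both assumptions against your own files before relying on this.

```python
import csv
import io


def filter_alignments(src, dst, min_score, score_col="score", delimiter=";"):
    """Copy alignment rows from src to dst, keeping rows with score >= min_score.

    Rows are streamed one at a time, so memory use does not grow with file size.
    The column name and delimiter are assumptions about the file layout.
    Returns a (kept, total) tuple of row counts.
    """
    reader = csv.DictReader(src, delimiter=delimiter)
    if reader.fieldnames is None or score_col not in reader.fieldnames:
        raise ValueError(f"column {score_col!r} not found in header")
    writer = csv.DictWriter(
        dst, fieldnames=reader.fieldnames, delimiter=delimiter, lineterminator="\n"
    )
    writer.writeheader()
    kept = total = 0
    for row in reader:
        total += 1
        if float(row[score_col]) >= min_score:
            writer.writerow(row)
            kept += 1
    return kept, total


def shorten_fasta_headers(src, dst, field=1, sep="|"):
    """Rewrite FASTA headers, keeping only one separator-delimited field.

    Assumes IMGT-style headers where field index 1 holds the gene name.
    Headers with too few fields keep their first field. Sequence lines are
    copied unchanged. Returns the number of headers rewritten.
    """
    count = 0
    for line in src:
        if line.startswith(">"):
            parts = line[1:].rstrip("\n").split(sep)
            name = parts[field] if len(parts) > field else parts[0]
            dst.write(">" + name + "\n")
            count += 1
        else:
            dst.write(line)
    return count
```

Both functions take open file objects, so they can be used with `open(...)` on real files or with `io.StringIO` for a quick check. Note that the shortened names would have to be in place before aligning, and any model or other file that refers to genes by name would presumably need the same names.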

Issue: sample size of more than 100000 sequences #40