qmarcou / IGoR

IGoR is a C++ software designed to infer V(D)J recombination related processes from sequencing data. Find full documentation at:
https://qmarcou.github.io/IGoR/
GNU General Public License v3.0
47 stars 25 forks source link

IGoR sequence generation - random number generation issue #1

Closed sinkovit closed 6 years ago

sinkovit commented 6 years ago

I’ve been using the IGoR software to generate simulated T cell beta chain repertoires and ran into an issue that I believe is related to the random number generation. My colleagues at Vanderbilt University Medical Center, Cinque Soto and James Crowe, suggested I get in touch with you.

We need to generate very large repertoires to compare with the results obtained from deep immunome sequencing being carried out at Vanderbilt. When I was processing IGoR output, I noticed that the same results are generated every 52,895,649 sequences. We discovered this when calculating the number of unique clonotypes as a function of the number of sequences – after 52M+ reads, the number of clonotypes was unchanged.

Here’s the igor command line that we’ve been using. Note that I tried this with two different seeds and get the same behavior and same cycle length (52,895,649 sequences).

igor -threads -1 -set_wd $PWD -batch tcr_beta -species human -chain beta \ -generate 1100000000 --CDR3 --seed 12345678 --noerr

For reasons that I don’t understand, the first few sequences are not repeated, but after a short time the random number generator settles into a predictable cycle.

The example grep output below shows how the 750th sequence is repeated. As you probably know, the “-n” option to grep prefixes the output with the line number for the matching line. You’ll notice that the difference in line numbers is always a multiple of 52,895,649 (e.g. 1,057,913,732 - 752 = 52,895,649 x 20)

$ grep -n 'TGTGCCAGCATCCCTACCTTTTGGTCACCGGGGGGGGAGCACAGATACGCAGTATTTT' \ generated_seqs_noerr_CDR3_info.csv

752:750,TGTGCCAGCATCCCTACCTTTTGGTCACCGGGGGGGGAGCACAGATACGCAGTATTTT,1,0 52896401:52896399,TGTGCCAGCATCCCTACCTTTTGGTCACCGGGGGGGGAGCACAGATACGCAGTATTTT,1,0 105792050:105792048,TGTGCCAGCATCCCTACCTTTTGGTCACCGGGGGGGGAGCACAGATACGCAGTATTTT,1,0 158687699:158687697,TGTGCCAGCATCCCTACCTTTTGGTCACCGGGGGGGGAGCACAGATACGCAGTATTTT,1,0 211583348:211583346,TGTGCCAGCATCCCTACCTTTTGGTCACCGGGGGGGGAGCACAGATACGCAGTATTTT,1,0 264478997:264478995,TGTGCCAGCATCCCTACCTTTTGGTCACCGGGGGGGGAGCACAGATACGCAGTATTTT,1,0 317374646:317374644,TGTGCCAGCATCCCTACCTTTTGGTCACCGGGGGGGGAGCACAGATACGCAGTATTTT,1,0 370270295:370270293,TGTGCCAGCATCCCTACCTTTTGGTCACCGGGGGGGGAGCACAGATACGCAGTATTTT,1,0 423165944:423165942,TGTGCCAGCATCCCTACCTTTTGGTCACCGGGGGGGGAGCACAGATACGCAGTATTTT,1,0 476061593:476061591,TGTGCCAGCATCCCTACCTTTTGGTCACCGGGGGGGGAGCACAGATACGCAGTATTTT,1,0 528957242:528957240,TGTGCCAGCATCCCTACCTTTTGGTCACCGGGGGGGGAGCACAGATACGCAGTATTTT,1,0 581852891:581852889,TGTGCCAGCATCCCTACCTTTTGGTCACCGGGGGGGGAGCACAGATACGCAGTATTTT,1,0 634748540:634748538,TGTGCCAGCATCCCTACCTTTTGGTCACCGGGGGGGGAGCACAGATACGCAGTATTTT,1,0 687644189:687644187,TGTGCCAGCATCCCTACCTTTTGGTCACCGGGGGGGGAGCACAGATACGCAGTATTTT,1,0 740539838:740539836,TGTGCCAGCATCCCTACCTTTTGGTCACCGGGGGGGGAGCACAGATACGCAGTATTTT,1,0 793435487:793435485,TGTGCCAGCATCCCTACCTTTTGGTCACCGGGGGGGGAGCACAGATACGCAGTATTTT,1,0 846331136:846331134,TGTGCCAGCATCCCTACCTTTTGGTCACCGGGGGGGGAGCACAGATACGCAGTATTTT,1,0 899226785:899226783,TGTGCCAGCATCCCTACCTTTTGGTCACCGGGGGGGGAGCACAGATACGCAGTATTTT,1,0 952122434:952122432,TGTGCCAGCATCCCTACCTTTTGGTCACCGGGGGGGGAGCACAGATACGCAGTATTTT,1,0 1005018083:1005018081,TGTGCCAGCATCCCTACCTTTTGGTCACCGGGGGGGGAGCACAGATACGCAGTATTTT,1,0 1057913732:1057913730,TGTGCCAGCATCCCTACCTTTTGGTCACCGGGGGGGGAGCACAGATACGCAGTATTTT,1,0

$ grep -n '(78);(9);(1);(11);(12);(5);(5);(18);(3,1,1,1,3,0,1,1,3,3,3,3,2,2,3,1,0,1);(3);(0,2,2)' \ generated_realizations_noerr.csv

752:750;(78);(9);(1);(11);(12);(5);(5);(18);(3,1,1,1,3,0,1,1,3,3,3,3,2,2,3,1,0,1);(3);(0,2,2) 52896401:52896399;(78);(9);(1);(11);(12);(5);(5);(18);(3,1,1,1,3,0,1,1,3,3,3,3,2,2,3,1,0,1);(3);(0,2,2) 105792050:105792048;(78);(9);(1);(11);(12);(5);(5);(18);(3,1,1,1,3,0,1,1,3,3,3,3,2,2,3,1,0,1);(3);(0,2,2) 158687699:158687697;(78);(9);(1);(11);(12);(5);(5);(18);(3,1,1,1,3,0,1,1,3,3,3,3,2,2,3,1,0,1);(3);(0,2,2) 211583348:211583346;(78);(9);(1);(11);(12);(5);(5);(18);(3,1,1,1,3,0,1,1,3,3,3,3,2,2,3,1,0,1);(3);(0,2,2) 264478997:264478995;(78);(9);(1);(11);(12);(5);(5);(18);(3,1,1,1,3,0,1,1,3,3,3,3,2,2,3,1,0,1);(3);(0,2,2) 317374646:317374644;(78);(9);(1);(11);(12);(5);(5);(18);(3,1,1,1,3,0,1,1,3,3,3,3,2,2,3,1,0,1);(3);(0,2,2) 370270295:370270293;(78);(9);(1);(11);(12);(5);(5);(18);(3,1,1,1,3,0,1,1,3,3,3,3,2,2,3,1,0,1);(3);(0,2,2) 423165944:423165942;(78);(9);(1);(11);(12);(5);(5);(18);(3,1,1,1,3,0,1,1,3,3,3,3,2,2,3,1,0,1);(3);(0,2,2) 476061593:476061591;(78);(9);(1);(11);(12);(5);(5);(18);(3,1,1,1,3,0,1,1,3,3,3,3,2,2,3,1,0,1);(3);(0,2,2) 528957242:528957240;(78);(9);(1);(11);(12);(5);(5);(18);(3,1,1,1,3,0,1,1,3,3,3,3,2,2,3,1,0,1);(3);(0,2,2) 581852891:581852889;(78);(9);(1);(11);(12);(5);(5);(18);(3,1,1,1,3,0,1,1,3,3,3,3,2,2,3,1,0,1);(3);(0,2,2) 634748540:634748538;(78);(9);(1);(11);(12);(5);(5);(18);(3,1,1,1,3,0,1,1,3,3,3,3,2,2,3,1,0,1);(3);(0,2,2) 687644189:687644187;(78);(9);(1);(11);(12);(5);(5);(18);(3,1,1,1,3,0,1,1,3,3,3,3,2,2,3,1,0,1);(3);(0,2,2) 740539838:740539836;(78);(9);(1);(11);(12);(5);(5);(18);(3,1,1,1,3,0,1,1,3,3,3,3,2,2,3,1,0,1);(3);(0,2,2) 793435487:793435485;(78);(9);(1);(11);(12);(5);(5);(18);(3,1,1,1,3,0,1,1,3,3,3,3,2,2,3,1,0,1);(3);(0,2,2) 846331136:846331134;(78);(9);(1);(11);(12);(5);(5);(18);(3,1,1,1,3,0,1,1,3,3,3,3,2,2,3,1,0,1);(3);(0,2,2) 899226785:899226783;(78);(9);(1);(11);(12);(5);(5);(18);(3,1,1,1,3,0,1,1,3,3,3,3,2,2,3,1,0,1);(3);(0,2,2) 952122434:952122432;(78);(9);(1);(11);(12);(5);(5);(18);(3,1,1,1,3,0,1,1,3,3,3,3,2,2,3,1,0,1);(3);(0,2,2) 1005018083:1005018081;(78);(9);(1);(11);(12);(5);(5);(18);(3,1,1,1,3,0,1,1,3,3,3,3,2,2,3,1,0,1);(3);(0,2,2) 1057913732:1057913730;(78);(9);(1);(11);(12);(5);(5);(18);(3,1,1,1,3,0,1,1,3,3,3,3,2,2,3,1,0,1);(3);(0,2,2)

Any help would be greatly appreciated. Let me add that this is fantastic software. We’ve been looking for a rigorous way to generate simulated repertoires and we’re convinced that IGoR is the right way to do this.

qmarcou commented 6 years ago

Hi @sinkovit , Thank your very much for these positive comments! Thank you for this detailed bug report: This indeed seems to be related to the random engine that was used by IGoR. The C++ standard library _default_randomengine is a 32bits linear congruential engine whose state space was indeed too small (thus creating a periodic sequence generation) for large sequence simulations (as generating a sequence already requires drawing ~30 random numbers). I have now updated it to a 64 bits mersenne twister generator which period is ~10^6000, this should leave us some time before having hard disks able to accept such large datasets! After a few tests, this changes even seem to improve (very slightly though as hard disk writing is most likely the bottleneck) IGoR's sequence generation time. I have pushed the modifications in the dev branch and leave this issue open until I incorporate further polishing changes (such as creation of a good quality seed) from: http://www0.cs.ucl.ac.uk/staff/D.Jones/GoodPracticeRNG.pdf This changes will be incorporated in the next release. Please tell me if this does not solve your problem.

PS: although I guess you did it for debuging purposes, one should probably refrain to chose actually chose the seed (in your case 12345678) as it will most likely not exploit all available 64bits. Note that is is however mostly a problem if you're running many different simulations.

qmarcou commented 6 years ago

This issue is now fully fixed:

All this is already available on the dev branch and will constitute part of IGoR's v1.2.0 next release.