iqbal-lab-org / gramtools

Genome inference from a population reference graph
MIT License
92 stars 15 forks source link

Unexpected assertion for specific kmer #35

Closed iqbal-lab closed 7 years ago

iqbal-lab commented 8 years ago

Sorina spotted some samples were hitting/failing an assertion. Basically, if a read was mapping horizontally uniquely but not vertically uniquely, this assert was checking that at that time, nothing had been pushed onto the sites list, in the knowledge that it was going to be pushed after exiting the current function. Actually the problem can be reproduced using just the last kmer of the read, and putting it in the kmers input file.

So, what is the command?

gramtools -p Chr10_div_in_msp34_prg.in 
-c csa_file -i test.fastq -s Mask_sites -a Mask_alleles 
-v Coverages -r Reads_processed -b int_alphabet_file 
-l csa_memory -k 9 -f  onekmer

where

cat onekmer
TATGAAGAA

cat test.fastq
@0
AAACCAGGATGATTCACAACTATTAGAAATATCAAAAAAGGATAAATGTAAAGAAGCATTAAAGCATTATGAAGAA
+
HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH

OK, so what happens. First of all, if you look in the PRG file, you can see that the kmer overlaps site 79, and overlaps the end of many alleles, with AA poking out of the right hand end of the site. So, when we precalculate the sa_interval and sites for this kmer, you should see all these alleles in the sites object. But in fact somehow it reduces it all down to just alleles 24 and 34.

I guess the thing to do is run it in gdb and look at where sa_intervals and sites are erased.

iqbal-lab commented 8 years ago

These are the alleles at site 79 - you will need to scroll right to see the end, and to notice I have marked alleles 24 and 34

AGGATGTAATAAAATTATGTAATATCCAACAATTTACAAACCAGGATGATTCACAACTATTAGAAATATCAAAAAAGGATAAATGTAAAGAAGCATTAAAGCAT**TATGAAG**
AGAATGTAATAGAATTATGTAATGTAAGACAAATTACAAACCAGAATGATTCACAACTATTAGAAATATCCAAAAAGGATGAATGTAAAGGTGCATTAAAGCAT**TATGAAG**
AGGATGTAATAAAATTATGTAATATCCAACAATTTACAAACCAGAATGATTCACAACTATTAGAAATATCCAAAAAGGATGAATGTAAAGGTGCATTAAAGCAT**TATGAAG**
AGGATGTAATAAAATTATGTAATGTAAGACAAATTACAAACCAGAATGATTCACAACTATTAGAAATATCAAAAAAGGATAAATGTAAAGAAGCATTAAAGCATTATGAAG
AGGATGTAATAAAATTATGTAATATCCAACAATTTACAAACCAGGATGATTCACAACTATTAGAAATATCAAACAAGGATAAATGTAAAGAAGCATTAAAGCATTATGAAG
AGGATGTAATAGAATTATGTAATATCCAACAATTTACAAACCAGGATGATTCACAACTATTAGAAATATCAAAAAAGGATAAATGTAAAGAAGCATTAAAGCATTATGAAG
AGAATGTAATAGAATTATGTAATGTAAGACAAATTACAAACCAGAATGATTCACAACTATTAGAAATATCAAAAAAGGATAAATGTAAAGAAGCATTAAAGCATTATGAAG
AGGATGTAATAAAATTATGTAATATCCAACAATTTACAAACCAGGATGATTCACAACTATTAGAAATATCAAAAAAGGATGAATGTAAAGAAGCATTAAAGCATTATGAAG
AGAATGTAATAGAATTATGTAATGTAAGACAAATTACAAACCAGAATGATTCACAACTATTAGAAATATCAAAAGAGGATAAATGTAAAGGTGCATTAAAGCATTATGAAG
AGGATGTAATAAAATTATGTAATATCCAACAATTTACAAACCAGAATGATTCACAACTATTAGAAATATCAAAAGAGGATAAATGTAAAGGTGCATTAAAGCATTATGAAG
AGAATGTAATAGAATTATGTAATGTAAGACAAATTACAAACCAGAATGATTCACAACTATTAGAAATATCAAAAGAGGATAAATGTAAAGGTGCATTAAAGCATTATGAAA
AGAATGTAATAGAATTATGTAATGTAAGACAAATTACAAACCAGGATGATTCACAACTATTAGAAATATCAAAAAAGGATAAATGTAAAGAAGCATTAAAGCATTATGAAG
AGGATGTAATAAAATTATGTAATATCCAACAATTTACAAACCAGGATGATTCACAACTATTAGAAATATCAAAAAAGGATAAATGTAAAGGTGCATTAAAGAATTATGAAG
AGGATGTAATAAAATTATGTAATATCCAACAAATTACAAACCAGGATGATTCACAACTATTAGAAATATCAAAAAAGGATAAATGTAAAGAAGCATTAAAGCATTATGAAG
AGGATGTAATAAAATTATGTAATATCCAACAATTTACAAACCAGGATGATTCACAACTATTAGAAATATCAAAAAAGGATAAATGTAAAGAAGCATTAAAGAATTATGAAG
AGAATGTAATAGAATTATGTAATGTAAGACAAATTACAAACCAGAATGATTCACAACTATTAGAAATATCAAACAAGGATAAATGTAAAGGTGCATTAAAGCATTATGAAG
AGGATGTAATAAAATTATGTAATGTCCAACAATTTACAAACCAGGATGATTCACAACTATTAGAAATATCAAAAAAGGATAAATGTAAAGAAGCATTAAAGCATTATGAAG
AGGATGTAATAAAATTATGTAATGTAAGACAAATTACAAACCAGAATGATTCACAACTATTAGAAATATCAAAAGAGGATAAATGTAAAGGTGCATTAAAGCATTATGAAG
AGGATGTAATAAAATTATGTAATATCCAAGACAATTTACAAACCAGGATGATTCACAACTATTAGAAATATCAAAAAAGGATAAATGTAAAGAAGCATTAAAGCATTATGAAG
AGAATGTAATAAAATTATGTAATATCCAACAATTTACAAACCAGGATGATTCACAACTATTAGAAATATCAAAAAAGGATAAATGTAAAGAAGCATTAAAGCATTATGAAG
AGGATGTAATAAAATTATGTAATATCCAACAATTTACAAACCAGGATGATTCACAACTATTAGAAATATCAATAAAGGATAAATGTAAAGAAGCATTAAAGCATTATGAAG
AGGATGTAATAAAATTATGTAATATCCAACAATTTACAAACCAGGATGATTCACAACTATTAGAAATATCAAAAAAGGATAAATGTAAAGAAGCATTAAAGCATTATGAAA
AGGATGTAATAAAATTATGTAATATCCAACAATTTACAAACCAGGATGATTCACAACTATTAGAAATATCAAAAGAGGATAAATGTAAAGAAGCATTAAAGCATTATGAAG
AAAATGTAATAGAATTATGTAATGTAAGACAAATTACAAACCAGAATGATTCACAACTATTAGAAATATCAAAAGAGGATAAATGTAAAGGTGCATTAAAGCATTATGAAG 24
AGGATGTAATAAAATTATGTAATATCCAACAATTTACAAACCAGGATGATTCACAACTATTAGAAATATCCAAAAAGGATAAATGTAAAGAAGCATTAAAGCATTATGAAG
AGGATGTAATAAAATTATGTAATATCCAACAATTTACAAACCAGGATGATTCACAACTATCAGAAATATCAAAAAAGGATAAATGTAAAGAAGCATTAAAGCATTATGAAG
AGGATGTAATAAAATTATGTAATATCCAACAATTTACAAACCAGGATGGTTCACAACTATTAGAAATATCAAAAAAGGATAAATGTAAAGAAGCATTAAAGCATTATGAAG
AGGATGTAATAAAATTATGTAATATCCAACAATTTACAAACCAGGATGATTCACAACTATTAGAAATATCAAAAAAGGATAAATGTAAAGATGCATTAAAGCATTATGAAG
AGGATGTAATAAAATTATGTAATATCCAACAATTTACAAACCAGAATGATTCACAACTATTAGAAATATCAAAAAAGGATAAATGTAAAGAAGCATTAAAGCATTATGAAG
AGGATGTAATAAAATTATGTAATATCCAACAATTTACAAACCAGGATGATTCACAACTATTAGAAATATCAAAAAAGGATAAATGTAAAGGTGCATTAAAGCATTATGAAG
AGGATGTAATAAAATTATGTAATATCCAACAATTTACAAACCAGGATGATTCACAATTATTAGAAATATCAAAAAAGGATAAATGTAAAGAAGCATTAAAGCATTATGAAG
AGAATGTAATAGAATTATGTAATGTAAGACAAATTACAAACCAGAATGATTCACAACTATTAGAAATATCAAAAGAGGATAAATGTAAAGGTGCATTGAAGCATTATGAAG
AGGATGTAATAAAATTATGTAATATCCAACAATTTACAAACCAGGATGATTTACAACTATTAGAAATATCAAAAAAGGATAAATGTAAAGAAGCATTAAAGCATTATGAAG
AAGATGTAATAAAATTATGTAATATCCAACAATTTACAAACCAGGATGATTCACAACTATTAGAAATATCAAAAAAGGATAAATGTAAAGAAGCATTAAAGCATTATGAAG 34
AGGATGTAATAAAACTATGTAATATCCAACAATTTACAAACCAGGATGATTCACAACTATTAGAAATATCAAAAAAGGATAAATGTAAAGAAGCATTAAAGCATTATGAAG
AGAATGTAATAGAATTATGTAATATCCAACAATTTACAAACCAGGATGATTCACAACTATTAGAAATATCAAAAAAGGATAAATGTAAAGAAGCATTAAAGCATTATGAAG
AGGATGTAATAGAATTATGTAATGTAAGACAAATTACAAACCAGAATGATTCACAACTATTAGAAATATCAAAAGAGGATAAATGTAAAGGTGCATTAAAGCATTATGAAG
AGGATGTAATAAAATTATGTAATATCCAACAATTTACAAACCAGGATGATTCACAACTATTAGAAATATCAAAAAAGGATAAATGTAAAGGAGCATTAAAGCATTATGAAG
AGGATGTAATAAAATTATGTAATATCCAACAATTTACAAACCAGGATGATTCACAACTATTAGAAATATCAAAAGAGGATAAATGTAAAGGTGCATTAAAGCATTATGAAG
AGAATGTAATAGAATTATGTAATGTAAGACAAATTACAAACCAGAATGGTTCACAACTATTAGAAATATCAAAAAAGGATAAATGTAAAGAAGCATTAAAGCATTATGAAG

The next two bases after this site are AA

So our kmer TATGAAGAA overlaps many (in fact all except two) of these alleles, running past the right hand 79 marker and then finishing on the AA adjacent. I've highlighted the first few.

iqbal-lab commented 7 years ago

I think this issue is v old andmaybe dead @sm0179 ?

sm0179 commented 7 years ago

yes