iqbal-lab-org / gramtools

Genome inference from a population reference graph
MIT License
93 stars 15 forks source link

Strange sa_intervals found in kmer.precalc for toy example #21

Closed iqbal-lab closed 8 years ago

iqbal-lab commented 8 years ago

PRG is:

ATCGCT5CCGCCGGCGA6G5TTTTT

prg.mask_alleles is 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 2 0 0 0 0 0 0

prg.mask_sites is 0 0 0 0 0 0 0 5 5 5 5 5 5 5 5 5 5 0 5 0 0 0 0 0 0

This is the list of kmers generated by Carlos's script using the -n option:

CTC TCC CCG CGC GCC CGG GGC GCG CGA GAT ATT CTG TGT GTT ATC TCG GCT TTT

and this is the line of the precalc file corresponding to TTT

4 4 4 |1|19 22 22 23 |18 24 25 27 ||5 @|

So this thinks TTT can be found at interval (19,22) and (22,23)

here is the BWM of the PRG

5CCGCCGGCGA6G5TTTTTATCGCT 5TTTTTATCGCT5CCGCCGGCGA6G 6G5TTTTTATCGCT5CCGCCGGCGA A6G5TTTTTATCGCT5CCGCCGGCG ATCGCT5CCGCCGGCGA6G5TTTTT CCGCCGGCGA6G5TTTTTATCGCT5 CCGGCGA6G5TTTTTATCGCT5CCG CGA6G5TTTTTATCGCT5CCGCCGG CGCCGGCGA6G5TTTTTATCGCT5C CGCT5CCGCCGGCGA6G5TTTTTAT CGGCGA6G5TTTTTATCGCT5CCGC CT5CCGCCGGCGA6G5TTTTTATCG G5TTTTTATCGCT5CCGCCGGCGA6 GA6G5TTTTTATCGCT5CCGCCGGC GCCGGCGA6G5TTTTTATCGCT5CC GCGA6G5TTTTTATCGCT5CCGCCG GCT5CCGCCGGCGA6G5TTTTTATC GGCGA6G5TTTTTATCGCT5CCGCC T5CCGCCGGCGA6G5TTTTTATCGC TATCGCT5CCGCCGGCGA6G5TTTT index 19 TCGCT5CCGCCGGCGA6G5TTTTTA index 20 TTATCGCT5CCGCCGGCGA6G5TTT index 21 TTTATCGCT5CCGCCGGCGA6G5TT index 22 TTTTATCGCT5CCGCCGGCGA6G5T index 23 TTTTTATCGCT5CCGCCGGCGA6G5 index 24

reminder, so you don't need to scroll prg=ATCGCT5CCGCCGGCGA6G5TTTTT

The intervals (19,22) and (22,23) make no sense. Also bear in mind Sorina uses [a,b) intervals , so these should be [19,22) and [22,23). None of the rows 19-21 even start with TTT

iqbal-lab commented 8 years ago

OK, @coelias , @sm0179 - I'm looking into this. The code loads up into the kmer hash precisely the intervals found by bidir_search_bwd(), so I'm stepping through that to see why it thinks (19,22) is there

iqbal-lab commented 8 years ago

Just logging this for digestion later. If I delete the precalc file , and run, then I notice that as it is going through bidir_search_bwd, the range search returns (21,5). I don't see how row 21 could ever be associated with 5

iqbal-lab commented 8 years ago

Some more toy stuff:

PRG = A5C6G5T use kmer 2

This is the kmer file AC CT AG GT which is correct

and this is the precalc file 3 4 |0|3 4 |7 9 |5 @| 1 3 |0|1 2 |6 8 |5 2 @| 2 4 |0|2 3 |6 8 |5 @| 1 2 |0|1 2 |6 6 |5 1 @|

Notice it marks all the kmers as not in ref (marked by the |0|). This is a bit confusing. What kmers_in_ref means under the hood is, it does NOT have a match that is directly to the right of a number. That is true for all of these kmers.

So look at the first row 3 4 |0|3 4 |7 9 |5 @|

kmer GT has SA interval [3,4) - ie just the row 3.

bwm of A5C6G5T$ is $A5C6G5T 5C6G5T$A 5T$A5C6G 6G5T$A5C A5C6G5T$ C6G5T$A5 G5T$A5C6 T$A5C6G5

So that is right.

But now look at the second line 1 3 |0|1 2 |6 8 |5 2 @|

kmer AG has interval [1,2) - ie this line 5C6G5T$A - makes no sense (The site info is right - it overlaps site 5 on allele 2)

Another error - final line:

1 2 |0|1 2 |6 6 |5 1 @|

kmer AC has interval [1,2) - ie row 1, ie 5C6G5T$A - again makes no sense

iqbal-lab commented 8 years ago

For the record, the SA intervals generated are correct for a PRG with no variation:

PRG: AACGTT

kmers file

AA AC CG GT TT

precalc file

4 4 |1|6 7 |6 9 || 3 4 |1|4 5 |5 8 || 2 3 |1|3 4 |4 7 || 1 2 |1|2 3 |3 6 || 1 1 |1|1 2 |2 5 ||

which is all consistent with this BWM

$AACGTT AACGTT$ ACGTT$A CGTT$AA GTT$AAC <<<<<< T$AACGT TT$AACG

eg look at 2nd line: 34=GT and this is there on line indexed number 4, marked <<< above So SA interval is [4,5)

iqbal-lab commented 8 years ago

Actually the fastest way to see there is a problem is to look at this

3 4 |0|3 4 |7 9 |5 @| 1 3 |0|1 2 |6 8 |5 2 @| 2 4 |0|2 3 |6 8 |5 @| 1 2 |0|1 2 |6 6 |5 1 @|

both the kmers AG and AC have the same SA interval = [1,2). Must be impossible.

PRG is A5C6G5T

bwm of A5C6G5T is 5C6G5TA 5TA5C6G 6G5TA5C A5C6G5T C6G5TA5 G5TA5C6 TA5C6G5

bwm of A5C6G5T$ is $A5C6G5T 5C6G5T$A 5T$A5C6G 6G5T$A5C A5C6G5T$ C6G5T$A5 G5T$A5C6 T$A5C6G5

iqbal-lab commented 8 years ago

Hm - so this is all a mess as I was generating those BWMs during debugging without converting to integers. Will close this and create a new clean one

iqbal-lab commented 8 years ago

Close this - is a mess. Open a new clean one.