Closed iqbal-lab closed 8 years ago
OK, @coelias , @sm0179 - I'm looking into this. The code loads up into the kmer hash precisely the intervals found by bidir_search_bwd(), so I'm stepping through that to see why it thinks (19,22) is there
Just logging this for digestion later. If I delete the precalc file , and run, then I notice that as it is going through bidir_search_bwd, the range search returns (21,5). I don't see how row 21 could ever be associated with 5
Some more toy stuff:
PRG = A5C6G5T use kmer 2
This is the kmer file AC CT AG GT which is correct
and this is the precalc file 3 4 |0|3 4 |7 9 |5 @| 1 3 |0|1 2 |6 8 |5 2 @| 2 4 |0|2 3 |6 8 |5 @| 1 2 |0|1 2 |6 6 |5 1 @|
Notice it marks all the kmers as not in ref (marked by the |0|). This is a bit confusing. What kmers_in_ref means under the hood is, it does NOT have a match that is directly to the right of a number. That is true for all of these kmers.
So look at the first row 3 4 |0|3 4 |7 9 |5 @|
kmer GT has SA interval [3,4) - ie just the row 3.
bwm of A5C6G5T$ is $A5C6G5T 5C6G5T$A 5T$A5C6G 6G5T$A5C A5C6G5T$ C6G5T$A5 G5T$A5C6 T$A5C6G5
So that is right.
But now look at the second line 1 3 |0|1 2 |6 8 |5 2 @|
kmer AG has interval [1,2) - ie this line 5C6G5T$A - makes no sense (The site info is right - it overlaps site 5 on allele 2)
Another error - final line:
1 2 |0|1 2 |6 6 |5 1 @|
kmer AC has interval [1,2) - ie row 1, ie 5C6G5T$A - again makes no sense
For the record, the SA intervals generated are correct for a PRG with no variation:
PRG: AACGTT
kmers file
AA AC CG GT TT
precalc file
4 4 |1|6 7 |6 9 || 3 4 |1|4 5 |5 8 || 2 3 |1|3 4 |4 7 || 1 2 |1|2 3 |3 6 || 1 1 |1|1 2 |2 5 ||
which is all consistent with this BWM
$AACGTT AACGTT$ ACGTT$A CGTT$AA GTT$AAC <<<<<< T$AACGT TT$AACG
eg look at 2nd line: 34=GT and this is there on line indexed number 4, marked <<< above So SA interval is [4,5)
Actually the fastest way to see there is a problem is to look at this
3 4 |0|3 4 |7 9 |5 @| 1 3 |0|1 2 |6 8 |5 2 @| 2 4 |0|2 3 |6 8 |5 @| 1 2 |0|1 2 |6 6 |5 1 @|
both the kmers AG and AC have the same SA interval = [1,2). Must be impossible.
PRG is A5C6G5T
bwm of A5C6G5T is 5C6G5TA 5TA5C6G 6G5TA5C A5C6G5T C6G5TA5 G5TA5C6 TA5C6G5
bwm of A5C6G5T$ is $A5C6G5T 5C6G5T$A 5T$A5C6G 6G5T$A5C A5C6G5T$ C6G5T$A5 G5T$A5C6 T$A5C6G5
Hm - so this is all a mess as I was generating those BWMs during debugging without converting to integers. Will close this and create a new clean one
Close this - is a mess. Open a new clean one.
PRG is:
ATCGCT5CCGCCGGCGA6G5TTTTT
prg.mask_alleles is 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 2 0 0 0 0 0 0
prg.mask_sites is 0 0 0 0 0 0 0 5 5 5 5 5 5 5 5 5 5 0 5 0 0 0 0 0 0
This is the list of kmers generated by Carlos's script using the -n option:
CTC TCC CCG CGC GCC CGG GGC GCG CGA GAT ATT CTG TGT GTT ATC TCG GCT TTT
and this is the line of the precalc file corresponding to TTT
4 4 4 |1|19 22 22 23 |18 24 25 27 ||5 @|
So this thinks TTT can be found at interval (19,22) and (22,23)
here is the BWM of the PRG
5CCGCCGGCGA6G5TTTTTATCGCT 5TTTTTATCGCT5CCGCCGGCGA6G 6G5TTTTTATCGCT5CCGCCGGCGA A6G5TTTTTATCGCT5CCGCCGGCG ATCGCT5CCGCCGGCGA6G5TTTTT CCGCCGGCGA6G5TTTTTATCGCT5 CCGGCGA6G5TTTTTATCGCT5CCG CGA6G5TTTTTATCGCT5CCGCCGG CGCCGGCGA6G5TTTTTATCGCT5C CGCT5CCGCCGGCGA6G5TTTTTAT CGGCGA6G5TTTTTATCGCT5CCGC CT5CCGCCGGCGA6G5TTTTTATCG G5TTTTTATCGCT5CCGCCGGCGA6 GA6G5TTTTTATCGCT5CCGCCGGC GCCGGCGA6G5TTTTTATCGCT5CC GCGA6G5TTTTTATCGCT5CCGCCG GCT5CCGCCGGCGA6G5TTTTTATC GGCGA6G5TTTTTATCGCT5CCGCC T5CCGCCGGCGA6G5TTTTTATCGC TATCGCT5CCGCCGGCGA6G5TTTT index 19 TCGCT5CCGCCGGCGA6G5TTTTTA index 20 TTATCGCT5CCGCCGGCGA6G5TTT index 21 TTTATCGCT5CCGCCGGCGA6G5TT index 22 TTTTATCGCT5CCGCCGGCGA6G5T index 23 TTTTTATCGCT5CCGCCGGCGA6G5 index 24
reminder, so you don't need to scroll prg=ATCGCT5CCGCCGGCGA6G5TTTTT
The intervals (19,22) and (22,23) make no sense. Also bear in mind Sorina uses [a,b) intervals , so these should be [19,22) and [22,23). None of the rows 19-21 even start with TTT