pmelsted / KmerStream

Streaming algorithm for computing kmer statistics for massive genomics datasets
53 stars 11 forks source link

Inconsistent predictions of G from ground truth set #10

Open tseemann opened 8 years ago

tseemann commented 8 years ago

Pall,

I have a bacterial genome 2866389 bp (2.8 Mbp) which is finished/closed and a set of Illumina PE reads to 90x coverage.

I ran the following for Q = 20, 10 and 0 and then put it through KmerStreamEstimate

 KmerStream -q Q -k 31,41,51,61,71,81,91,101,111,121 -o k.tab -t 36 --tsv R1.fq.gz R2.fq.gz

I am using git HEAD with the recent commits.

I was hoping to see some consistency with the G estimates, but it seems to be changing wrt k?

Q=20

Q   k    F0       f1      F1        F0-f1      G              e                 lambda
20  31   3299585  493126  99828961  2806459.0  2805991.89701  0.00494906985143  35.5770667429
20  41   3344589  557769  81730492  2786820.0  2786365.13834  0.00683562471453  29.3322977938
20  51   3333071  592388  66978637  2740683.0  2740263.7089   0.00885695513175  24.4424055913
20  61   3295172  616007  54567924  2679165.0  2678777.41407  0.0113030219247   20.3704584462
20  71   3235748  625295  43968174  2610453.0  2610103.65399  0.0142374984179   16.845374678
20  81   3133043  612159  34858661  2520884.0  2520623.97029  0.0175775131132   13.8293777299
20  91   3003070  596893  27015406  2406177.0  2406397.84817  0.0220979239132   11.2264919205
20  101  2853560  593094  20294110  2260466.0  2263858.37302  0.0290888014808   8.96439028247
20  111  2650170  593007  14587743  2057163.0  2075764.26773  0.0395590939439   7.02764915398
20  121  2404494  630370  9798492   1774124.0  1849163.75171  0.0579818605305   5.29887739307

Q=10

Q   k    F0        f1        F1         F0-f1      G              e                lambda
10  31   23238820  19188481  211932108  4050339.0  3419501.83239  0.0965584341374  61.97748046
10  41   24951970  21071837  193657981  3880133.0  3289632.00191  0.114964377917   58.8691929333
10  51   25741248  22060720  175436035  3680528.0  3140455.77674  0.131954691927   55.8632400746
10  61   25728340  22204961  157291715  3523379.0  3057818.75638  0.14713136109    51.4391883665
10  71   24981307  21547726  139242131  3433581.0  3060875.69114  0.160134075816   45.4909460725
10  81   23760034  20483993  121304895  3276041.0  2974236.75104  0.173863978665   40.7852182439
10  91   22183540  19087225  103504595  3096315.0  2854603.09556  0.189099634088   36.2588393325
10  101  20059862  17058780  85868079   3001082.0  2826521.24778  0.202742260699   30.3794210171
10  111  17461606  14690693  68423270   2770913.0  2645626.03974  0.218375670316   25.8627897413
10  121  14471688  11808783  51195091   2662905.0  2587409.56694  0.233618232371   19.7862339438

Q=0

Q  k    F0        f1        F1         F0-f1      G              e                lambda
0  31   23280809  19248612  211932108  4032197.0  3391482.35372  0.0969371952455  62.4895210696
0  41   25008847  21113700  193657981  3895147.0  3305218.88169  0.115174411602   58.5915753031
0  51   25662770  21889372  175436035  3773398.0  3263244.82718  0.130631908465   53.7612236565
0  61   25668828  22092345  157291715  3576483.0  3126446.66034  0.14621551104    50.3100586987
0  71   24997343  21488823  139242131  3508520.0  3148709.00282  0.159523840018   44.221975062
0  81   23847969  20582358  121304895  3265611.0  2959234.83715  0.17475086963    40.9919799122
0  91   22182579  19119747  103504595  3062832.0  2816923.81421  0.189495578172   36.7438389629
0  101  20085270  17120760  85868079   2964510.0  2786033.1511   0.20355585144    30.8209107153
0  111  17491351  14696703  68423270   2794648.0  2670449.51692  0.218431506673   25.6223791412
0  121  14489590  11867384  51195091   2622206.0  2544646.33846  0.234843802093   20.1187450792
pmelsted commented 8 years ago

Thanks for reporting this. The error rate/Genome size model isn't doing to well with high coverage. This is mostly due to the assumption that k-mers at distance 2 from the truth would not be observed twice by chance. Once you have higher coverage you would need to model that explicitly.

I'm working on extending the model so that it can account better for this. Additionally I'll have to replace the brentp method that is not working in #11

tseemann commented 7 years ago

I sense a Catch-22 however - how do I know I have too much depth until I know the genome size?