iqbal-lab-org / pandora

Pan-genome inference and genotyping with long noisy or short accurate reads
MIT License

Differences in gene presence/absence #210

Open LeahRoberts opened 4 years ago

LeahRoberts commented 4 years ago

I have been looking at differences between two almost identical Klebsiella isolates (KN0056A-F and KN0056A-L) in the Pandora VCF output. For several regions in the reference, Pandora suggests that the gene has zero coverage in one isolate (so it is called absent) while it is present in the other. However, when I check these genes in the de novo assemblies, I find them in both isolates (with zero differences between them).

This is one example:

##contig=<ID=Cluster_560>
Cluster_560     1       .       CGTA    CGTG    .       .       SVTYPE=PH_SNPs;GRAPHTYPE=NESTED GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF      .:0,0:0,0:0,0:0,0:0,0:0,0:1,1:-60,-60:0 .:0,0:0,0:0,0:0,0:0,0:0,0:1,1:-88,-88:0
Cluster_560     1       .       CGTAAAGCACCTCGACGCCATTCAGAATTTCGGCGCGATGGACATCCTCTGCACCGATAAAACCGGCACCTTGACCCAGGATAAGATTGTGCTGGAGAACCATACCGACGTCTCCGGCAAGGTCAGCGAGCGCGTACTGCATGCCGCTTGGCTGAACAGCCACTACCAGACCGGCCTGAAAAATCTGCTCGACACCGCGGTGCTGGACGGGGTTGAGCTGGATGCCGCCCGCGGGCTGGCGGCGCGCTGGCAGAAAGTGGATGAGATCCCCTTCGATTTCGAACGCCGCCGCATGTCGGTGGTGGTGAAAGAGGAGGACGCCGCGCATCAGCTGATCTGCAAAGGGGCGCTGCAGGAGATCCTCAACGTCTCGACCCAGGTGCGCTACAACGGCGATATCGTACCGCTGGACGACACCATGCTGCGCCGCATTCGCCGGGTGACCGATACCCTCAACCGACAGGGGCTACGGGTGGTGGCGGTGGCGACCAAATACCTGCCGGCCCGCGAAGGCGACTACCAGCGCGCCGATGAGTCGGACCTGATCCTTGAAGG     CGTGAAGCACCTCGACGCCATTCAGAATTTCGGCGCGATGGACATCCTCTGCACTGATAAAACCGGCACCCTGACCCAGGATAAGATTGTGCTGGAGAACCATACCGACGTCTCCGGCAAGGTCTGCGAGCGGGTACTGCATGCCGCCTGGCTCAACAGCCACTACCAGACCGGCCTGAAAAACCTGCTCGACACCGCGGTGCTGGACGGGGTTGAGCTGGATGCCGCCCGCGGGCTGGCGGAACGCTGGCAGAAGGTGGATGAGATACCCTTCGACTTCGAACGCCGCCGCATGTCGGTGGTGGTGAAGGAGGATGACGCCGCGCATCAGCTGATCTGCAAAGGGGCGCTGCAGGAGATCCTCAACGTCTCGACCCAGGTGCGCTACAACGGCGATATCGTACCTCTGGATGACACCATGCTGCGCCGCATTCGCCGGGTGACCGATACCCTCAACCGGCAGGGGCTGCGGGTGGTGGCGGTGGCGACTAAATACCTGCCGGCCCGCGAAGGCGACTACCAGCGCGCCGATGAGTCGGACCTGATCCTTGAAGGTTACATCGCCTTCCTCGATCCGCCGAAAGAGACCACCGCCCCGGCGCTGAAGGCGCTGAAGGCCAGCGGCATCACGGTGAAGATCCTCACCGGCGACAGCGAGCTGGTGGCGGCGAAGGTGTGCCATGAAGTGGGACTGGATGCTGGCGAAGTGGTGATTGGCAGCCAGATCGAAGCCATGAGCGACGACGAACTGGCGGCGCTGGCCAAACGCACCACGCTGTTCGCCCGCCTGGCGCCGCTGCATAAAGAGCGTATCGTGACGCTGCTCAAGCGTGAAGGTCACGTGGTGGGCTTTATGGGCGACGGCATCAACGACGCCCCGGCGCTGCGCGCGGCGGATATCG,GTCGGACCTGATCCTTGAAGGTTACATCGCCTTCCTCGATCCGCCGAAAGAGACCACCGCCCCGGCGCTGAAGGCGCTGAAGGCCAGCGGCATCACG        .       .       SVTYPE=COMPLEX;GRAPHTYPE=NESTED GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF      0:2,0,0:1,0,0:0,0,0:0,0,0:140,66,3:98,31,0:0.693548,0.879121,1:-42.3946,-70.1891,-73.8155:27.7945       .:0,0,0:0,0,0:0,0,0:0,0,0:0,0,0:0,0,0:1,1,1:-88,-88,-88:0
Cluster_560     28      .       TTTC    CTTT    .       .       SVTYPE=PH_SNPs;GRAPHTYPE=NESTED GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF      .:0,0:0,0:0,0:0,0:0,0:0,0:1,1:-60,-60:0 .:0,0:0,0:0,0:0,0:0,0:0,0:1,1:-88,-88:0
Cluster_560     43      .       C       T       .       .       SVTYPE=SNP;GRAPHTYPE=NESTED     GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF      .:0,0:0,0:0,0:0,0:0,0:0,0:1,1:-60,-60:0 .:0,0:0,0:0,0:0,0:0,0:0,0:1,1:-88,-88:0
Cluster_560     55      .       C       T       .       .       SVTYPE=SNP;GRAPHTYPE=NESTED     GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF      .:0,0:0,0:0,0:0,0:0,0:0,0:1,1:-60,-60:0 .:0,0:0,0:0,0:0,0:0,0:0,0:1,1:-88,-88:0
Cluster_560     71      .       TTGACC  CTGACC,CTGACT   .       .       SVTYPE=PH_SNPs;GRAPHTYPE=NESTED GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF      .:0,0,0:0,0,0:0,0,0:0,0,0:0,0,0:0,0,0:1,1,1:-60,-60,-60:0       .:0,0,0:0,0,0:0,0,0:0,0,0:0,0,0:0,0,0:1,1,1:-88,-88,-88:0
Cluster_560     125     .       A       T       .       .       SVTYPE=SNP;GRAPHTYPE=NESTED     GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF      .:0,0:0,0:0,0:0,0:0,0:0,0:1,1:-60,-60:0 .:0,0:0,0:0,0:0,0:0,0:0,0:1,1:-88,-88:0
Cluster_560     133     .       C       G       .       .       SVTYPE=SNP;GRAPHTYPE=NESTED     GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF      .:0,0:0,0:0,0:0,0:0,0:0,0:1,1:-60,-60:0 .:0,0:0,0:0,0:0,0:0,0:0,0:1,1:-88,-88:0
Cluster_560     148     .       TTGGCTG CTGGCTC,CTGGCTT .       .       SVTYPE=PH_SNPs;GRAPHTYPE=NESTED GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF      .:0,0,0:0,0,0:0,0,0:0,0,0:0,0,0:0,0,0:1,1,1:-60,-60,-60:0       .:0,0,0:0,0,0:0,0,0:0,0,0:0,0,0:0,0,0:1,1,1:-88,-88,-88:0
Cluster_560     175     .       CC      GT      .       .       SVTYPE=PH_SNPs;GRAPHTYPE=NESTED GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF      .:0,0:0,0:0,0:0,0:0,0:0,0:1,1:-60,-60:0 .:0,0:0,0:0,0:0,0:0,0:0,0:1,1:-88,-88:0
Cluster_560     184     .       T       C       .       .       SVTYPE=SNP;GRAPHTYPE=NESTED     GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF      .:0,0:0,0:0,0:0,0:0,0:0,0:1,1:-60,-60:0 .:0,0:0,0:0,0:0,0:0,0:0,0:1,1:-88,-88:0
Cluster_560     208     .       CGGG    GGGC    .       .       SVTYPE=PH_SNPs;GRAPHTYPE=NESTED GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF      .:0,0:0,0:0,0:0,0:0,0:0,0:1,1:-60,-60:0 .:0,0:0,0:0,0:0,0:0,0:0,0:1,1:-88,-88:0
Cluster_560     223     .       T       G       .       .       SVTYPE=SNP;GRAPHTYPE=NESTED     GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF      .:0,0:0,0:0,0:0,0:0,0:0,0:1,1:-60,-60:0 .:0,0:0,0:0,0:0,0:0,0:0,0:1,1:-88,-88:0
Cluster_560     243     .       CG      AA      .       .       SVTYPE=PH_SNPs;GRAPHTYPE=NESTED GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF      .:0,0:0,0:0,0:0,0:0,0:0,0:1,1:-60,-60:0 .:0,0:0,0:0,0:0,0:0,0:0,0:1,1:-88,-88:0
Cluster_560     256     .       AGTG    GGTA,GGTG       .       .       SVTYPE=PH_SNPs;GRAPHTYPE=NESTED GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF      .:0,0,0:0,0,0:0,0,0:0,0,0:0,0,0:0,0,0:1,1,1:-60,-60,-60:0       .:0,0,0:0,0,0:0,0,0:0,0,0:0,0,0:0,0,0:1,1,1:-88,-88,-88:0
Cluster_560     268     .       C       A       .       .       SVTYPE=SNP;GRAPHTYPE=NESTED     GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF      .:0,0:0,0:0,0:0,0:0,0:0,0:1,1:-60,-60:0 .:0,0:0,0:0,0:0,0:0,0:0,0:1,1:-88,-88:0
Cluster_560     277     .       T       C       .       .       SVTYPE=SNP;GRAPHTYPE=NESTED     GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF      .:0,0:0,0:0,0:0,0:0,0:0,0:1,1:-60,-60:0 .:0,0:0,0:0,0:0,0:0,0:0,0:1,1:-88,-88:0
Cluster_560     310     .       AGAGGAG GGAGGAT .       .       SVTYPE=PH_SNPs;GRAPHTYPE=NESTED GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF      .:0,0:0,0:0,0:0,0:0,2:0,0:1,1:-60,-60:0 .:0,0:0,0:0,0:0,0:0,0:0,0:1,1:-88,-88:0
Cluster_560     340     .       C       T       .       .       SVTYPE=SNP;GRAPHTYPE=NESTED     GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF      0:5,0:3,0:5,0:3,0:10,0:6,0:0,1:-13.395,-96.8414:83.4463 .:0,0:0,0:0,0:0,0:0,0:0,0:1,1:-88,-88:0
Cluster_560     373     .       G       A       .       .       SVTYPE=SNP;GRAPHTYPE=NESTED     GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF      0:6,0:5,0:9,0:7,0:19,0:15,0:0.333333,1:-20.0891,-110.657:90.5677        .:0,0:0,0:0,0:0,0:0,0:0,0:1,1:-88,-88:0
Cluster_560     403     .       A       G       .       .       SVTYPE=SNP;GRAPHTYPE=NESTED     GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF      0:12,0:10,0:12,0:10,0:24,0:20,0:0,1:-3.64484,-161.314:157.669   .:0,0:0,0:0,0:0,0:0,0:0,0:1,1:-88,-88:0
Cluster_560     430     .       CATTCGC CATCCGT,TATTCGC .       .       SVTYPE=PH_SNPs;GRAPHTYPE=NESTED GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF      0:10,0,0:9,0,0:11,0,0:10,0,0:32,0,0:28,0,0:0,1,1:-4.71713,-147.498,-147.498:142.781     .:0,0,0:0,0,0:0,0,0:0,0,0:0,0,0:0,0,0:1,1,1:-88,-88,-88:0
Cluster_560     430     .       CATTCGCCGGGTGACCGATACCCTCAACCGACAGGGGCTA        CATTCGCCGGGTCACTGACACCCTCAACCGGCAAGGGCTG,CATTCGCCGGGTGACCGATACCCTCAACCGTCAGGGACTG,CATTCGCCGGGTGACCGATACCCTGAACCGTCAGGGACTG      .       .       SVTYPE=PH_SNPs;GRAPHTYPE=NESTED GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF      0:8,1,5,4:6,1,4,4:9,0,4,0:8,0,4,0:66,12,32,32:54,10,28,28:0.125,0.875,0.5,0.571429:-98.8227,-192.901,-137.715,-145.667:38.8924  .:0,0,0,0:0,0,0,0:0,0,0,0:0,0,0,0:0,0,0,0:0,0,0,0:1,1,1,1:-88,-88,-88,-88:0
Cluster_560     430     .       CATTCGCCGGGTGACCGATACCCTCAACCGACAGGGGCTACGGGTGGTGGCGGTGGCGACCAAATACCTGCCGGCCCGCGAAGGCGACTACCAGCGCGCCGATGAGTCGGACCTGATCCTTGAAGG  CATCCGTCGGGTGACCGATACCCTCAACCGACAGGGG,CATCCGTCGGGTGACCGATACCCTCAACCGGCAGGGG,CATCCGTCGGGTGACCGATAGCCTCAACCGACAGGGG,CATCCGTCGGGTGACCGATAGCCTCAACCGGCAGGGG,CATTCGCCGGGTGACCGATACCCTCAACCGACAGGGG,CATTCGCCGGGTGACCGATACCCTCAACCGGCAGGGG,CATTCGCCGGGTGACCGATAGCCTCAACCGACAGGGG,CATTCGCCGGGTGACCGATAGCCTCAACCGGCAGGGG,TATTCGCCGGGTGACCGATACCCTCAACCGACAGGGG,TATTCGCCGGGTGACCGATACCCTCAACCGGCAGGGG,TATTCGCCGGGTGACCGATAGCCTCAACCGACAGGGG,TATTCGCCGGGTGACCGATAGCCTCAACCGGCAGGGG .       .       SVTYPE=COMPLEX;GRAPHTYPE=NESTED GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF      5:5,0,0,0,0,11,7,7,3,0,0,0,0:3,0,0,0,0,10,6,6,3,0,0,0,0:5,0,0,0,0,11,11,11,0,0,0,0,0:3,0,0,0,0,10,10,10,0,0,0,0,0:87,0,0,0,0,23,23,23,23,0,0,0,0:57,0,0,0,0,20,20,20,20,0,0,0,0:0.133333,1,1,1,1,0,0.333333,0.333333,0.666667,1,1,1,1:-261.469,-340.915,-340.915,-340.915,-340.915,-188.162,-239.385,-239.385,-289.456,-340.915,-340.915,-340.915,-340.915:51.223       .:0,0,0,0,0,0,0,0,0,0,0,0,0:0,0,0,0,0,0,0,0,0,0,0,0,0:0,0,0,0,0,0,0,0,0,0,0,0,0:0,0,0,0,0,0,0,0,0,0,0,0,0:0,0,0,0,0,0,0,0,0,0,0,0,0:0,0,0,0,0,0,0,0,0,0,0,0,0:1,1,1,1,1,1,1,1,1,1,1,1,1:-88,-88,-88,-88,-88,-88,-88,-88,-88,-88,-88,-88,-88:0
Cluster_560     460     .       A       G       .       .       SVTYPE=SNP;GRAPHTYPE=NESTED     GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF      0:6,0:5,0:8,0:6,0:27,0:21,0:0.25,1:-17.5891,-110.657:93.0677    .:0,0:0,0:0,0:0,0:0,0:0,0:1,1:-88,-88:0
Cluster_560     460     .       ACAGGGGCTA      GCAGGGACTG      .       .       SVTYPE=PH_SNPs;GRAPHTYPE=NESTED GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF      0:6,0:5,0:8,0:5,0:34,0:26,0:0.2,1:-16.0891,-110.657:94.5677     .:0,0:0,0:0,0:0,0:0,0:0,0:1,1:-88,-88:0
Cluster_560     490     .       CAAA    CAAG,TAAA       .       .       SVTYPE=PH_SNPs;GRAPHTYPE=NESTED GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF      0:4,1,1:2,1,0:4,0,0:1,0,0:27,8,4:13,5,0:0.166667,0.8,0.75:-34.9876,-80.1269,-85.9402:45.1394    .:0,0,0:0,0,0:0,0,0:0,0,0:0,0,0:0,0,0:1,1,1:-88,-88,-88:0
Cluster_560     555     .       G       GGCG    .       .       SVTYPE=INDEL;GRAPHTYPE=SIMPLE   GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF      .:0,0:0,0:0,0:0,0:0,0:0,0:1,1:-60,-60:0 .:0,0:0,0:0,0:0,0:0,0:0,0:1,1:-88,-88:0
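
To find other records with the same pattern, a small script can scan the VCF for sites where one sample has a called genotype and another has a null genotype with zero summed coverage. The following is a minimal sketch, not part of Pandora: the function names are hypothetical, it assumes the FORMAT layout shown above (GT, SUM_FWD_COVG and SUM_REV_COVG present, one comma-separated value per allele), and it splits on any whitespace since the fields here contain no internal spaces.

```python
from typing import Dict, Iterable, Iterator, List, Tuple


def parse_sample(format_keys: List[str], sample_field: str) -> Dict[str, str]:
    """Pair each FORMAT key with the matching colon-separated sample value."""
    return dict(zip(format_keys, sample_field.split(":")))


def total_coverage(sample: Dict[str, str]) -> float:
    """Sum SUM_FWD_COVG and SUM_REV_COVG over all alleles of one sample."""
    total = 0.0
    for key in ("SUM_FWD_COVG", "SUM_REV_COVG"):
        total += sum(float(x) for x in sample[key].split(","))
    return total


def discordant_records(lines: Iterable[str]) -> Iterator[Tuple[str, int, List[str]]]:
    """Yield (contig, position, genotypes) for records where at least one
    sample is genotyped and at least one other sample has GT '.' with
    zero summed coverage on both strands."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and VCF header lines
        fields = line.split()
        chrom, pos = fields[0], int(fields[1])
        format_keys = fields[8].split(":")
        samples = [parse_sample(format_keys, f) for f in fields[9:]]
        called = [s["GT"] != "." for s in samples]
        empty = [s["GT"] == "." and total_coverage(s) == 0 for s in samples]
        if any(called) and any(empty):
            yield chrom, pos, [s["GT"] for s in samples]
```

Passing an open file handle for the multi-sample VCF to `discordant_records` and counting the results per contig would give a rough list of genes that look present in one isolate and absent in the other.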

Thoughts?

De novo assemblies here: /hps/nobackup2/iqbal/projects/pandora/klebs/neonate/data/KpST17_Norway_20190617/contigs/patient-pairs

Pandora output here: /hps/nobackup/iqbal/leandro/klebs_neonate_leah/pandora_compare_results

leoisl commented 4 years ago

Thanks for the description! I think the best approach here is to go into debug mode and understand why we see this drop in coverage.

Cheers.

leoisl commented 4 years ago

For this issue and https://github.com/rmcolq/pandora/issues/209 , before diving into debugging, I was wondering if we could get the expected results by changing some parameters. With the --illumina parameter, the error rate defaults to 0.001, which could be too low. I increased it to 0.01, but there was no effect on these two genes.

Any other parameterization ideas before diving into debugging? Note that this is strictly a mapping issue. It is also worth noting that Leah has seen this in many genes, and it is the main issue she has right now. I am very interested in it because it seems we are undermapping reads (not sure if this is also true for ONT reads), and are thus making fewer calls than we could. It seems to me that fixing this could push our recall in the 4-way analysis way up.

What do you think?

rmcolq commented 4 years ago

This doesn't look like a simple bug to me and, as you say, is likely some combination of parameter/algorithm effects. Worth noting that I have some built-in overrides, which mean that your command-line error rate is not allowed to be higher than 0.1 with the --illumina flag. The --min_cluster_size is likely to make more of a difference for increasing detection of genes, but will also increase FPs.

I also think this could take weeks to debug, and lead to even more code changes, so I'm keen to get your existing stuff merged in first and the results we need. I think improving our overall recall in the 4-way is an optimization for later.