luntergroup / octopus

Bayesian haplotype-based mutation calling
MIT License
302 stars 38 forks source link

Polyclone mode, VCF output, phasing #90

Closed mherold1 closed 4 years ago

mherold1 commented 4 years ago

Hello,

thank you for your efforts in providing this tool. I have a question that I could not solve from the extensive documentation and thought maybe you could help me out.

My use case: I have mixtures of bacterial strains and I would like to discriminate the strains using sequencing data for several marker genes that have been amplified. I thought I could use this tool (with the -C polyclone -P 1 option) to identify variants on these marker genes and reconstruct which alleles are likely to be present in the mixture. The examples shown below are from a test sample with a known mix of 3 strains, where strains can share the same allele (example1) or have distinct alleles (example2) of a marker gene. This is the command I used for the examples:

octopus -R ref.fa -I sample.bam -C polyclone -P 1 \
--threads 3 --snp-heterozygosity 0.1 -o octopus.sample.vcf \
--allow-octopus-duplicates --allow-secondary-alignments --allow-supplementary-alignments \
--dont-protect-reference-haplotype  --bamout octopus.sample.bam

My questions: I have tested different parameter settings, but I am still not sure which ones would be best suited to improve results, as overall predictions work but are not really accurate. What parameters do I need to set to discriminate between highly similar alleles and enforce strict read phasing, --phasing-level conservative? How can I extract the inferred haplotypes from the vcf or would I need to assemble reads in the bam-output with the same "hi" tag?

I am also not sure if I understand the vcf output correctly. I found a previous issue that helped me a bit (https://github.com/luntergroup/octopus/issues/32) but this was not specific for ploidy 1. I assume that from the GT field I can reconstruct the genotypes as the order is consistent within the same phase group. So in the following case from the output vcf (example1):

gene1    62      .       A       G       3076.53 PASS    AC=1;AN=2;DP=1137;MQ=255;NS=1   GT:GQ:DP:MQ:PS:PQ:FT    0|1:3077:1137:255:62:100:PASS
gene1    158     .       G       A       3076.53 PASS    AC=1;AN=2;DP=1509;MQ=255;NS=1   GT:GQ:DP:MQ:PS:PQ:FT    0|1:3077:1509:255:62:100:PASS
gene1    162     .       C       T       3076.53 PASS    AC=1;AN=2;DP=1502;MQ=255;NS=1   GT:GQ:DP:MQ:PS:PQ:FT    1|0:3077:1502:255:62:100:PASS
gene1    419     .       T       C       3076.53 PASS    AC=1;AN=1;DP=1069;MQ=255;NS=1   GT:GQ:DP:MQ:PS:PQ:FT    1:15:1069:255:419:100:PASS
gene1    434     .       C       T       3076.53 PASS    AC=1;AN=1;DP=1020;MQ=255;NS=1   GT:GQ:DP:MQ:PS:PQ:FT    1:15:1020:255:419:100:PASS

There is phase group 62 (first 3 SNVs:62, 158, 162) and PS 419 with 2 SNVs (419, 434) that would give me 2 distinct alleles regarding these 5 variants: allele1: 1-1-0-1-1 allele2: 0-0-1-1-1 This is already a bit confusing for the haplotype reconstruction as only one of the predicted haplotypes (green) has the last 2 SNVs (the lower 2 tracks are simulated reads for the correct alleles): igv_octopus_mix_exampel1

A more complex example with a different gene (with 3 distinct alleles):

gene2    71      .       T       C       1909.05 PASS    AC=1;AN=3;DP=233;MQ=37;NS=1     GT:GQ:DP:MQ:PS:PQ:FT    0|1|0:140:233:37:53:100:PASS
gene2    89      .       A       G       3076.53 PASS    AC=1;AN=3;DP=320;MQ=38;NS=1     GT:GQ:DP:MQ:PS:PQ:FT    1|0|0:140:320:38:53:100:PASS
gene2    92      .       T       C       3076.53 PASS    AC=1;AN=3;DP=330;MQ=38;NS=1     GT:GQ:DP:MQ:PS:PQ:FT    1|0|0:119:330:38:53:100:PASS
gene2    101     .       C       T       3076.53 PASS    AC=1;AN=3;DP=366;MQ=38;NS=1     GT:GQ:DP:MQ:PS:PQ:FT    1|0|0:140:366:38:53:100:PASS
gene2    107     .       T       C       3076.53 PASS    AC=1;AN=3;DP=383;MQ=38;NS=1     GT:GQ:DP:MQ:PS:PQ:FT    1|0|0:3077:383:38:107:100:PASS
gene2    119     .       C       T       3076.53 PASS    AC=2;AN=3;DP=479;MQ=39;NS=1     GT:GQ:DP:MQ:PS:PQ:FT    0|1|1:3077:479:39:107:100:PASS
gene2    179     .       G       A       3076.53 PASS    AC=1;AN=3;DP=729;MQ=39;NS=1     GT:GQ:DP:MQ:PS:PQ:FT    1|0|0:3077:729:39:179:100:PASS
gene2    251     .       T       C       3076.53 PASS    AC=2;AN=3;DP=890;MQ=39;NS=1     GT:GQ:DP:MQ:PS:PQ:FT    1|1|0:3077:890:39:251:100:PASS
gene2    266     .       T       C       3076.53 PASS    AC=1;AN=3;DP=922;MQ=38;NS=1     GT:GQ:DP:MQ:PS:PQ:FT    0|0|1:3077:922:38:251:100:PASS
gene2    281     .       T       C       3076.53 PASS    AC=2;AN=3;DP=857;MQ=38;NS=1     GT:GQ:DP:MQ:PS:PQ:FT    1|1|0:3077:857:38:251:100:PASS
gene2    287     .       A       T       3076.53 PASS    AC=2;AN=3;DP=868;MQ=38;NS=1     GT:GQ:DP:MQ:PS:PQ:FT    1|1|0:3077:868:38:251:100:PASS
gene2    299     .       T       C       3076.53 PASS    AC=2;AN=3;DP=851;MQ=38;NS=1     GT:GQ:DP:MQ:PS:PQ:FT    1|1|0:3077:851:38:299:100:PASS
gene2    308     .       G       A       3076.53 PASS    AC=2;AN=3;DP=838;MQ=38;NS=1     GT:GQ:DP:MQ:PS:PQ:FT    1|1|0:3077:838:38:299:100:PASS
gene2    314     .       T       C,G     1939.43 PASS    AC=1,1;AN=3;DP=869;MQ=38;NS=1   GT:GQ:DP:MQ:PS:PQ:FT    1|2|0:1939:869:38:299:100:PASS
gene2    317     .       G       A       1912.95 PASS    AC=1;AN=3;DP=905;MQ=38;NS=1     GT:GQ:DP:MQ:PS:PQ:FT    1|0|0:1913:905:38:299:100:PASS
gene2    335     .       A       G       3076.53 PASS    AC=2;AN=3;DP=950;MQ=39;NS=1     GT:GQ:DP:MQ:PS:PQ:FT    1|1|0:1914:950:39:299:100:PASS
gene2    336     .       C       T       3076.53 PASS    AC=1;AN=3;DP=947;MQ=39;NS=1     GT:GQ:DP:MQ:PS:PQ:FT    0|0|1:1918:947:39:299:100:PASS
gene2    350     .       A       G       1733.3  PASS    AC=1;AN=3;DP=907;MQ=39;NS=1     GT:GQ:DP:MQ:PS:PQ:FT    1|0|0:1733:907:39:299:100:PASS
gene2    380     .       T       C       3076.53 PASS    AC=1;AN=3;DP=816;MQ=40;NS=1     GT:GQ:DP:MQ:PS:PQ:FT    0|0|1:1583:816:40:299:100:PASS
gene2    402     .       A       G       1555.79 PASS    AC=1;AN=3;DP=817;MQ=40;NS=1     GT:GQ:DP:MQ:PS:PQ:FT    1|0|0:1556:817:40:299:100:PASS
gene2    440     .       G       C       3076.53 PASS    AC=2;AN=3;DP=852;MQ=41;NS=1     GT:GQ:DP:MQ:PS:PQ:FT    1|1|0:3077:852:41:440:100:PASS
gene2    554     .       C       T       3076.53 PASS    AC=2;AN=3;DP=229;MQ=40;NS=1     GT:GQ:DP:MQ:PS:PQ:FT    0|1|1:529:229:40:440:100:PASS

Here I have several phase groups and tried to "reconstruct" alleles similarly as in example1, in square brackets I noted to which correct phenotype (simulated read tracks) a combination corresponded to: first 5 SNVs in PS 53: 1-0-1-1-1 [1st, green] 1-0-1-0-0 (wrong) 0-1-0-0-0 [2nd and 3rd, blue in the beginning] 2 SNVs in PS 107: 1-0 [1st]
1-0 [1st]
0-1 [2nd and 3rd] 9 SNVs in PS 179: 1-0-1-0-0-0-0-0-0 [1st] 0-1-0-1-1-1-1-1-1 [3rd] 0-1-0-1-1-1-1-2-0 [2nd] 5 SNVs in PS 335: 0-1-0-1-0 [1st] 1-0-0-0-0 [2nd] 1-0-1-0-1 [3rd] last SNVs 2 in PS 440: 1-1 [1st] 0-0 [2nd and 3rd] 0-1 (wrong, but there are actually several reads with this combination)

igv_octopus_mix_example2

Why are the same genotypes within a phase group collapsed sometimes (example1, PS:419) but not for others (example2 PS:107)?

output of octopus --version:

octopus version 0.6.3-beta (develop 110c2115)
Target: x86_64 Linux 4.15.0-65-generic
SIMD extension: AVX2
Compiler: GNU 7.4.0
Boost: 1_65_1

Thanks for your help.

dancooke commented 4 years ago

Just to clarify some terminology, as you appear to be using "allele" and "haplotype" inconsistently, when referring to variant calls:

We can only call haplotypes if we can correctly phase alleles at heterozygous variant sites. It's therefore possible to correctly infer alleles and incorrectly infer haplotypes (i.e. phase errors), but not vice versa.

Your interpretation of the phase sets is correct. However, I don't quite understand your comment

This is already a bit confusing for the haplotype reconstruction as only one of the predicted haplotypes (green) has the last 2 SNVs (the lower 2 tracks are simulated reads for the correct alleles)

If I'm reading these pileups correctly (each track corresponds to a strain?), then it appears to me that Octopus has called this region correctly, however, it's certainly possible that the following could have called instead

gene1    62      .       A       G       3076.53 PASS    AC=1;AN=2;DP=1137;MQ=255;NS=1   GT:GQ:DP:MQ:PS:PQ:FT    0|1:3077:1137:255:62:100:PASS
gene1    158     .       G       A       3076.53 PASS    AC=1;AN=2;DP=1509;MQ=255;NS=1   GT:GQ:DP:MQ:PS:PQ:FT    0|1:3077:1509:255:62:100:PASS
gene1    162     .       C       T       3076.53 PASS    AC=1;AN=2;DP=1502;MQ=255;NS=1   GT:GQ:DP:MQ:PS:PQ:FT    1|0:3077:1502:255:62:100:PASS
gene1    419     .       T       C       3076.53 PASS    AC=1;AN=1;DP=1069;MQ=255;NS=1   GT:GQ:DP:MQ:PS:PQ:FT    1|1:15:1069:255:62:100:PASS
gene1    434     .       C       T       3076.53 PASS    AC=1;AN=1;DP=1020;MQ=255;NS=1   GT:GQ:DP:MQ:PS:PQ:FT    1|1:15:1020:255:62:100:PASS

The reason that it wasn't called this way is due to default heuristics in the algorithm that prevent attempting to construct haplotypes where there are clearly no reads able to inform phasing. This heuristic is applied before the downstream (w.r.t the current calling window) sites have been called, so it escapes the algorithm that these sites are actually homozygous and could have been included. You can possibly achieve the calls I proposed by adding the --backtrack-level AGGRESSIVE option, as this will force the algorithm to try to join previously 'called' haplotypes.

I'm struggling to parse your second example. I don't see follow the calls Octopus makes with what I'm seeing on the pileups, plus Octopus is only calling two phase sets here, not 5 as you suggest. Did you post the right example?

Coming back to your question...

I have tested different parameter settings, but I am still not sure which ones would be best suited to improve results, as overall predictions work but are not really accurate. What parameters do I need to set to discriminate between highly similar alleles and enforce strict read phasing, --phasing-level conservative?

What exactly do you mean by "not really accurate", are you seeing false positive or negative variant calls, or do you believe that the phase calls are incorrect? I don't understand what you mean by "to discriminate between highly similar alleles and enforce strict read phasing", what do you mean by "similar" and "strict read phasing"? Octopus phases alleles using genotype posteriors, which are informed by genotype likelihoods (i.e. read information) and prior information, but the prior is only weakly informative for the polyclone model, so is unlikely to outweigh the information from the genotype likelihoods. If you simply want longer phase sets then there are numerous things to try:

  1. Set --lagging-level AGGRESSIVE
  2. Set --backtrack-level AGGRESSIVE
  3. Set --extension-level AGGRESSIVE
  4. Set --max-haplotypes 400, or an even larger value (default is 200).
  5. Set --min-phase-score 5, or lower (default is 10).

But ultimately the phase length will be restricted by the information provided by the reads.

How can I extract the inferred haplotypes from the vcf or would I need to assemble reads in the bam-output with the same "hi" tag?

In what format do you want the inferred haplotypes (i.e. what do you mean by "extract")? If you simply want the sequence then you could try using something like bcftools consensus, but you'll need to manually split the genotypes up into composing haplotypes. It wouldn't make sense to try to "assemble" reads using the hi tag in this way since this is precisely what Octopus has already done. The hi tag tell you which haplotype Octopus thinks the read originated from (if it is not ambiguous).

mherold1 commented 4 years ago

Thanks for the quick reply!

Just to clarify some terminology, as you appear to be using "allele" and "haplotype" inconsistently

Sorry for the confusion, I was referring to the complete gene by "allele", so a combination of several variants. So I should refer to this as haplotype? So a phase set would be a combination of variants and alleles that can be linked through co-occurence on reads?

If I'm reading these pileups correctly (each track corresponds to a strain?)

In the pileups the first track is sequencing data (i.e. the mix of strains) colored by hi tag (so the octopus output), the following tracks are simulated pileups corresponding to the correct haplotypes (?) of the strains that are present (so the ground truth for the haplotypes that I would like to reconstruct)

then it appears to me that Octopus has called this region correctly,

Yes for example1 the region has been called correctly except for the homozygous variants no. 4,5 in the 3 strains as you mentioned. I tried adding the backtrack-level option. Leading to the following result:

gene1    62      .       A       G       3076.53 PASS    AC=1;AN=3;DP=637;MQ=41;NS=1     GT:GQ:DP:MQ:PS:PQ:FT    0|0|1:3077:637:41:62:100:PASS
gene1    158     .       G       A       3076.53 PASS    AC=2;AN=3;DP=1349;MQ=41;NS=1    GT:GQ:DP:MQ:PS:PQ:FT    1|0|1:3077:1349:41:62:100:PASS
gene1    162     .       C       T       3076.53 PASS    AC=1;AN=3;DP=1340;MQ=41;NS=1    GT:GQ:DP:MQ:PS:PQ:FT    0|1|0:3077:1340:41:62:100:PASS
gene1    419     .       T       C       3076.53 PASS    AC=3;AN=3;DP=984;MQ=41;NS=1     GT:GQ:DP:MQ:PS:PQ:FT    1|1|1:3077:984:41:62:100:PASS
gene1    434     .       C       T       3076.53 PASS    AC=3;AN=3;DP=958;MQ=41;NS=1     GT:GQ:DP:MQ:PS:PQ:FT    1|1|1:3040:958:41:62:100:PASS

3 genotypes: 0-1-0-1-1 (false positive) 0-0-1-1-1 (correct) 1-1-0-1-1 (correct)

In the bam output I see 4 different hi-tags

I'm struggling to parse your second example. I don't see follow the calls Octopus makes with what I'm seeing on the pileups, plus Octopus is only calling two phase sets here, not 5 as you suggest. Did you post the right example?

Sorry, I indeed posted the wrong vcf excerpt for example2, I edited the original post.

What exactly do you mean by "not really accurate", are you seeing false positive or negative variant calls, or do you believe that the phase calls are incorrect? I don't understand what you mean by "to discriminate between highly similar alleles and enforce strict read phasing", what do you mean by "similar" and "strict read phasing"?

The variant calls are accurate except for the genotype information, where I often see false positive, sometimes false negative genotypes inferred (so number of genotype and order of alleles within phase sets). I would like to have longer phase sets but also be more stringent as in genotypes/haplotypes are only inferred if the same alleles appear together on a large number of reads. However, then these alleles should be used to discriminate between haplotypes stringently. In this example when reads cover and have the same allele on 2 variants they inferred to come from "haplotype brown", but when they cover only 1 of these 2 variants they are assigned to blue: example1

While it is not a great example, e.g., in the pileup for example2 (see original post) there is an inconsistency for the "blue haplotype", where some of the reads in the middle do not share the same allele at SNV-pos179 (no. 7).

I will provide a few more examples as I test different settings.

In what format do you want the inferred haplotypes (i.e. what do you mean by "extract")

I meant extracting variant calls with the alleles corresponding to each haplotype. For example 1 this would be straight-forward simply looking at the GT field as the order is consistent. However, for example2 I could not know the order in which alleles in the GT field belong together across different phase sets. This seems to be different for "hi" fields that link reads with specific alleles across these phase sets, but not always correctly (e.g. for example2 "green hi" on the left and "red hi" in the middle part should be the same haplotype).

dancooke commented 4 years ago

I was referring to the complete gene by "allele", so a combination of several variants. So I should refer to this as haplotype?

Yes, well actually a combination of alleles, which are instances of a variants.

So a phase set would be a combination of variants and alleles that can be linked through co-occurence on reads?

You can view it this way, although the truth is a little more complicated since you're ignoring prior information that could be used to determine the phasing of alleles. The prior model used by the polyclone model is relatively uninformative (compared with read information) in this regard. However, other prior models can be more informative and even be sufficient to determine phase (e.g. the trio prior model, since this essentially phases by decent).

Yes for example1 the region has been called correctly except for the homozygous variants no. 4,5 in the 3 strains as you mentioned. I tried adding the backtrack-level option. Leading to the following result:

The two homozygous variants are not called incorrectly, it's just the algorithm didn't include them in the same phase set as the other variants. It's important to be aware that the assertions about the number of haplotypes inferred is local to the phase sets; you can view phase sets as independent inferences. So although the algorithm inferred two haplotypes in one phase set, it's still perfectly valid that only one can be found in another. The maximum number of haplotypes inferred (over all phase sets) provides a lower bound on the number of (inferred) unique strains in your sample. However, only if the phase set covered the entire contig could you make a claim about the actual number.

3 genotypes: 0-1-0-1-1 (false positive) 0-0-1-1-1 (correct) 1-1-0-1-1 (correct)

On closer inspection if your pileups, this actually does appear correct. In addition to the two simulated strains, I can see two reads in the mixture (top and second bottom) that appear to support (0-1-0-1-1). How did you generate these simulated reads?

Sorry, I indeed posted the wrong vcf excerpt for example2, I edited the original post.

I can see how these match up with the pileups now but not your breakdown...

first 5 SNVs in PS 53: 1-0-1-1-1 [1st, green] 1-0-1-0-0 (wrong) 0-1-0-0-0 [2nd and 3rd, blue in the beginning]

That's not what is reported in the VCF, in this notation it would be

0-1-1-1-1 [1st] 1-0-0-0-0 [2nd] 0-0-0-0-0 [3rd]

I can't tell if this is correct or not since your mixed reads don't cover the first variant. However, it appears to me that you have strong read support in the mixture for haplotypes not present any of the "truth" strains. For example, there are at least 12 reads supporting a haplotype with [89:A, 92:T, 101:T, 107:C] (i.e. [REF, REF, ALT, ALT]), but this haplotype doesn't appear in any strain. The closest would be [89:G, 92:C, 101:T, 107:C] (i.e. [ALT, ALT, ALT, ALT], the 'green' haplotype). This - and the unexpected reads in example 1 - strongly suggests to me something is going wrong in your simulated mixture, which is probably confusing Octopus. Note that the maximum number of called haplotypes is capped by --max-clones, so if you actually have more haplotypes than this in your data then your results will be wrong. In principle, the default (3) should be sufficient for your purpose, but I strongly suspect that you've inadvertently simulated more haplotypes than this.

In this example when reads cover and have the same allele on 2 variants they inferred to come from "haplotype brown", but when they cover only 1 of these 2 variants they are assigned to blue

Most likely is that the haplotype index for blue is used denote an ambiguous assignment (i.e. two or more of the called haplotype are equally probable). This makes sense if you look at the pileup, the brown reads cover the A SNV and the G SNV, but the blue reads only manage to cover the A SNV, I'd bet that there was a called haplotype that included the reference allele at the A SNV + the G SNV (although there are no reads supporting this haplotype in this pileup), and therefore the likelihood of these two haplotypes would be equal for the blue reads.

I could not know the order in which alleles in the GT field belong together across different phase sets.

This is exactly the purpose of phase sets, to inform you when you can and can't assume consistent ordering. If you want to be able to phase these variants then you'll tune the algorithm to achieve larger phase lengths, but ultimately you will be limited by your experimental design, in which case you'll need to consider alternatives, such as. longer or linked reads (e.g. 10X).

One more point, Octopus can currently only report fully phased phase sets, it's possible that partial phasing could be achieved in cases of three or more haplotypes but not full phasing. This is something I'm hoping to address over the next few months.

mherold1 commented 4 years ago

3 genotypes: 0-1-0-1-1 (false positive) 0-0-1-1-1 (correct) 1-1-0-1-1 (correct)

On closer inspection if your pileups, this actually does appear correct. In addition to the two simulated strains, I can see two reads in the mixture (top and second bottom) that appear to support (0-1-0-1-1). How did you generate these simulated reads?

The reads in the hi-colored pileups are not simulated, so they could include sequencing reads that actually support these "incorrect" inferences e.g. because of sequencing errors. The other pileups are simulated reads that should merely illustrate which variants to expect for the correct haplotypes.

One aspect I would like to improve here is to make the genotype inference more stringent here so that these reads do not lead to a false positive prediction. I should have linked to the full pileups instead of adding screenshots with which I could not show the full pileups. So these reads that support these seemingly wrong predictions are very few in the full context. I think another aspect that is important here is more is also a more stringent read pre-processing as these "errors" seem to often appear in the end and beginning of the reads

I can see how these match up with the pileups now but not your breakdown... That's not what is reported in the VCF, in this notation it would be 0-1-1-1-1 [1st] 1-0-0-0-0 [2nd] 0-0-0-0-0 [3rd]

I might have overwritten the VCF with a different run by accident, so you are correct that my breakdown does not match the VCF output. However, 0-0-0-0-0 should also not be detected.

dancooke commented 4 years ago

The reads in the hi-colored pileups are not simulated, so they could include sequencing reads that actually support these "incorrect" inferences e.g. because of sequencing errors.

Can you provide a BAM file of the mixture for these two genes? I'm happy to take a closer look, but my prior on two independent reference-reverting sequencing errors occurring in otherwise clean reads is pretty small.

mherold1 commented 4 years ago

Can I send the files via email?

dancooke commented 4 years ago

Sure, if it's small enough to get past the email servers... dcooke@well.ox.ac.uk.

dancooke commented 4 years ago

Ok, I've had a look at the BAM and I'm pretty happy with what Octopus is doing here. There is very strong read support for the haplotypes that you claim should not be present.

Looking at the first example (glnA), Octopus calls the following:

$ octopus -R all_bactag.fa -I MLST-2T-Spiked-Mix.sorted.bam -C polyclone -T glnA --max-clones 5 --disable-downsampling -o glnA.vcf --bamout glnA.bam --backtrack AGGRESSIVE --max-haplotypes 400
$ bcftools view -H glnA.vcf
glnA    62  .   A   G   33776   PASS    AC=2;AN=4;DP=680;MQ=41;NS=1 GT:GQ:DP:MQ:PS:PQ:FT    0|0|1|1:10000:680:41:62:100:PASS
glnA    158 .   G   A   54696.8 PASS    AC=2;AN=4;DP=3167;MQ=41;NS=1    GT:GQ:DP:MQ:PS:PQ:FT    1|0|1|0:10000:3167:41:62:100:PASS
glnA    162 .   C   T   30438.2 PASS    AC=2;AN=4;DP=3203;MQ=41;NS=1    GT:GQ:DP:MQ:PS:PQ:FT    0|1|0|1:10000:3203:41:62:100:PASS
glnA    419 .   T   C   100000  PASS    AC=4;AN=4;DP=3075;MQ=41;NS=1    GT:GQ:DP:MQ:PS:PQ:FT    1|1|1|1:10000:3075:41:62:100:PASS
glnA    434 .   C   T   100000  PASS    AC=4;AN=4;DP=2720;MQ=41;NS=1    GT:GQ:DP:MQ:PS:PQ:FT    1|1|1|1:10000:2720:41:62:100:PASS

So, Octopus is claiming 4 haplotypes. Now lets look at the evidence BAM. To make it a bit easier to show the different haplotypes I'll split the evidence BAM by haplotype support using the bundled Python script:

$ split_realigned_bam.py -b glnA.bam -o glnA.split

The resulting split BAMs look like:

glnA

Note that the last track contains the ambiguous reads. The approximate depth for each of the first four tracts is 180x, 715x, 500x, and 80x. This is pretty convincing imho.

Now turning to glyA:

$ octopus -R all_bactag.fa -I MLST-2T-Spiked-Mix.sorted.bam -C polyclone -T glyA --max-clones 12 --disable-downsampling -o glnA.vcf --bamout glnA.bam --backtrack AGGRESSIVE --max-haplotypes 400
$ split_realigned_bam.py -b glyA.bam -o glyA.split
$ bcftools view -H glyA.vcf
glyA    53  .   T   C   26086.5 PASS    AC=4;AN=10;DP=145;MQ=36;NS=1    GT:GQ:DP:MQ:PS:PQ:FT    1|1|1|1|0|0|0|0|0|0:968:145:36:53:100:PASS
glyA    71  .   T   C   22873.8 PASS    AC=6;AN=10;DP=279;MQ=37;NS=1    GT:GQ:DP:MQ:PS:PQ:FT    0|0|0|0|1|1|1|1|1|1:662:279:37:53:100:PASS
glyA    89  .   A   G   26086.5 PASS    AC=4;AN=10;DP=373;MQ=38;NS=1    GT:GQ:DP:MQ:PS:PQ:FT    1|1|1|1|0|0|0|0|0|0:968:373:38:53:100:PASS
glyA    92  .   T   C   26086.5 PASS    AC=4;AN=10;DP=384;MQ=38;NS=1    GT:GQ:DP:MQ:PS:PQ:FT    1|1|1|1|0|0|0|0|0|0:968:384:38:53:100:PASS
glyA    101 .   C   T   26086.5 PASS    AC=5;AN=10;DP=420;MQ=38;NS=1    GT:GQ:DP:MQ:PS:PQ:FT    1|1|1|1|0|0|0|0|0|1:662:420:38:53:100:PASS
glyA    107 .   T   C   26086.5 PASS    AC=5;AN=10;DP=439;MQ=38;NS=1    GT:GQ:DP:MQ:PS:PQ:FT    1|1|1|1|0|0|0|0|0|1:662:439:38:53:100:PASS
glyA    119 .   C   T   22873.8 PASS    AC=5;AN=10;DP=574;MQ=39;NS=1    GT:GQ:DP:MQ:PS:PQ:FT    0|0|0|0|1|1|1|1|1|0:968:574:39:53:100:PASS
glyA    179 .   G   A   25642.5 PASS    AC=4;AN=10;DP=819;MQ=39;NS=1    GT:GQ:DP:MQ:PS:PQ:FT    1|1|1|0|0|0|0|0|0|1:620:819:39:53:100:PASS
glyA    251 .   T   C   22873.8 PASS    AC=5;AN=10;DP=962;MQ=39;NS=1    GT:GQ:DP:MQ:PS:PQ:FT    0|0|0|1|1|1|1|1|0|0:6093:962:39:53:100:PASS
glyA    266 .   T   C   39313.4 PASS    AC=5;AN=10;DP=993;MQ=38;NS=1    GT:GQ:DP:MQ:PS:PQ:FT    1|1|1|0|0|0|0|0|1|1:1381:993:38:53:100:PASS
glyA    281 .   T   C   22873.8 PASS    AC=5;AN=10;DP=912;MQ=38;NS=1    GT:GQ:DP:MQ:PS:PQ:FT    0|0|0|1|1|1|1|1|0|0:1383:912:38:53:100:PASS
glyA    287 .   A   T   22873.8 PASS    AC=5;AN=10;DP=927;MQ=38;NS=1    GT:GQ:DP:MQ:PS:PQ:FT    0|0|0|1|1|1|1|1|0|0:1381:927:38:53:100:PASS
glyA    299 .   T   C   22873.8 PASS    AC=5;AN=10;DP=914;MQ=38;NS=1    GT:GQ:DP:MQ:PS:PQ:FT    0|0|0|1|1|1|1|1|0|0:5320:914:38:53:100:PASS
glyA    308 .   G   A   22873.8 PASS    AC=5;AN=10;DP=903;MQ=38;NS=1    GT:GQ:DP:MQ:PS:PQ:FT    0|0|0|1|1|1|1|1|0|0:5320:903:38:53:100:PASS
glyA    314 .   T   C,G 22873.8 PASS    AC=1,4;AN=10;DP=934;MQ=38;NS=1  GT:GQ:DP:MQ:PS:PQ:FT    0|0|0|2|1|2|2|2|0|0:4838:934:38:53:100:PASS
glyA    317 .   G   A   22873.8 PASS    AC=1;AN=10;DP=979;MQ=38;NS=1    GT:GQ:DP:MQ:PS:PQ:FT    0|0|0|0|1|0|0|0|0|0:4838:979:38:53:100:PASS
glyA    335 .   A   G   22873.8 PASS    AC=5;AN=10;DP=1047;MQ=39;NS=1   GT:GQ:DP:MQ:PS:PQ:FT    0|0|1|1|1|0|1|1|0|0:1381:1047:39:53:100:PASS
glyA    336 .   C   T   39050.2 PASS    AC=5;AN=10;DP=1044;MQ=39;NS=1   GT:GQ:DP:MQ:PS:PQ:FT    1|1|0|0|0|1|0|0|1|1:1381:1044:39:53:100:PASS
glyA    350 .   A   G   22873.8 PASS    AC=1;AN=10;DP=1024;MQ=39;NS=1   GT:GQ:DP:MQ:PS:PQ:FT    0|0|0|0|1|0|0|0|0|0:4838:1024:39:53:100:PASS
glyA    380 .   T   C   36434.6 PASS    AC=5;AN=10;DP=946;MQ=40;NS=1    GT:GQ:DP:MQ:PS:PQ:FT    1|1|1|0|0|0|0|0|1|1:8069:946:40:53:100:PASS
glyA    402 .   A   G   22873.8 PASS    AC=1;AN=10;DP=955;MQ=40;NS=1    GT:GQ:DP:MQ:PS:PQ:FT    0|0|0|0|1|0|0|0|0|0:4838:955:40:53:100:PASS
glyA    440 .   G   C   30368.2 PASS    AC=5;AN=10;DP=965;MQ=41;NS=1    GT:GQ:DP:MQ:PS:PQ:FT    1|0|1|0|0|0|1|0|1|1:1335:965:41:53:100:PASS
glyA    554 .   C   T   24199   PASS    AC=6;AN=10;DP=234;MQ=40;NS=1    GT:GQ:DP:MQ:PS:PQ:FT    0|0|1|1|1|0|1|0|1|1:502:234:40:53:100:PASS

Note that I set --max-clones 12 but only 10 haplotypes were inferred. The split BAMs look like

glyA

Once again, it's pretty convincing. There's no way this is all due to sequencing errors - something else is going on in your data that is resulting in more haplotypes than you expect.

I've made some improvements to the polyclone calling model since looking into this example - it should be faster now. I'd recommend checking out the latest develop branch.

dancooke commented 4 years ago

Having starred at the second set of pileups a bit longer it appeared that there are some more haplotypes in this data (i.e. more than 10). I found a bug that was causing Octopus to fail when the number of clones went above 10, which I've now fixed in b7a5a3e07a13c241ab7a5c545947d52ec38016c5. I've just re-run with --max-clones 14 and got calls with 14 haplotypes, so I would guess that there is at-least this number.

dancooke commented 4 years ago

It appears the magic number is 25!

$ octopus -R all_bactag.fa -I MLST-2T-Spiked-Mix.sorted.bam -C polyclone -T glyA --max-clones 30 --disable-downsampling -o glyA.vcf --bamout glyA.bam --backtrack AGGRESSIVE --max-haplotypes 1000 --max-genotypes 10000
bcftools view -H glyA.vcf
glyA    53  .   T   C   27850.7 PASS    AC=14;AN=25;DP=145;MQ=36;NS=1   GT:GQ:DP:MQ:PS:PQ:FT    1|1|1|1|1|1|1|1|1|1|1|1|1|1|0|0|0|0|0|0|0|0|0|0|0:10000:145:36:53:100:PASS
glyA    71  .   T   C   24638   PASS    AC=12;AN=25;DP=279;MQ=37;NS=1   GT:GQ:DP:MQ:PS:PQ:FT    1|1|0|0|0|0|0|0|0|0|0|0|0|0|1|1|1|1|1|1|1|1|1|1|0:3:279:37:53:100:PASS
glyA    89  .   A   G   27850.7 PASS    AC=12;AN=25;DP=373;MQ=38;NS=1   GT:GQ:DP:MQ:PS:PQ:FT    0|0|1|1|1|1|1|1|1|1|1|1|1|1|0|0|0|0|0|0|0|0|0|0|0:198:373:38:53:100:PASS
glyA    92  .   T   C   27850.7 PASS    AC=12;AN=25;DP=384;MQ=38;NS=1   GT:GQ:DP:MQ:PS:PQ:FT    0|0|1|1|1|1|1|1|1|1|1|1|1|1|0|0|0|0|0|0|0|0|0|0|0:198:384:38:53:100:PASS
glyA    101 .   C   T   27850.7 PASS    AC=14;AN=25;DP=420;MQ=38;NS=1   GT:GQ:DP:MQ:PS:PQ:FT    1|1|1|1|1|1|1|1|1|1|1|1|1|1|0|0|0|0|0|0|0|0|0|0|0:10000:420:38:53:100:PASS
glyA    107 .   T   C   27850.7 PASS    AC=14;AN=25;DP=439;MQ=38;NS=1   GT:GQ:DP:MQ:PS:PQ:FT    1|1|1|1|1|1|1|1|1|1|1|1|1|1|0|0|0|0|0|0|0|0|0|0|0:10000:439:38:53:100:PASS
glyA    119 .   C   T   24638   PASS    AC=10;AN=25;DP=574;MQ=39;NS=1   GT:GQ:DP:MQ:PS:PQ:FT    0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|1|1|1|1|1|1|1|1|1|0:10000:574:39:53:100:PASS
glyA    179 .   G   A   27850.7 PASS    AC=12;AN=25;DP=819;MQ=39;NS=1   GT:GQ:DP:MQ:PS:PQ:FT    0|0|1|1|1|1|1|1|1|1|1|1|1|1|0|0|0|0|0|0|0|0|0|0|0:198:819:39:53:100:PASS
glyA    251 .   T   C   24638   PASS    AC=10;AN=25;DP=962;MQ=39;NS=1   GT:GQ:DP:MQ:PS:PQ:FT    1|0|1|0|0|0|0|0|0|0|0|0|0|0|1|1|1|1|1|1|1|1|0|0|0:338:962:39:53:100:PASS
glyA    266 .   T   C   39208.1 PASS    AC=15;AN=25;DP=993;MQ=38;NS=1   GT:GQ:DP:MQ:PS:PQ:FT    0|1|0|1|1|1|1|1|1|1|1|1|1|0|1|0|0|0|0|0|0|0|1|1|1:338:993:38:53:100:PASS
glyA    281 .   T   C   24638   PASS    AC=10;AN=25;DP=912;MQ=38;NS=1   GT:GQ:DP:MQ:PS:PQ:FT    1|0|1|0|0|0|0|0|0|0|0|0|0|1|0|1|1|1|1|1|1|0|1|0|0:570:912:38:53:100:PASS
glyA    287 .   A   T   24638   PASS    AC=10;AN=25;DP=927;MQ=38;NS=1   GT:GQ:DP:MQ:PS:PQ:FT    1|0|1|0|0|0|0|0|0|0|0|0|0|1|0|1|1|1|1|1|1|0|1|0|0:570:927:38:53:100:PASS
glyA    299 .   T   C   24638   PASS    AC=9;AN=25;DP=914;MQ=38;NS=1    GT:GQ:DP:MQ:PS:PQ:FT    1|0|1|0|0|0|0|0|0|0|0|0|0|0|0|1|1|1|1|1|1|0|1|0|0:570:914:38:53:100:PASS
glyA    308 .   G   A   24638   PASS    AC=10;AN=25;DP=903;MQ=38;NS=1   GT:GQ:DP:MQ:PS:PQ:FT    1|0|1|1|0|0|0|0|0|0|0|0|0|0|0|1|1|1|1|1|1|0|1|0|0:570:903:38:53:100:PASS
glyA    314 .   T   C,G 24638   PASS    AC=3,7;AN=25;DP=934;MQ=38;NS=1  GT:GQ:DP:MQ:PS:PQ:FT    2|0|2|2|0|0|0|0|0|0|0|0|0|0|0|1|1|1|2|2|2|0|2|0|0:289:934:38:53:100:PASS
glyA    317 .   G   A   24638   PASS    AC=3;AN=25;DP=979;MQ=38;NS=1    GT:GQ:DP:MQ:PS:PQ:FT    0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|1|1|0|0|0|0|0|0|0:289:979:38:53:100:PASS
glyA    335 .   A   G   24638   PASS    AC=11;AN=25;DP=1047;MQ=39;NS=1  GT:GQ:DP:MQ:PS:PQ:FT    1|0|1|1|0|0|0|0|0|0|0|1|1|0|0|1|1|1|0|1|1|0|1|0|0:391:1047:39:53:100:PASS
glyA    336 .   C   T   40814.4 PASS    AC=14;AN=25;DP=1044;MQ=39;NS=1  GT:GQ:DP:MQ:PS:PQ:FT    0|1|0|0|1|1|1|1|1|1|1|0|0|1|1|0|0|0|1|0|0|1|0|1|1:440:1044:39:53:100:PASS
glyA    350 .   A   G   24638   PASS    AC=4;AN=25;DP=1024;MQ=39;NS=1   GT:GQ:DP:MQ:PS:PQ:FT    0|0|0|1|0|0|0|0|0|0|1|0|0|0|0|0|1|1|0|0|0|0|0|0|0:328:1024:39:53:100:PASS
glyA    380 .   T   C   38198.8 PASS    AC=12;AN=25;DP=946;MQ=40;NS=1   GT:GQ:DP:MQ:PS:PQ:FT    0|1|0|0|1|1|1|1|0|0|0|1|0|1|1|0|0|0|1|0|0|1|0|1|1:311:946:40:53:100:PASS
glyA    402 .   A   G   24638   PASS    AC=6;AN=25;DP=955;MQ=40;NS=1    GT:GQ:DP:MQ:PS:PQ:FT    0|0|0|1|0|0|0|1|0|1|1|0|0|0|0|0|1|1|0|0|0|0|0|0|0:464:955:40:53:100:PASS
glyA    440 .   G   C   32132.5 PASS    AC=11;AN=25;DP=965;MQ=41;NS=1   GT:GQ:DP:MQ:PS:PQ:FT    1|1|0|0|1|1|0|0|0|0|0|1|0|1|1|0|0|0|1|0|0|1|0|1|1:245:965:41:53:100:PASS
glyA    554 .   C   T   25963.2 PASS    AC=14;AN=25;DP=234;MQ=40;NS=1   GT:GQ:DP:MQ:PS:PQ:FT    1|1|1|0|0|1|0|1|0|0|0|1|0|1|1|0|0|1|0|0|1|1|1|1|1:51:234:40:53:100:PASS
mherold1 commented 4 years ago

For the glnA example I agree with your assessment that there is strong read support, this should not be a coincidence and I will try to find out what has happened here. However it seems that the "correct" haplotypes have a much higher depth than the others.

For the glyA example I have a hard time seeing where reads overlap but there seems to be at least some examples were the evidence for the haplotypes seems to be relying on few overlapping reads. In your example with 10 haplotypes the "correct" haplotypes are _5, _6, and _9. Also here it seems that quite some of the correct reads end up in _10, which I assume is the unassignable fraction.

The most likely explanation that comes to mind would be that the strains used in the test sample have not been pure as assumed, I will look into this again.

Thanks a lot again for looking into this and the impressively quick response time. Feel free to close this issue.