Closed mherold1 closed 4 years ago
Just to clarify some terminology, as you appear to be using "allele" and "haplotype" inconsistently, when referring to variant calls:
REF
or ALT
column of a single VCF record (i.e. a variant site). The site genotype (GT
) is composed of alleles.We can only call haplotypes if we can correctly phase alleles at heterozygous variant sites. It's therefore possible to correctly infer alleles and incorrectly infer haplotypes (i.e. phase errors), but not vice versa.
Your interpretation of the phase sets is correct. However, I don't quite understand your comment
This is already a bit confusing for the haplotype reconstruction as only one of the predicted haplotypes (green) has the last 2 SNVs (the lower 2 tracks are simulated reads for the correct alleles)
If I'm reading these pileups correctly (each track corresponds to a strain?), then it appears to me that Octopus has called this region correctly, however, it's certainly possible that the following could have called instead
gene1 62 . A G 3076.53 PASS AC=1;AN=2;DP=1137;MQ=255;NS=1 GT:GQ:DP:MQ:PS:PQ:FT 0|1:3077:1137:255:62:100:PASS
gene1 158 . G A 3076.53 PASS AC=1;AN=2;DP=1509;MQ=255;NS=1 GT:GQ:DP:MQ:PS:PQ:FT 0|1:3077:1509:255:62:100:PASS
gene1 162 . C T 3076.53 PASS AC=1;AN=2;DP=1502;MQ=255;NS=1 GT:GQ:DP:MQ:PS:PQ:FT 1|0:3077:1502:255:62:100:PASS
gene1 419 . T C 3076.53 PASS AC=1;AN=1;DP=1069;MQ=255;NS=1 GT:GQ:DP:MQ:PS:PQ:FT 1|1:15:1069:255:62:100:PASS
gene1 434 . C T 3076.53 PASS AC=1;AN=1;DP=1020;MQ=255;NS=1 GT:GQ:DP:MQ:PS:PQ:FT 1|1:15:1020:255:62:100:PASS
The reason that it wasn't called this way is due to default heuristics in the algorithm that prevent attempting to construct haplotypes where there are clearly no reads able to inform phasing. This heuristic is applied before the downstream (w.r.t the current calling window) sites have been called, so it escapes the algorithm that these sites are actually homozygous and could have been included. You can possibly achieve the calls I proposed by adding the --backtrack-level AGGRESSIVE
option, as this will force the algorithm to try to join previously 'called' haplotypes.
I'm struggling to parse your second example. I don't see follow the calls Octopus makes with what I'm seeing on the pileups, plus Octopus is only calling two phase sets here, not 5 as you suggest. Did you post the right example?
Coming back to your question...
I have tested different parameter settings, but I am still not sure which ones would be best suited to improve results, as overall predictions work but are not really accurate. What parameters do I need to set to discriminate between highly similar alleles and enforce strict read phasing, --phasing-level conservative?
What exactly do you mean by "not really accurate", are you seeing false positive or negative variant calls, or do you believe that the phase calls are incorrect? I don't understand what you mean by "to discriminate between highly similar alleles and enforce strict read phasing", what do you mean by "similar" and "strict read phasing"? Octopus phases alleles using genotype posteriors, which are informed by genotype likelihoods (i.e. read information) and prior information, but the prior is only weakly informative for the polyclone model, so is unlikely to outweigh the information from the genotype likelihoods. If you simply want longer phase sets then there are numerous things to try:
--lagging-level AGGRESSIVE
--backtrack-level AGGRESSIVE
--extension-level AGGRESSIVE
--max-haplotypes 400
, or an even larger value (default is 200).--min-phase-score 5
, or lower (default is 10).But ultimately the phase length will be restricted by the information provided by the reads.
How can I extract the inferred haplotypes from the vcf or would I need to assemble reads in the bam-output with the same "hi" tag?
In what format do you want the inferred haplotypes (i.e. what do you mean by "extract")? If you simply want the sequence then you could try using something like bcftools consensus
, but you'll need to manually split the genotypes up into composing haplotypes. It wouldn't make sense to try to "assemble" reads using the hi
tag in this way since this is precisely what Octopus has already done. The hi
tag tell you which haplotype Octopus thinks the read originated from (if it is not ambiguous).
Thanks for the quick reply!
Just to clarify some terminology, as you appear to be using "allele" and "haplotype" inconsistently
Sorry for the confusion, I was referring to the complete gene by "allele", so a combination of several variants. So I should refer to this as haplotype? So a phase set would be a combination of variants and alleles that can be linked through co-occurence on reads?
If I'm reading these pileups correctly (each track corresponds to a strain?)
In the pileups the first track is sequencing data (i.e. the mix of strains) colored by hi tag (so the octopus output), the following tracks are simulated pileups corresponding to the correct haplotypes (?) of the strains that are present (so the ground truth for the haplotypes that I would like to reconstruct)
then it appears to me that Octopus has called this region correctly,
Yes for example1 the region has been called correctly except for the homozygous variants no. 4,5 in the 3 strains as you mentioned. I tried adding the backtrack-level option. Leading to the following result:
gene1 62 . A G 3076.53 PASS AC=1;AN=3;DP=637;MQ=41;NS=1 GT:GQ:DP:MQ:PS:PQ:FT 0|0|1:3077:637:41:62:100:PASS
gene1 158 . G A 3076.53 PASS AC=2;AN=3;DP=1349;MQ=41;NS=1 GT:GQ:DP:MQ:PS:PQ:FT 1|0|1:3077:1349:41:62:100:PASS
gene1 162 . C T 3076.53 PASS AC=1;AN=3;DP=1340;MQ=41;NS=1 GT:GQ:DP:MQ:PS:PQ:FT 0|1|0:3077:1340:41:62:100:PASS
gene1 419 . T C 3076.53 PASS AC=3;AN=3;DP=984;MQ=41;NS=1 GT:GQ:DP:MQ:PS:PQ:FT 1|1|1:3077:984:41:62:100:PASS
gene1 434 . C T 3076.53 PASS AC=3;AN=3;DP=958;MQ=41;NS=1 GT:GQ:DP:MQ:PS:PQ:FT 1|1|1:3040:958:41:62:100:PASS
3 genotypes: 0-1-0-1-1 (false positive) 0-0-1-1-1 (correct) 1-1-0-1-1 (correct)
In the bam output I see 4 different hi-tags
I'm struggling to parse your second example. I don't see follow the calls Octopus makes with what I'm seeing on the pileups, plus Octopus is only calling two phase sets here, not 5 as you suggest. Did you post the right example?
Sorry, I indeed posted the wrong vcf excerpt for example2, I edited the original post.
What exactly do you mean by "not really accurate", are you seeing false positive or negative variant calls, or do you believe that the phase calls are incorrect? I don't understand what you mean by "to discriminate between highly similar alleles and enforce strict read phasing", what do you mean by "similar" and "strict read phasing"?
The variant calls are accurate except for the genotype information, where I often see false positive, sometimes false negative genotypes inferred (so number of genotype and order of alleles within phase sets). I would like to have longer phase sets but also be more stringent as in genotypes/haplotypes are only inferred if the same alleles appear together on a large number of reads. However, then these alleles should be used to discriminate between haplotypes stringently. In this example when reads cover and have the same allele on 2 variants they inferred to come from "haplotype brown", but when they cover only 1 of these 2 variants they are assigned to blue:
While it is not a great example, e.g., in the pileup for example2 (see original post) there is an inconsistency for the "blue haplotype", where some of the reads in the middle do not share the same allele at SNV-pos179 (no. 7).
I will provide a few more examples as I test different settings.
In what format do you want the inferred haplotypes (i.e. what do you mean by "extract")
I meant extracting variant calls with the alleles corresponding to each haplotype. For example 1 this would be straight-forward simply looking at the GT field as the order is consistent. However, for example2 I could not know the order in which alleles in the GT field belong together across different phase sets. This seems to be different for "hi" fields that link reads with specific alleles across these phase sets, but not always correctly (e.g. for example2 "green hi" on the left and "red hi" in the middle part should be the same haplotype).
I was referring to the complete gene by "allele", so a combination of several variants. So I should refer to this as haplotype?
Yes, well actually a combination of alleles, which are instances of a variants.
So a phase set would be a combination of variants and alleles that can be linked through co-occurence on reads?
You can view it this way, although the truth is a little more complicated since you're ignoring prior information that could be used to determine the phasing of alleles. The prior model used by the polyclone model is relatively uninformative (compared with read information) in this regard. However, other prior models can be more informative and even be sufficient to determine phase (e.g. the trio prior model, since this essentially phases by decent).
Yes for example1 the region has been called correctly except for the homozygous variants no. 4,5 in the 3 strains as you mentioned. I tried adding the backtrack-level option. Leading to the following result:
The two homozygous variants are not called incorrectly, it's just the algorithm didn't include them in the same phase set as the other variants. It's important to be aware that the assertions about the number of haplotypes inferred is local to the phase sets; you can view phase sets as independent inferences. So although the algorithm inferred two haplotypes in one phase set, it's still perfectly valid that only one can be found in another. The maximum number of haplotypes inferred (over all phase sets) provides a lower bound on the number of (inferred) unique strains in your sample. However, only if the phase set covered the entire contig could you make a claim about the actual number.
3 genotypes: 0-1-0-1-1 (false positive) 0-0-1-1-1 (correct) 1-1-0-1-1 (correct)
On closer inspection if your pileups, this actually does appear correct. In addition to the two simulated strains, I can see two reads in the mixture (top and second bottom) that appear to support (0-1-0-1-1
). How did you generate these simulated reads?
Sorry, I indeed posted the wrong vcf excerpt for example2, I edited the original post.
I can see how these match up with the pileups now but not your breakdown...
first 5 SNVs in PS 53: 1-0-1-1-1 [1st, green] 1-0-1-0-0 (wrong) 0-1-0-0-0 [2nd and 3rd, blue in the beginning]
That's not what is reported in the VCF, in this notation it would be
0-1-1-1-1 [1st] 1-0-0-0-0 [2nd] 0-0-0-0-0 [3rd]
I can't tell if this is correct or not since your mixed reads don't cover the first variant. However, it appears to me that you have strong read support in the mixture for haplotypes not present any of the "truth" strains. For example, there are at least 12 reads supporting a haplotype with [89:A
, 92:T
, 101:T
, 107:C
] (i.e. [REF
, REF
, ALT
, ALT
]), but this haplotype doesn't appear in any strain. The closest would be [89:G
, 92:C
, 101:T
, 107:C
] (i.e. [ALT
, ALT
, ALT
, ALT
], the 'green' haplotype). This - and the unexpected reads in example 1 - strongly suggests to me something is going wrong in your simulated mixture, which is probably confusing Octopus. Note that the maximum number of called haplotypes is capped by --max-clones
, so if you actually have more haplotypes than this in your data then your results will be wrong. In principle, the default (3) should be sufficient for your purpose, but I strongly suspect that you've inadvertently simulated more haplotypes than this.
In this example when reads cover and have the same allele on 2 variants they inferred to come from "haplotype brown", but when they cover only 1 of these 2 variants they are assigned to blue
Most likely is that the haplotype index for blue is used denote an ambiguous assignment (i.e. two or more of the called haplotype are equally probable). This makes sense if you look at the pileup, the brown reads cover the A
SNV and the G
SNV, but the blue reads only manage to cover the A
SNV, I'd bet that there was a called haplotype that included the reference allele at the A
SNV + the G
SNV (although there are no reads supporting this haplotype in this pileup), and therefore the likelihood of these two haplotypes would be equal for the blue reads.
I could not know the order in which alleles in the GT field belong together across different phase sets.
This is exactly the purpose of phase sets, to inform you when you can and can't assume consistent ordering. If you want to be able to phase these variants then you'll tune the algorithm to achieve larger phase lengths, but ultimately you will be limited by your experimental design, in which case you'll need to consider alternatives, such as. longer or linked reads (e.g. 10X).
One more point, Octopus can currently only report fully phased phase sets, it's possible that partial phasing could be achieved in cases of three or more haplotypes but not full phasing. This is something I'm hoping to address over the next few months.
3 genotypes: 0-1-0-1-1 (false positive) 0-0-1-1-1 (correct) 1-1-0-1-1 (correct)
On closer inspection if your pileups, this actually does appear correct. In addition to the two simulated strains, I can see two reads in the mixture (top and second bottom) that appear to support (0-1-0-1-1). How did you generate these simulated reads?
The reads in the hi-colored pileups are not simulated, so they could include sequencing reads that actually support these "incorrect" inferences e.g. because of sequencing errors. The other pileups are simulated reads that should merely illustrate which variants to expect for the correct haplotypes.
One aspect I would like to improve here is to make the genotype inference more stringent here so that these reads do not lead to a false positive prediction. I should have linked to the full pileups instead of adding screenshots with which I could not show the full pileups. So these reads that support these seemingly wrong predictions are very few in the full context. I think another aspect that is important here is more is also a more stringent read pre-processing as these "errors" seem to often appear in the end and beginning of the reads
I can see how these match up with the pileups now but not your breakdown... That's not what is reported in the VCF, in this notation it would be 0-1-1-1-1 [1st] 1-0-0-0-0 [2nd] 0-0-0-0-0 [3rd]
I might have overwritten the VCF with a different run by accident, so you are correct that my breakdown does not match the VCF output. However, 0-0-0-0-0 should also not be detected.
The reads in the hi-colored pileups are not simulated, so they could include sequencing reads that actually support these "incorrect" inferences e.g. because of sequencing errors.
Can you provide a BAM file of the mixture for these two genes? I'm happy to take a closer look, but my prior on two independent reference-reverting sequencing errors occurring in otherwise clean reads is pretty small.
Can I send the files via email?
Sure, if it's small enough to get past the email servers... dcooke@well.ox.ac.uk.
Ok, I've had a look at the BAM and I'm pretty happy with what Octopus is doing here. There is very strong read support for the haplotypes that you claim should not be present.
Looking at the first example (glnA
), Octopus calls the following:
$ octopus -R all_bactag.fa -I MLST-2T-Spiked-Mix.sorted.bam -C polyclone -T glnA --max-clones 5 --disable-downsampling -o glnA.vcf --bamout glnA.bam --backtrack AGGRESSIVE --max-haplotypes 400
$ bcftools view -H glnA.vcf
glnA 62 . A G 33776 PASS AC=2;AN=4;DP=680;MQ=41;NS=1 GT:GQ:DP:MQ:PS:PQ:FT 0|0|1|1:10000:680:41:62:100:PASS
glnA 158 . G A 54696.8 PASS AC=2;AN=4;DP=3167;MQ=41;NS=1 GT:GQ:DP:MQ:PS:PQ:FT 1|0|1|0:10000:3167:41:62:100:PASS
glnA 162 . C T 30438.2 PASS AC=2;AN=4;DP=3203;MQ=41;NS=1 GT:GQ:DP:MQ:PS:PQ:FT 0|1|0|1:10000:3203:41:62:100:PASS
glnA 419 . T C 100000 PASS AC=4;AN=4;DP=3075;MQ=41;NS=1 GT:GQ:DP:MQ:PS:PQ:FT 1|1|1|1:10000:3075:41:62:100:PASS
glnA 434 . C T 100000 PASS AC=4;AN=4;DP=2720;MQ=41;NS=1 GT:GQ:DP:MQ:PS:PQ:FT 1|1|1|1:10000:2720:41:62:100:PASS
So, Octopus is claiming 4 haplotypes. Now lets look at the evidence BAM. To make it a bit easier to show the different haplotypes I'll split the evidence BAM by haplotype support using the bundled Python script:
$ split_realigned_bam.py -b glnA.bam -o glnA.split
The resulting split BAMs look like:
Note that the last track contains the ambiguous reads. The approximate depth for each of the first four tracts is 180x, 715x, 500x, and 80x. This is pretty convincing imho.
Now turning to glyA
:
$ octopus -R all_bactag.fa -I MLST-2T-Spiked-Mix.sorted.bam -C polyclone -T glyA --max-clones 12 --disable-downsampling -o glnA.vcf --bamout glnA.bam --backtrack AGGRESSIVE --max-haplotypes 400
$ split_realigned_bam.py -b glyA.bam -o glyA.split
$ bcftools view -H glyA.vcf
glyA 53 . T C 26086.5 PASS AC=4;AN=10;DP=145;MQ=36;NS=1 GT:GQ:DP:MQ:PS:PQ:FT 1|1|1|1|0|0|0|0|0|0:968:145:36:53:100:PASS
glyA 71 . T C 22873.8 PASS AC=6;AN=10;DP=279;MQ=37;NS=1 GT:GQ:DP:MQ:PS:PQ:FT 0|0|0|0|1|1|1|1|1|1:662:279:37:53:100:PASS
glyA 89 . A G 26086.5 PASS AC=4;AN=10;DP=373;MQ=38;NS=1 GT:GQ:DP:MQ:PS:PQ:FT 1|1|1|1|0|0|0|0|0|0:968:373:38:53:100:PASS
glyA 92 . T C 26086.5 PASS AC=4;AN=10;DP=384;MQ=38;NS=1 GT:GQ:DP:MQ:PS:PQ:FT 1|1|1|1|0|0|0|0|0|0:968:384:38:53:100:PASS
glyA 101 . C T 26086.5 PASS AC=5;AN=10;DP=420;MQ=38;NS=1 GT:GQ:DP:MQ:PS:PQ:FT 1|1|1|1|0|0|0|0|0|1:662:420:38:53:100:PASS
glyA 107 . T C 26086.5 PASS AC=5;AN=10;DP=439;MQ=38;NS=1 GT:GQ:DP:MQ:PS:PQ:FT 1|1|1|1|0|0|0|0|0|1:662:439:38:53:100:PASS
glyA 119 . C T 22873.8 PASS AC=5;AN=10;DP=574;MQ=39;NS=1 GT:GQ:DP:MQ:PS:PQ:FT 0|0|0|0|1|1|1|1|1|0:968:574:39:53:100:PASS
glyA 179 . G A 25642.5 PASS AC=4;AN=10;DP=819;MQ=39;NS=1 GT:GQ:DP:MQ:PS:PQ:FT 1|1|1|0|0|0|0|0|0|1:620:819:39:53:100:PASS
glyA 251 . T C 22873.8 PASS AC=5;AN=10;DP=962;MQ=39;NS=1 GT:GQ:DP:MQ:PS:PQ:FT 0|0|0|1|1|1|1|1|0|0:6093:962:39:53:100:PASS
glyA 266 . T C 39313.4 PASS AC=5;AN=10;DP=993;MQ=38;NS=1 GT:GQ:DP:MQ:PS:PQ:FT 1|1|1|0|0|0|0|0|1|1:1381:993:38:53:100:PASS
glyA 281 . T C 22873.8 PASS AC=5;AN=10;DP=912;MQ=38;NS=1 GT:GQ:DP:MQ:PS:PQ:FT 0|0|0|1|1|1|1|1|0|0:1383:912:38:53:100:PASS
glyA 287 . A T 22873.8 PASS AC=5;AN=10;DP=927;MQ=38;NS=1 GT:GQ:DP:MQ:PS:PQ:FT 0|0|0|1|1|1|1|1|0|0:1381:927:38:53:100:PASS
glyA 299 . T C 22873.8 PASS AC=5;AN=10;DP=914;MQ=38;NS=1 GT:GQ:DP:MQ:PS:PQ:FT 0|0|0|1|1|1|1|1|0|0:5320:914:38:53:100:PASS
glyA 308 . G A 22873.8 PASS AC=5;AN=10;DP=903;MQ=38;NS=1 GT:GQ:DP:MQ:PS:PQ:FT 0|0|0|1|1|1|1|1|0|0:5320:903:38:53:100:PASS
glyA 314 . T C,G 22873.8 PASS AC=1,4;AN=10;DP=934;MQ=38;NS=1 GT:GQ:DP:MQ:PS:PQ:FT 0|0|0|2|1|2|2|2|0|0:4838:934:38:53:100:PASS
glyA 317 . G A 22873.8 PASS AC=1;AN=10;DP=979;MQ=38;NS=1 GT:GQ:DP:MQ:PS:PQ:FT 0|0|0|0|1|0|0|0|0|0:4838:979:38:53:100:PASS
glyA 335 . A G 22873.8 PASS AC=5;AN=10;DP=1047;MQ=39;NS=1 GT:GQ:DP:MQ:PS:PQ:FT 0|0|1|1|1|0|1|1|0|0:1381:1047:39:53:100:PASS
glyA 336 . C T 39050.2 PASS AC=5;AN=10;DP=1044;MQ=39;NS=1 GT:GQ:DP:MQ:PS:PQ:FT 1|1|0|0|0|1|0|0|1|1:1381:1044:39:53:100:PASS
glyA 350 . A G 22873.8 PASS AC=1;AN=10;DP=1024;MQ=39;NS=1 GT:GQ:DP:MQ:PS:PQ:FT 0|0|0|0|1|0|0|0|0|0:4838:1024:39:53:100:PASS
glyA 380 . T C 36434.6 PASS AC=5;AN=10;DP=946;MQ=40;NS=1 GT:GQ:DP:MQ:PS:PQ:FT 1|1|1|0|0|0|0|0|1|1:8069:946:40:53:100:PASS
glyA 402 . A G 22873.8 PASS AC=1;AN=10;DP=955;MQ=40;NS=1 GT:GQ:DP:MQ:PS:PQ:FT 0|0|0|0|1|0|0|0|0|0:4838:955:40:53:100:PASS
glyA 440 . G C 30368.2 PASS AC=5;AN=10;DP=965;MQ=41;NS=1 GT:GQ:DP:MQ:PS:PQ:FT 1|0|1|0|0|0|1|0|1|1:1335:965:41:53:100:PASS
glyA 554 . C T 24199 PASS AC=6;AN=10;DP=234;MQ=40;NS=1 GT:GQ:DP:MQ:PS:PQ:FT 0|0|1|1|1|0|1|0|1|1:502:234:40:53:100:PASS
Note that I set --max-clones 12
but only 10 haplotypes were inferred. The split BAMs look like
Once again, it's pretty convincing. There's no way this is all due to sequencing errors - something else is going on in your data that is resulting in more haplotypes than you expect.
I've made some improvements to the polyclone
calling model since looking into this example - it should be faster now. I'd recommend checking out the latest develop branch.
Having starred at the second set of pileups a bit longer it appeared that there are some more haplotypes in this data (i.e. more than 10). I found a bug that was causing Octopus to fail when the number of clones went above 10, which I've now fixed in b7a5a3e07a13c241ab7a5c545947d52ec38016c5. I've just re-run with --max-clones 14
and got calls with 14 haplotypes, so I would guess that there is at-least this number.
It appears the magic number is 25!
$ octopus -R all_bactag.fa -I MLST-2T-Spiked-Mix.sorted.bam -C polyclone -T glyA --max-clones 30 --disable-downsampling -o glyA.vcf --bamout glyA.bam --backtrack AGGRESSIVE --max-haplotypes 1000 --max-genotypes 10000
bcftools view -H glyA.vcf
glyA 53 . T C 27850.7 PASS AC=14;AN=25;DP=145;MQ=36;NS=1 GT:GQ:DP:MQ:PS:PQ:FT 1|1|1|1|1|1|1|1|1|1|1|1|1|1|0|0|0|0|0|0|0|0|0|0|0:10000:145:36:53:100:PASS
glyA 71 . T C 24638 PASS AC=12;AN=25;DP=279;MQ=37;NS=1 GT:GQ:DP:MQ:PS:PQ:FT 1|1|0|0|0|0|0|0|0|0|0|0|0|0|1|1|1|1|1|1|1|1|1|1|0:3:279:37:53:100:PASS
glyA 89 . A G 27850.7 PASS AC=12;AN=25;DP=373;MQ=38;NS=1 GT:GQ:DP:MQ:PS:PQ:FT 0|0|1|1|1|1|1|1|1|1|1|1|1|1|0|0|0|0|0|0|0|0|0|0|0:198:373:38:53:100:PASS
glyA 92 . T C 27850.7 PASS AC=12;AN=25;DP=384;MQ=38;NS=1 GT:GQ:DP:MQ:PS:PQ:FT 0|0|1|1|1|1|1|1|1|1|1|1|1|1|0|0|0|0|0|0|0|0|0|0|0:198:384:38:53:100:PASS
glyA 101 . C T 27850.7 PASS AC=14;AN=25;DP=420;MQ=38;NS=1 GT:GQ:DP:MQ:PS:PQ:FT 1|1|1|1|1|1|1|1|1|1|1|1|1|1|0|0|0|0|0|0|0|0|0|0|0:10000:420:38:53:100:PASS
glyA 107 . T C 27850.7 PASS AC=14;AN=25;DP=439;MQ=38;NS=1 GT:GQ:DP:MQ:PS:PQ:FT 1|1|1|1|1|1|1|1|1|1|1|1|1|1|0|0|0|0|0|0|0|0|0|0|0:10000:439:38:53:100:PASS
glyA 119 . C T 24638 PASS AC=10;AN=25;DP=574;MQ=39;NS=1 GT:GQ:DP:MQ:PS:PQ:FT 0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|1|1|1|1|1|1|1|1|1|0:10000:574:39:53:100:PASS
glyA 179 . G A 27850.7 PASS AC=12;AN=25;DP=819;MQ=39;NS=1 GT:GQ:DP:MQ:PS:PQ:FT 0|0|1|1|1|1|1|1|1|1|1|1|1|1|0|0|0|0|0|0|0|0|0|0|0:198:819:39:53:100:PASS
glyA 251 . T C 24638 PASS AC=10;AN=25;DP=962;MQ=39;NS=1 GT:GQ:DP:MQ:PS:PQ:FT 1|0|1|0|0|0|0|0|0|0|0|0|0|0|1|1|1|1|1|1|1|1|0|0|0:338:962:39:53:100:PASS
glyA 266 . T C 39208.1 PASS AC=15;AN=25;DP=993;MQ=38;NS=1 GT:GQ:DP:MQ:PS:PQ:FT 0|1|0|1|1|1|1|1|1|1|1|1|1|0|1|0|0|0|0|0|0|0|1|1|1:338:993:38:53:100:PASS
glyA 281 . T C 24638 PASS AC=10;AN=25;DP=912;MQ=38;NS=1 GT:GQ:DP:MQ:PS:PQ:FT 1|0|1|0|0|0|0|0|0|0|0|0|0|1|0|1|1|1|1|1|1|0|1|0|0:570:912:38:53:100:PASS
glyA 287 . A T 24638 PASS AC=10;AN=25;DP=927;MQ=38;NS=1 GT:GQ:DP:MQ:PS:PQ:FT 1|0|1|0|0|0|0|0|0|0|0|0|0|1|0|1|1|1|1|1|1|0|1|0|0:570:927:38:53:100:PASS
glyA 299 . T C 24638 PASS AC=9;AN=25;DP=914;MQ=38;NS=1 GT:GQ:DP:MQ:PS:PQ:FT 1|0|1|0|0|0|0|0|0|0|0|0|0|0|0|1|1|1|1|1|1|0|1|0|0:570:914:38:53:100:PASS
glyA 308 . G A 24638 PASS AC=10;AN=25;DP=903;MQ=38;NS=1 GT:GQ:DP:MQ:PS:PQ:FT 1|0|1|1|0|0|0|0|0|0|0|0|0|0|0|1|1|1|1|1|1|0|1|0|0:570:903:38:53:100:PASS
glyA 314 . T C,G 24638 PASS AC=3,7;AN=25;DP=934;MQ=38;NS=1 GT:GQ:DP:MQ:PS:PQ:FT 2|0|2|2|0|0|0|0|0|0|0|0|0|0|0|1|1|1|2|2|2|0|2|0|0:289:934:38:53:100:PASS
glyA 317 . G A 24638 PASS AC=3;AN=25;DP=979;MQ=38;NS=1 GT:GQ:DP:MQ:PS:PQ:FT 0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|1|1|0|0|0|0|0|0|0:289:979:38:53:100:PASS
glyA 335 . A G 24638 PASS AC=11;AN=25;DP=1047;MQ=39;NS=1 GT:GQ:DP:MQ:PS:PQ:FT 1|0|1|1|0|0|0|0|0|0|0|1|1|0|0|1|1|1|0|1|1|0|1|0|0:391:1047:39:53:100:PASS
glyA 336 . C T 40814.4 PASS AC=14;AN=25;DP=1044;MQ=39;NS=1 GT:GQ:DP:MQ:PS:PQ:FT 0|1|0|0|1|1|1|1|1|1|1|0|0|1|1|0|0|0|1|0|0|1|0|1|1:440:1044:39:53:100:PASS
glyA 350 . A G 24638 PASS AC=4;AN=25;DP=1024;MQ=39;NS=1 GT:GQ:DP:MQ:PS:PQ:FT 0|0|0|1|0|0|0|0|0|0|1|0|0|0|0|0|1|1|0|0|0|0|0|0|0:328:1024:39:53:100:PASS
glyA 380 . T C 38198.8 PASS AC=12;AN=25;DP=946;MQ=40;NS=1 GT:GQ:DP:MQ:PS:PQ:FT 0|1|0|0|1|1|1|1|0|0|0|1|0|1|1|0|0|0|1|0|0|1|0|1|1:311:946:40:53:100:PASS
glyA 402 . A G 24638 PASS AC=6;AN=25;DP=955;MQ=40;NS=1 GT:GQ:DP:MQ:PS:PQ:FT 0|0|0|1|0|0|0|1|0|1|1|0|0|0|0|0|1|1|0|0|0|0|0|0|0:464:955:40:53:100:PASS
glyA 440 . G C 32132.5 PASS AC=11;AN=25;DP=965;MQ=41;NS=1 GT:GQ:DP:MQ:PS:PQ:FT 1|1|0|0|1|1|0|0|0|0|0|1|0|1|1|0|0|0|1|0|0|1|0|1|1:245:965:41:53:100:PASS
glyA 554 . C T 25963.2 PASS AC=14;AN=25;DP=234;MQ=40;NS=1 GT:GQ:DP:MQ:PS:PQ:FT 1|1|1|0|0|1|0|1|0|0|0|1|0|1|1|0|0|1|0|0|1|1|1|1|1:51:234:40:53:100:PASS
For the glnA example I agree with your assessment that there is strong read support, this should not be a coincidence and I will try to find out what has happened here. However it seems that the "correct" haplotypes have a much higher depth than the others.
For the glyA example I have a hard time seeing where reads overlap but there seems to be at least some examples were the evidence for the haplotypes seems to be relying on few overlapping reads. In your example with 10 haplotypes the "correct" haplotypes are _5, _6, and _9. Also here it seems that quite some of the correct reads end up in _10, which I assume is the unassignable fraction.
The most likely explanation that comes to mind would be that the strains used in the test sample have not been pure as assumed, I will look into this again.
Thanks a lot again for looking into this and the impressively quick response time. Feel free to close this issue.
Hello,
thank you for your efforts in providing this tool. I have a question that I could not solve from the extensive documentation and thought maybe you could help me out.
My use case: I have mixtures of bacterial strains and I would like to discriminate the strains using sequencing data for several marker genes that have been amplified. I thought I could use this tool (with the
-C polyclone -P 1
option) to identify variants on these marker genes and reconstruct which alleles are likely to be present in the mixture. The examples shown below are from a test sample with a known mix of 3 strains, where strains can share the same allele (example1) or have distinct alleles (example2) of a marker gene. This is the command I used for the examples:My questions: I have tested different parameter settings, but I am still not sure which ones would be best suited to improve results, as overall predictions work but are not really accurate. What parameters do I need to set to discriminate between highly similar alleles and enforce strict read phasing,
--phasing-level conservative
? How can I extract the inferred haplotypes from the vcf or would I need to assemble reads in the bam-output with the same "hi" tag?I am also not sure if I understand the vcf output correctly. I found a previous issue that helped me a bit (https://github.com/luntergroup/octopus/issues/32) but this was not specific for ploidy 1. I assume that from the GT field I can reconstruct the genotypes as the order is consistent within the same phase group. So in the following case from the output vcf (example1):
There is phase group 62 (first 3 SNVs:62, 158, 162) and PS 419 with 2 SNVs (419, 434) that would give me 2 distinct alleles regarding these 5 variants: allele1: 1-1-0-1-1 allele2: 0-0-1-1-1 This is already a bit confusing for the haplotype reconstruction as only one of the predicted haplotypes (green) has the last 2 SNVs (the lower 2 tracks are simulated reads for the correct alleles):
A more complex example with a different gene (with 3 distinct alleles):
Here I have several phase groups and tried to "reconstruct" alleles similarly as in example1, in square brackets I noted to which correct phenotype (simulated read tracks) a combination corresponded to: first 5 SNVs in PS 53: 1-0-1-1-1 [1st, green] 1-0-1-0-0 (wrong) 0-1-0-0-0 [2nd and 3rd, blue in the beginning] 2 SNVs in PS 107: 1-0 [1st]
1-0 [1st]
0-1 [2nd and 3rd] 9 SNVs in PS 179: 1-0-1-0-0-0-0-0-0 [1st] 0-1-0-1-1-1-1-1-1 [3rd] 0-1-0-1-1-1-1-2-0 [2nd] 5 SNVs in PS 335: 0-1-0-1-0 [1st] 1-0-0-0-0 [2nd] 1-0-1-0-1 [3rd] last SNVs 2 in PS 440: 1-1 [1st] 0-0 [2nd and 3rd] 0-1 (wrong, but there are actually several reads with this combination)
Why are the same genotypes within a phase group collapsed sometimes (example1, PS:419) but not for others (example2 PS:107)?
output of
octopus --version
:Thanks for your help.