Closed bredelings closed 4 years ago
I can't seem to find any tools that will process the VCF files coming out of octopus. Mainly the tools don't like "<*>". Can you give a recommendation?
I'm surprised bcftools doesn't fully support <*>
since this is the recommended representation for gvcf non-reference alleles according to the VCF Spec 4.3 (see section 5.5). Probably worth an issue on the bcftools GitHub if there isn't already. As for alternatives, perhaps try the GATK tools? I don't use gvcf much myself so don't have a good idea about analysis tools.
The --legacy option has been removed on the dev branch. Any update on creating a post-processing script?
This is pretty simple to do with pysam. I may include such a script in the next release, but I removed the --legacy
option because I think it's important to work with *
and downstream tools should have better support for it.
With the positional gvcf at least, it would seem that I should not need a reference in order to extract a fasta. Do you know of any tools that do that?
That's correct, apart from Octopus doesn't always report a record at each position (if the posterior is less than --min-refcall-posterior
). I don't know of any tools, but it should be pretty simple to do with pysam.
In issue #100, you gave some hints about removing alleles, and modifying such alleles to point to 0. Does this same advice hold for <> in gvcf files?
No since you always need an ALT
allele, even though the GT
of reference calls should always be 0|0
(for diploid). *
and <*>
are - confusingly - unrelated.
I think I have played around with a python vcf library. In theory I could load each record and modify the record and/or just write out the allele for the record. However, I saw you had an extended argument with Heng Li about making vcf support *. So, maybe there would be an issue if there are bubbles in the assembly graph?
Not exactly sure of your point here; you mean *
would indicate a heterozygous position?
I can't seem to find any tools that will process the VCF files coming out of octopus. Mainly the tools don't like "<*>". Can you give a recommendation?
I'm surprised bcftools doesn't fully support |<*>| since this is the recommended representation for gvcf non-reference alleles according to the VCF Spec 4.3 https://urldefense.com/v3/__https://samtools.github.io/hts-specs/VCFv4.3.pdf__;!!OToaGQ!86r9todo0cAywrr8HqVGKDJuUOsqDCHSjwf93OsVxcFnTZ7vfCZZ8jaYf63CdHn5iBAOkrU$ (see section 5.5). Probably worth an issue on the bcftools GitHub if there isn't already. As for alternatives, perhaps try the GATK tools? I don't use gvcf much myself so don't have a good idea about analysis tools.
Apparently the development version of bcftools on github now handles
<*>
. I was going to file an issue with bcftools as you suggested, but I found a closed bcftools issue stating that the feature is now implemented, so thanks.
I was able to get a filtered version of the chromosome with variants applied :-)
The command looked like this, which hopefully isn't completely wrong:
bcftools consensus -f denovo.chr1.fasta denovo.chr1.positional.gvcf.gz -M N -a N -i 'FORMAT/FT == "PASS"' > chr1.masked-denovo.fasta
The --legacy option has been removed on the dev branch. Any update on creating a post-processing script?
This is pretty simple to do with pysam. I may include such a script in the next release, but I removed the |--legacy| option because I think it's important to work with |*| and downstream tools should have better support for it.
Now that bcftools supports <*>, maybe writing a script is less urgent. Perhaps removing the --legacy option paid off? Not sure.
With the positional gvcf at least, it would seem that I should not need a reference in order to extract a fasta. Do you know of any tools that do that?
That's correct, apart from Octopus doesn't always report a record at each position (if the posterior is less than |--min-refcall-posterior|).
With the '-a N' flag to bcftools, missing positions are replaced with
N
, which is what I was hoping for.I don't know of any tools, but it should be pretty simple to do with pysam.
Luckily I don't need to do this now, but were you suggesting that I filter based on the evidence BAM? I do have a script that parses a BAM, tries to extract the majority allele at each reference position, and writes out Ns if the coverage is too low. I don't think it handles insertions or deletions though, whereas in theory I presume bcftools consensus + octopus gvcf would apply indel variants.
In issue #100 <https://urldefense.com/v3/__https://github.com/luntergroup/octopus/issues/100__;!!OToaGQ!86r9todo0cAywrr8HqVGKDJuUOsqDCHSjwf93OsVxcFnTZ7vfCZZ8jaYf63CdHn5SuIBAXE$>, you gave some hints about removing * alleles, and modifying such alleles to point to 0. Does this same advice hold for <*> in gvcf files?
No since you always need an |ALT| allele, even though the |GT| of reference calls should always be |0|0| (for diploid). || and |<>| are - confusingly - unrelated.
OK, thanks! Searching for info on biostars and such indicates that people are indeed confused by and <>. Though I suppose I should actually read the spec.
I think I have played around with a python vcf library. In theory I could load each record and modify the record and/or just write out the allele for the record. However, I saw you had an extended argument with Heng Li about making vcf support *. So, maybe there would be an issue if there are bubbles in the assembly graph?
Not exactly sure of your point here; you mean |*| would indicate a heterozygous position?
Yeah, that was unclear, probably because I wasn't quite sure what I was asking about.
I guess I'm wondering if (i) two records can overlap the same reference positions and (ii) the two records can specify different alleles.
bcftools does print a few complaints:
Note: the --sample option not given, applying all records regardless of the genotype
The site pseudo_used_LT635612:172134 overlaps with another variant, skipping...
The site pseudo_used_LT635612:421914 overlaps with another variant, skipping...
The site pseudo_used_LT635612:473895 overlaps with another variant, skipping...
The site pseudo_used_LT635612:473920 overlaps with another variant, skipping...
The site pseudo_used_LT635612:504738 overlaps with another variant, skipping...
The site pseudo_used_LT635612:597850 overlaps with another variant, skipping...
The site pseudo_used_LT635612:598007 overlaps with another variant, skipping...
The site pseudo_used_LT635612:599334 overlaps with another variant, skipping...
The site pseudo_used_LT635612:599354 overlaps with another variant, skipping...
The site pseudo_used_LT635612:599362 overlaps with another variant, skipping...
The site pseudo_used_LT635612:600135 overlaps with another variant, skipping...
The site pseudo_used_LT635612:600313 overlaps with another variant, skipping...
Applied 186 variants
Should I worry about this? I am not sure. I'm not sure if bcftools is taking the first record that occurs in the file, or its skipping all records for positions with overlapping records.
Thanks again for your help!
-BenRI
P.S. I'm not sure of the best way to do this, but I'm wondering how best to add this info to the docs. I could add a page to the wiki on using gvcfs. I don't see the source code for the pdf manual.
Hi, Dan, you mention it is easy to convert the * allele to something downstream tools can use. I haven't used pysam before, how would I go about this? I want to use Octopus gvcfs with GLnexus.
Which version of Octopus are you using and what downstream tool? The latest develop branch of Octopus is compliant with VCF 4.3 that allows *
...
I am using the latest develop version as of last week when you fixed my refcall issue. I want to cohort call with GLnexus. I understand that Octopus is compliant but GLnexus is not and they don't have immediate plans for support, I've asked.
Oh I see, you can use something like the following to convert genotyped *
records to REF
then:
import pysam as ps
in_vcf = ps.VariantFile(in_vcf_filename)
out_vcf = ps.VariantFile(out_vcf_name, 'w', header=in_vcf.header)
for rec in in_vcf:
if '*' in rec.alts:
star_indices = []
for idx, allele in enumerate(rec.alts):
if allele == '*':
star_indices.append(idx)
for sample in in_vcf.header.samples:
rec.samples[sample]['GT'] = [idx for idx in rec.samples[sample]['GT'] if idx not in star_indices else 0]
out_vcf.write(rec)
out_vcf.write.close()
I haven't tested this! Note the check for multiple instances of *
in the ALT
s - this is because Octopus sometimes reports multiple *
on the same record in an attempt to indicate different partial overlaps.
Hi, Dan,
I got some entries after running this that look like this:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT ERR2570720
chr6H 199256913 . CA *,* 36.77 PASS AC=1,1;AN=2;DP=10;MQ=60;NS=1;RFGQ_ALL=8.06 GT:GQ:DP:MQ:PS:PQ:RFGQ:FT2/0:8:10:60:199256912:21:8.06:PASS
2 alternate * alleles. Downstream, software doesn't like this and it doesn't make sense, how should I modify the above code to avoid this in future?
Cheers,
Max.
Hmm, what's the record at this site before running the script? This is certainly odd but technically still valid VCF as far as I can tell.
Interestingly, it is identical before the script. My bad, thought it had something to do with the merging of records.
GLnexus doesn't like it though, somthing about the alternate alleles being non-distinct (have lost the error message).
Would it be appropriate to replace *,* with * ?
Looking at it closer I think see that there are two dissimilar alleles at this location? Possibly deletions? So it would not make sense to replace. Is removal my only option?
There's nothing in the VCF spec that indicates ALT
alleles must be distinct, and actually there are some cases - although apparently not this one - where it might be helpful to report more than one *
ALT
allele. I'd report this is an issue to GLnexus.
Can you send a sub-BAM of this region for me to take a look at?
By the way, I think that you might ultimately hit a dead-end trying to use Octopus + GLnexus for joint calling because Octopus does not report genotype likelihoods that I believe GLnexus requires.
its actually 100 kilobases centered on the allele.
Thats too bad but not cohort calling isn't the biggest deal, mostly I just want to merge the gvcfs and then internally impute using beagle.
Hi,
I used the development branch of octopus to compute POSITIONAL and BLOCKED gvcf files for a malaria dataset. I want to generate a FASTA sequence from each gvcf file that replaces each site that is missing or does not PASS with an N. (One goal is just to remove sites without sufficient depth.)
I am a newbie with VCF, but I can't seem to find any tools that will process the VCF files coming out of octopus. Mainly the tools don't like "<*>". Can you give a recommendation?
, so it seems I cannot use that.With the positional gvcf at least, it would seem that I should not need a reference in order to extract a fasta. Do you know of any tools that do that?
In issue #100, you gave some hints about removing alleles, and modifying such alleles to point to
0
. Does this same advice hold for <> in gvcf files?I think I have played around with a python vcf library. In theory I could load each record and modify the record and/or just write out the allele for the record. However, I saw you had an extended argument with Heng Li about making vcf support *. So, maybe there would be an issue if there are bubbles in the assembly graph? Not sure.
Anyway, I'm sorry that this is not really a bug report, but more of a combination of a request for help, a request for documentation, and a feature request. Thanks for any advice you might have!