luntergroup / octopus

Bayesian haplotype-based mutation calling
MIT License
302 stars 38 forks source link

How to use gvcf files? #116

Closed bredelings closed 4 years ago

bredelings commented 4 years ago

Hi,

I used the development branch of octopus to compute POSITIONAL and BLOCKED gvcf files for a malaria dataset. I want to generate a FASTA sequence from each gvcf file that replaces each site that is missing or does not PASS with an N. (One goal is just to remove sites without sufficient depth.)

  1. I am a newbie with VCF, but I can't seem to find any tools that will process the VCF files coming out of octopus. Mainly the tools don't like "<*>". Can you give a recommendation?

    • The --legacy option has been removed on the dev branch. Any update on creating a post-processing script?
    • bcftools 1.10.2 still claims not to support "<*>" or any symbolic alleles beside , so it seems I cannot use that.
    • vcftools 0.1.15 writes out literal "<><><*>..." to the fasta file. Not sure about later versions.
    • there is something called gvcftools, but it seems to be abandoned. Is there something I'm missing?
  2. With the positional gvcf at least, it would seem that I should not need a reference in order to extract a fasta. Do you know of any tools that do that?

  3. In issue #100, you gave some hints about removing alleles, and modifying such alleles to point to 0. Does this same advice hold for <> in gvcf files?

  4. I think I have played around with a python vcf library. In theory I could load each record and modify the record and/or just write out the allele for the record. However, I saw you had an extended argument with Heng Li about making vcf support *. So, maybe there would be an issue if there are bubbles in the assembly graph? Not sure.

Anyway, I'm sorry that this is not really a bug report, but more of a combination of a request for help, a request for documentation, and a feature request. Thanks for any advice you might have!

dancooke commented 4 years ago

I can't seem to find any tools that will process the VCF files coming out of octopus. Mainly the tools don't like "<*>". Can you give a recommendation?

I'm surprised bcftools doesn't fully support <*> since this is the recommended representation for gvcf non-reference alleles according to the VCF Spec 4.3 (see section 5.5). Probably worth an issue on the bcftools GitHub if there isn't already. As for alternatives, perhaps try the GATK tools? I don't use gvcf much myself so don't have a good idea about analysis tools.

The --legacy option has been removed on the dev branch. Any update on creating a post-processing script?

This is pretty simple to do with pysam. I may include such a script in the next release, but I removed the --legacy option because I think it's important to work with * and downstream tools should have better support for it.

With the positional gvcf at least, it would seem that I should not need a reference in order to extract a fasta. Do you know of any tools that do that?

That's correct, apart from Octopus doesn't always report a record at each position (if the posterior is less than --min-refcall-posterior). I don't know of any tools, but it should be pretty simple to do with pysam.

In issue #100, you gave some hints about removing alleles, and modifying such alleles to point to 0. Does this same advice hold for <> in gvcf files?

No since you always need an ALT allele, even though the GT of reference calls should always be 0|0 (for diploid). * and <*> are - confusingly - unrelated.

I think I have played around with a python vcf library. In theory I could load each record and modify the record and/or just write out the allele for the record. However, I saw you had an extended argument with Heng Li about making vcf support *. So, maybe there would be an issue if there are bubbles in the assembly graph?

Not exactly sure of your point here; you mean * would indicate a heterozygous position?

bredelings commented 4 years ago
I can't seem to find any tools that will process the VCF files
coming out of octopus. Mainly the tools don't like "<*>". Can you
give a recommendation?

I'm surprised bcftools doesn't fully support |<*>| since this is the recommended representation for gvcf non-reference alleles according to the VCF Spec 4.3 https://urldefense.com/v3/__https://samtools.github.io/hts-specs/VCFv4.3.pdf__;!!OToaGQ!86r9todo0cAywrr8HqVGKDJuUOsqDCHSjwf93OsVxcFnTZ7vfCZZ8jaYf63CdHn5iBAOkrU$ (see section 5.5). Probably worth an issue on the bcftools GitHub if there isn't already. As for alternatives, perhaps try the GATK tools? I don't use gvcf much myself so don't have a good idea about analysis tools.

Apparently the development version of bcftools on github now handles <*>.  I was going to file an issue with bcftools as you suggested, but I found a closed bcftools issue stating that the feature is now implemented, so thanks.

I was able to get a filtered version of the chromosome with variants applied :-)

The command looked like this, which hopefully isn't completely wrong:

bcftools consensus -f denovo.chr1.fasta denovo.chr1.positional.gvcf.gz -M N -a N -i 'FORMAT/FT == "PASS"' > chr1.masked-denovo.fasta
The --legacy option has been removed on the dev branch. Any update
on creating a post-processing script?

This is pretty simple to do with pysam. I may include such a script in the next release, but I removed the |--legacy| option because I think it's important to work with |*| and downstream tools should have better support for it.

Now that bcftools supports <*>, maybe writing a script is less urgent.  Perhaps removing the --legacy option paid off?  Not sure.

With the positional gvcf at least, it would seem that I should not
need a reference in order to extract a fasta. Do you know of any
tools that do that?

That's correct, apart from Octopus doesn't always report a record at each position (if the posterior is less than |--min-refcall-posterior|).

With the '-a N' flag to bcftools, missing positions are replaced with N, which is what I was hoping for.

I don't know of any tools, but it should be pretty simple to do with pysam.

Luckily I don't need to do this now, but were you suggesting that I filter based on the evidence BAM?  I do have a script that parses a BAM, tries to extract the majority allele at each reference position, and writes out Ns if the coverage is too low. I don't think it handles insertions or deletions though, whereas in theory I presume bcftools consensus + octopus gvcf would apply indel variants.

In issue #100
<https://urldefense.com/v3/__https://github.com/luntergroup/octopus/issues/100__;!!OToaGQ!86r9todo0cAywrr8HqVGKDJuUOsqDCHSjwf93OsVxcFnTZ7vfCZZ8jaYf63CdHn5SuIBAXE$>,
you gave some hints about removing * alleles, and modifying such
alleles to point to 0. Does this same advice hold for <*> in gvcf
files?

No since you always need an |ALT| allele, even though the |GT| of reference calls should always be |0|0| (for diploid). || and |<>| are - confusingly - unrelated.

OK, thanks!  Searching for info on biostars and such indicates that people are indeed confused by and <>.  Though I suppose I should actually read the spec.

I think I have played around with a python vcf library. In theory
I could load each record and modify the record and/or just write
out the allele for the record. However, I saw you had an extended
argument with Heng Li about making vcf support *. So, maybe there
would be an issue if there are bubbles in the assembly graph?

Not exactly sure of your point here; you mean |*| would indicate a heterozygous position?

Yeah, that was unclear, probably because I wasn't quite sure what I was asking about.

I guess I'm wondering if (i) two records can overlap the same reference positions and (ii) the two records can specify different alleles.

bcftools does print a few complaints:

Note: the --sample option not given, applying all records regardless of the genotype
The site pseudo_used_LT635612:172134 overlaps with another variant, skipping...
The site pseudo_used_LT635612:421914 overlaps with another variant, skipping...
The site pseudo_used_LT635612:473895 overlaps with another variant, skipping...
The site pseudo_used_LT635612:473920 overlaps with another variant, skipping...
The site pseudo_used_LT635612:504738 overlaps with another variant, skipping...
The site pseudo_used_LT635612:597850 overlaps with another variant, skipping...
The site pseudo_used_LT635612:598007 overlaps with another variant, skipping...
The site pseudo_used_LT635612:599334 overlaps with another variant, skipping...
The site pseudo_used_LT635612:599354 overlaps with another variant, skipping...
The site pseudo_used_LT635612:599362 overlaps with another variant, skipping...
The site pseudo_used_LT635612:600135 overlaps with another variant, skipping...
The site pseudo_used_LT635612:600313 overlaps with another variant, skipping...
Applied 186 variants

Should I worry about this?  I am not sure.  I'm not sure if bcftools is taking the first record that occurs in the file, or its skipping all records for positions with overlapping records.

Thanks again for your help!

-BenRI

P.S. I'm not sure of the best way to do this, but I'm wondering how best to add this info to the docs.  I could add a page to the wiki on using gvcfs.  I don't see the source code for the pdf manual.

DiDeoxy commented 4 years ago

Hi, Dan, you mention it is easy to convert the * allele to something downstream tools can use. I haven't used pysam before, how would I go about this? I want to use Octopus gvcfs with GLnexus.

dancooke commented 4 years ago

Which version of Octopus are you using and what downstream tool? The latest develop branch of Octopus is compliant with VCF 4.3 that allows *...

DiDeoxy commented 4 years ago

I am using the latest develop version as of last week when you fixed my refcall issue. I want to cohort call with GLnexus. I understand that Octopus is compliant but GLnexus is not and they don't have immediate plans for support, I've asked.

dancooke commented 4 years ago

Oh I see, you can use something like the following to convert genotyped * records to REF then:

import pysam as ps

in_vcf = ps.VariantFile(in_vcf_filename)
out_vcf = ps.VariantFile(out_vcf_name, 'w', header=in_vcf.header)
for rec in in_vcf:
    if '*' in rec.alts:
        star_indices = []
        for idx, allele in enumerate(rec.alts):
            if allele == '*':
                star_indices.append(idx)
        for sample in in_vcf.header.samples:
            rec.samples[sample]['GT'] = [idx for idx in rec.samples[sample]['GT'] if idx not in star_indices else 0]
    out_vcf.write(rec)
out_vcf.write.close()

I haven't tested this! Note the check for multiple instances of * in the ALTs - this is because Octopus sometimes reports multiple * on the same record in an attempt to indicate different partial overlaps.

DiDeoxy commented 4 years ago

Hi, Dan,

I got some entries after running this that look like this:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  ERR2570720
chr6H   199256913       .       CA      *,*     36.77   PASS    AC=1,1;AN=2;DP=10;MQ=60;NS=1;RFGQ_ALL=8.06      GT:GQ:DP:MQ:PS:PQ:RFGQ:FT2/0:8:10:60:199256912:21:8.06:PASS

2 alternate * alleles. Downstream, software doesn't like this and it doesn't make sense, how should I modify the above code to avoid this in future?

Cheers,

Max.

dancooke commented 4 years ago

Hmm, what's the record at this site before running the script? This is certainly odd but technically still valid VCF as far as I can tell.

DiDeoxy commented 4 years ago

Interestingly, it is identical before the script. My bad, thought it had something to do with the merging of records.

GLnexus doesn't like it though, somthing about the alternate alleles being non-distinct (have lost the error message).

Would it be appropriate to replace *,* with * ?

DiDeoxy commented 4 years ago

Looking at it closer I think see that there are two dissimilar alleles at this location? Possibly deletions? So it would not make sense to replace. Is removal my only option?

dancooke commented 4 years ago

There's nothing in the VCF spec that indicates ALT alleles must be distinct, and actually there are some cases - although apparently not this one - where it might be helpful to report more than one * ALT allele. I'd report this is an issue to GLnexus.

Can you send a sub-BAM of this region for me to take a look at?

dancooke commented 4 years ago

By the way, I think that you might ultimately hit a dead-end trying to use Octopus + GLnexus for joint calling because Octopus does not report genotype likelihoods that I believe GLnexus requires.

DiDeoxy commented 4 years ago

1kb_subset_ERR2570720.zip

its actually 100 kilobases centered on the allele.

Thats too bad but not cohort calling isn't the biggest deal, mostly I just want to merge the gvcfs and then internally impute using beagle.