samtools / hts-specs

Specifications of SAM/BAM and related high-throughput sequencing file formats
http://samtools.github.io/hts-specs/
640 stars 174 forks source link

Filtering *alleles* in a multi allelic vcf record #115

Open yfarjoun opened 8 years ago

yfarjoun commented 8 years ago

Hello,

In large cohorts (~100K) multi-allelic sites are very common. Furthermore, good rare alleles can share the locus of bad, "common" variants (they are common since they seem to appear in many individuals, but upon careful examination they are most likely artifacts of one sort or another). The current VCF spec only allows for the filtering or passing of a variant site, and while I'm aware of discussions such as #77 which might affect the conclusions, I would like to open the filtering aspect for discussion here.

We are currently implementing allelic-specific filtering as an INFO annotation (AS_FILTER) with count A (one per alternative allele). This is ad-hoc and we would rather use something else that the community agrees on.

Our current proposal is to allow the FILTER field to contain a list of filter sets, one per alt-allele. These will be comma (,) separated (the sets remain semi-colon (;) separated). A FILTER field would therefore either contain a single set, or one set per alt-allele.

This change is not backwards-compatible since it implies that commas are no-longer allowed as part of the filter string, so an old VCF with commas in the filter name would be parsed incorrectly.

One objection might be that the genotype will be unclear if one of the alleles is filtered and the other is not, and while I agree that this is an unfortunate possible outcome, it is already allowed with the spanning deletion <*> symbolic allele, which refers to an upstream deletion (that might be filtered). A downstream tool could be written that will re-genotype in an filter-aware manner. This will also be important for users who choose a different sensitivity/specificity threshold, possibly leading to different filtration, and thus different genotyping.

I'm hoping to hear whether other users might find this useful, if there are other implementations that should be considered and also possible problems that this could cause.

Thanks!

Yossi.

ekg commented 8 years ago

Just as a point of reference, vcflib -f implements techniques to handle filtering the multiallelic cases. A dot is used to specify an allele that has been removed in an particular genotype. The FILTER field is not handled. I'd be happy to bring vcflib in line with what the group decides. On Dec 19, 2015 9:08 PM, "Yossi Farjoun" notifications@github.com wrote:

Hello,

In large cohorts (~100K) multi-allelic sites are very common. Furthermore, good rare alleles can share the locus of bad, "common" variants (they are common since they seem to appear in many individuals, but upon careful examination they are most likely artifacts of one sort or another). The current VCF spec only allows for the filtering or passing of a variant site, and while I'm aware of discussions such as #77 https://github.com/samtools/hts-specs/issues/77 which might affect the conclusions, I would like to open the filtering aspect for discussion here.

We are currently implementing allelic-specific filtering as an INFO annotation (AS_FILTER) with count A (one per alternative allele). This is ad-hoc and we would rather use something else that the community agrees on.

Our current proposal is to allow the FILTER field to contain a list of filter sets, one per alt-allele. These will be comma (,) separated (the sets remain semi-colon (;) separated). A FILTER field would therefore either contain a single set, or one set per alt-allele.

This change is not backwards-compatible since it implies that commas are no-longer allowed as part of the filter string, so an old VCF with commas in the filter name would be parsed incorrectly.

One objection might be that the genotype will be unclear if one of the alleles is filtered and the other is not, and while I agree that this is an unfortunate possible outcome, it is already allowed with the spanning deletion <*> symbolic allele, which refers to an upstream deletion (that might be filtered). A downstream tool could be written that will re-genotype in an filter-aware manner. This will also be important for users who choose a different sensitivity/specificity threshold, possibly leading to different filtration, and thus different genotyping.

I'm hoping to hear whether other users might find this useful, if there are other implementations that should be considered and also possible problems that this could cause.

Thanks!

Yossi.

— Reply to this email directly or view it on GitHub https://github.com/samtools/hts-specs/issues/115.

ekg commented 8 years ago

Sorry, I meant vcflib's vcffilter -f. On Dec 19, 2015 10:35 PM, "Erik Garrison" erik.garrison@gmail.com wrote:

Just as a point of reference, vcflib -f implements techniques to handle filtering the multiallelic cases. A dot is used to specify an allele that has been removed in an particular genotype. The FILTER field is not handled. I'd be happy to bring vcflib in line with what the group decides. On Dec 19, 2015 9:08 PM, "Yossi Farjoun" notifications@github.com wrote:

Hello,

In large cohorts (~100K) multi-allelic sites are very common. Furthermore, good rare alleles can share the locus of bad, "common" variants (they are common since they seem to appear in many individuals, but upon careful examination they are most likely artifacts of one sort or another). The current VCF spec only allows for the filtering or passing of a variant site, and while I'm aware of discussions such as #77 https://github.com/samtools/hts-specs/issues/77 which might affect the conclusions, I would like to open the filtering aspect for discussion here.

We are currently implementing allelic-specific filtering as an INFO annotation (AS_FILTER) with count A (one per alternative allele). This is ad-hoc and we would rather use something else that the community agrees on.

Our current proposal is to allow the FILTER field to contain a list of filter sets, one per alt-allele. These will be comma (,) separated (the sets remain semi-colon (;) separated). A FILTER field would therefore either contain a single set, or one set per alt-allele.

This change is not backwards-compatible since it implies that commas are no-longer allowed as part of the filter string, so an old VCF with commas in the filter name would be parsed incorrectly.

One objection might be that the genotype will be unclear if one of the alleles is filtered and the other is not, and while I agree that this is an unfortunate possible outcome, it is already allowed with the spanning deletion <*> symbolic allele, which refers to an upstream deletion (that might be filtered). A downstream tool could be written that will re-genotype in an filter-aware manner. This will also be important for users who choose a different sensitivity/specificity threshold, possibly leading to different filtration, and thus different genotyping.

I'm hoping to hear whether other users might find this useful, if there are other implementations that should be considered and also possible problems that this could cause.

Thanks!

Yossi.

— Reply to this email directly or view it on GitHub https://github.com/samtools/hts-specs/issues/115.

pd3 commented 8 years ago

The FILTER column was intended as an indicator that a site is truly polymorphic, rather than making assertions about the alleles. The meaning would be changed by this, all the programs and scripts which check for FILTER being equal to PASS would start removing wrong sites. An allele-specific filter in the INFO field would be good though!

dgmacarthur commented 8 years ago

@pd3 - we agree that this would break many existing tools. However, I do think it's worth considering adopting this as a way to encourage a shift from a site-centric to an allele-centric view in VCF.

Increasing sample sizes mean that issues associated with multiallelic sites are pervasive - in ExAC, with 60,000 exomes, around 7% of all sites are multiallelic.

Regardless of the original intent of the spec, it seems perfectly natural to me to have the FILTER field reflect allele-wise status; issues like allele ordering are already well-addressed. I suspect that if we'd been facing a VCF with 60,000 samples when the VCF spec was defined this is the approach we would have taken.

Given this problem is only going to get worse I'm in support of @yfarjoun's proposal: let's accept that this will break some pipelines, but pursue it for the sake of future projects with hundreds of thousands of samples.

lfrancioli commented 8 years ago

I also support @yfarjoun proposal as having filters both in the FILTER and INFO columns isn't very practical. I find @ekg comment interesting since I feel like the issue of how to handle genotypes at a multi allelic site that where some alleles are filtered and other ones aren't should also be tackled properly.

ekg commented 8 years ago

I said that I would implement what the group decides--- but I have to admit that it is very low priority to do so given the existence of a straightforward way to annotate and filter single alleles. I hope to maintaining VCF tools only as needed to support existing work. To explain:

@dgmacarthur

Increasing sample sizes mean that issues associated with multiallelic sites are pervasive - in ExAC, with 60,000 exomes, around 7% of all sites are multiallelic.

You don't need to have 60k exomes to have this problem. Approaches that don't handle multiallelic sites would have broken the 1000G project. In working with multiallelic sites I would annotate records with allele-specific qualities and allele-specific filters. So there is a clear way to work with this kind of data that doesn't require any change to the way the FILTER field works.

However, I do think it's worth considering adopting this as a way to encourage a shift from a site-centric to an allele-centric view in VCF.

Because VCF is semi-tabular and not built on top of a consistent structured format (XML or JSON), what should be backwards-compatible extensions of its semantics can become breaking changes. The columnar bits of the records can't be touched without risking existing implementations.

If you want to focus on alleles, make your records biallelic or define the INFO fields with Number=R or Number=A. This is well-supported.

for the sake of future projects with hundreds of thousands of samples

Let's not kick this whale up the beach any harder than we have to. I hope, for our sake, that the world never sees a VCF file with hundreds of thousands of samples in it.

How do you query a structure like that? In VCF, anything other than positional queries are extremely expensive.

How do you represent genotypes? When including deletions and other CNVs, most sites are multiallelic, so even the definition of a site is made more complicated. We will need to decide when to break them into smaller pieces.

I have been frustrated by these issues. Rather than try to resolve them by increasing the complexity of VCF, I have developed a graph based representation for collections of genomes and am working on methods to do resequencing analysis using this structure as a basis.

@yfarjoun

A downstream tool could be written that will re-genotype in an filter-aware manner. This will also be important for users who choose a different sensitivity/specificity threshold, possibly leading to different filtration, and thus different genotyping.

It's already possible to do this. In fact, that was a core feature of the pipeline we put mulitallelics and indels through in the 1000G. By filtering a particular allele with vcffilter -f, genotype likelihoods are modified to reflect the new set of alleles. Then, a genotyping tool (we used MVNCall) was applied to the filtered set of genotypes. You could also re-genotype based on the maximum-likelihood genotype, such as with vcfglxgt.

Adding a new set of semantics for a particular column isn't necessary.

ekg commented 8 years ago

@yfarjoun

We are currently implementing allelic-specific filtering as an INFO annotation (AS_FILTER) with count A (one per alternative allele). This is ad-hoc and we would rather use something else that the community agrees on.

So what I'm saying is that this doesn't seem ad-hoc to me :)

Keeping things ad-hoc and then standardizing them looks like the smooth way to handle this without the risk of breaking anything.

lh3 commented 8 years ago

While I agree allele-centric view is very important, I think compatibility is a much bigger deal. I also prefer to standardize a new INFO field for allele-specific filtering.

Rohit-Satyam commented 9 months ago

Is this being followed up on priority? Did you guys standardized this ?

d-cameron commented 9 months ago

This has not been standardised and is not currently a priority. If there has been movement in this regard (e.g. tools in the wild that use non-standardised fields allele-specific filtering) then we can raise the priority but it is currently buried among the many other lower priority issues.