samtools / hts-specs

Specifications of SAM/BAM and related high-throughput sequencing file formats
http://samtools.github.io/hts-specs/

Modify VCF to support GRC assembly model (GRCh37, GRCh38, GRCm38) #51

Open deannachurch opened 9 years ago

deannachurch commented 9 years ago

There are a couple of limitations to the current VCF that make it difficult to fully represent data using the full GRC assemblies, GRCh37, GRCh38 and GRCm38 specifically. These are:

This issue was discussed at a workshop put on by the GRC at Genome Informatics 2014 and there are a series of proposals we'd like to put forth. A set of coherent examples can be found here: http://www.slideshare.net/GenomeRef/variant-calling-ii

##seq-info=<name=chr17, id=CM000679.2>
##region-info=<name=MAPT, id=GL000258.2, assoc_id=CM000679.2, reg=45309498-46836265>
##INFO=<ID=ALTLOCS, Number=.,Type=String,Description="A list of the alternate loci in the reference genome that are associated with this locus">

##INFO=<ID=ALTHAPS, Number=.,Type=String,Description="A list of the known haplotypes that are associated with this locus">

##FORMAT=<ID=HT,Number=1,Type=String,Description="Haplotype combination based on ALTHAPS">
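To make the proposal a little more concrete, here is a minimal sketch of how a tool might read header lines like these and work out which alternate loci are associated with a position on the primary sequence. The `seq-info` and `region-info` keys are the ones proposed above and are not part of the current VCF spec. The helper names `parse_meta_line` and `alt_loci_at` are hypothetical, and the sketch assumes that `reg` is a 1-based, inclusive interval on the sequence named by `assoc_id`, which the proposal does not state explicitly.

```python
import re

# Structured meta line: ##key=<k1=v1,k2="quoted, value",...>
META_RE = re.compile(r'^##(?P<key>[^=]+)=<(?P<body>.*)>$')
# One field: name=value, where value is a double-quoted string or a comma-free run.
FIELD_RE = re.compile(r'\s*([^=,\s]+)\s*=\s*("(?:[^"\\]|\\.)*"|[^,]*)')


def parse_meta_line(line):
    """Split a structured header line into (key, {field: value})."""
    m = META_RE.match(line.strip())
    if m is None:
        raise ValueError("not a structured meta line: %r" % line)
    fields = {}
    for name, value in FIELD_RE.findall(m.group("body")):
        value = value.strip()
        # Drop the surrounding double quotes of a quoted value.
        if len(value) >= 2 and value[0] == '"' and value[-1] == '"':
            value = value[1:-1]
        fields[name] = value
    return m.group("key"), fields


def alt_loci_at(region_fields, seq_id, pos):
    """Return ids of proposed region-info entries that cover pos on seq_id.

    Assumption: 'reg' is START-END, 1-based and inclusive, on 'assoc_id'.
    """
    hits = []
    for fields in region_fields:
        start, end = (int(x) for x in fields["reg"].split("-"))
        if fields["assoc_id"] == seq_id and start <= pos <= end:
            hits.append(fields["id"])
    return hits


header = [
    '##seq-info=<name=chr17, id=CM000679.2>',
    '##region-info=<name=MAPT, id=GL000258.2, assoc_id=CM000679.2, '
    'reg=45309498-46836265>',
]
parsed = [parse_meta_line(line) for line in header]
regions = [fields for key, fields in parsed if key == "region-info"]
print(alt_loci_at(regions, "CM000679.2", 46000000))  # ['GL000258.2']
```

A caller could use a lookup like this to populate ALTLOCS for each record; whether that is the right thing to report is exactly the kind of best-practice question this proposal is meant to open up.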

There may be additional improvements or suggestions that can be made, but these seem like a reasonable start. Making these types of modifications will be an important part of helping groups migrate to GRCh38.

lh3 commented 9 years ago

On the representation of alt contigs, I think we should develop a best practice before modifying the spec. What is the intended output from variant callers? Is it practical for callers to generate such output? How are downstream tools supposed to use the VCF?

Specifically, you proposed to add HT, but in my experience, alt contigs frequently recombine with each other, which makes the tag inapplicable most of the time. In addition, how are we supposed to use ALTLOCS? If we know that a locus overlaps an alt contig, what can we do with it?

The answers will become clearer, and we can then develop the right spec, once more researchers gain experience with h38. Tools determine the adoption of alt contigs. It is not urgent to change the spec.

deannachurch commented 9 years ago

I think this is a bit of a chicken and egg problem. If we want variant callers to be able to use the Alt loci, we need to be able to express the variants in VCF. This doesn't work well with the current spec (see how dbSNP distributes data).

I think the issue is that there are multiple ways to use VCF; it is just a reporting tool. dbSNP uses it to dump data, so you want to report all genomic contexts for a given SNV. An argument could be made that in the context of an individual genome, you may only want to report one context for a SNP, but how do you handle that when you have multiple samples in the VCF? I fear that decision making will be hard. I agree that trying to define some best practices is useful. To attempt to address some specific issues:

This is really meant to start the discussion about how we want to represent variation on GRCh38. It will be good to have some concrete examples.