samtools / hts-specs

Specifications of SAM/BAM and related high-throughput sequencing file formats
http://samtools.github.io/hts-specs/

VCF/BCF phasing is not compatible with ploidy>2 #298

Open yfarjoun opened 6 years ago

yfarjoun commented 6 years ago

The spec discusses phased genotypes in a manner that seems to be agnostic to ploidy, but there seems to be an implicit ploidy=2 assumption in the VCF spec.

The following situation seems to be impossible to encode: ploidy=3, two variants, three different alleles in each:

variant 1: Genotype has alleles A/B/C
variant 2: Genotype has alleles D/E/F

Scenario 1: B and E are phased, but the other alleles are unknown.
Scenario 2: all three alleles from V1 are phased with the alleles in V2 (respectively).

I can only think of A|B|C and E|D|F as the "correct" VCF encoding for both of these scenarios. Theoretically, one might be able to reorder and write scenario 1 as B|A/C and E|D/F, but that might not be possible due to an upstream phasing of one of the other alleles.

In BCF, this can be encoded, since the spec adds 1 to the encoded value of each allele that is phased.
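To make the BCF point concrete, here is a minimal Python sketch of the per-allele encoding. It assumes the BCF2 layout in which each GT value is `(allele_index + 1) << 1 | phased_bit`, with a missing allele stored as index -1. The function names are hypothetical, and the allele indices 0, 1, 2 simply stand in for A, B, C.

```python
def encode_bcf_gt(alleles, phased):
    """Encode each allele as ((index + 1) << 1) | phased_bit.

    `alleles` holds allele indices (None for a missing allele);
    `phased` holds one boolean per allele.
    """
    encoded = []
    for allele, is_phased in zip(alleles, phased):
        index = -1 if allele is None else allele
        encoded.append(((index + 1) << 1) | int(is_phased))
    return encoded


def decode_bcf_gt(values):
    """Invert encode_bcf_gt, returning (allele_index_or_None, phased) pairs."""
    decoded = []
    for value in values:
        shifted = value >> 1
        allele = None if shifted == 0 else shifted - 1
        decoded.append((allele, bool(value & 1)))
    return decoded


# Scenario 1, variant 1: alleles A/B/C with only B carrying the phase flag.
scenario_1 = encode_bcf_gt([0, 1, 2], [False, True, False])
print(scenario_1)
print(decode_bcf_gt(scenario_1))
```

Because the flag travels with each allele, the middle allele can be marked as phased without reordering the genotype, which is exactly what the VCF text form cannot express.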

This issue is meant to serve as a discussion point for possible solutions or clarifications that we might want to make.

ctsa commented 6 years ago

One proposal to get things started:

In the VCF GT field, consider the vertical bar following each allele to be equivalent to the BCF allele +1 flag, so that a trailing bar becomes possible in this field, such as D/E/F|. For backward compatibility, the final bar is optional (or disallowed?) when all other alleles in the genotype are phased, so the interpretation of all existing VCF GT fields is unchanged.
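A rough Python sketch of how a parser might read GT strings under this proposal is below. It is only an illustration, not spec text: the function name `parse_gt_trailing_bar` is hypothetical, and it assumes the "optional" variant, where a missing final bar means the last allele is phased only if every other allele is phased. For a haploid call with no separator that rule is vacuously true, which is a choice of this sketch rather than something the proposal states.

```python
import re

# An allele is a non-negative integer or "."; separators are "|" or "/".
_GT_PATTERN = re.compile(r"(?:\d+|\.)(?:[|/](?:\d+|\.))*\|?")


def parse_gt_trailing_bar(gt):
    """Return (allele_index_or_None, phased) pairs for a GT string.

    The separator that follows an allele gives that allele's phase flag.
    """
    if not _GT_PATTERN.fullmatch(gt):
        raise ValueError(f"not a valid GT string: {gt!r}")
    tokens = re.findall(r"\d+|\.|[|/]", gt)
    alleles = [None if t == "." else int(t) for t in tokens[0::2]]
    phased = [sep == "|" for sep in tokens[1::2]]
    if len(phased) < len(alleles):
        # No trailing bar: keep the existing meaning of GT fields by
        # treating the last allele as phased only if all others are.
        phased.append(all(phased))
    return list(zip(alleles, phased))


print(parse_gt_trailing_bar("0/1/2|"))  # only the last allele is phased
print(parse_gt_trailing_bar("0/1|2"))   # only the middle allele is phased
print(parse_gt_trailing_bar("0|1|2"))   # existing fully phased form
```

Under these assumptions, scenario 1 could be written as A/B|C and D/E|F without reordering, while existing strings such as 0|1|2 and 0/1 keep their current interpretation.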

yfarjoun commented 6 years ago

I like this idea!

pd3 commented 6 years ago

I like the proposal as well, with making the final bar optional rather than disallowed.

EDIT: Actually, I don't mind if it is disallowed. I can imagine trailing bars breaking many tools, and in that case the bar does not add any information.

d-cameron commented 5 years ago

While we are at it, phasing information is even less well defined for somatic genomes with copy number change. Even if the tool can determine which additional copies are maternal and paternal, the specs don't have a way to encode this information. Furthermore, when the copy number gets very high (e.g. 100+), we get a combinatorial explosion in the GL field.
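To put rough numbers on the GL growth, the sketch below counts unordered genotypes, assuming one likelihood per genotype so that a site with N alleles (including the reference) and ploidy P needs C(N + P - 1, P) values. The helper name `gl_entry_count` is hypothetical.

```python
from math import comb


def gl_entry_count(n_alleles, ploidy):
    """Number of unordered genotypes, i.e. multisets of size `ploidy`
    drawn from `n_alleles` alleles (REF plus ALTs)."""
    return comb(n_alleles + ploidy - 1, ploidy)


for n_alleles, ploidy in [(2, 2), (3, 3), (2, 100), (4, 100)]:
    print(n_alleles, ploidy, gl_entry_count(n_alleles, ploidy))
```

With two alleles the count grows only linearly in ploidy (101 values at ploidy 100), but with four alleles at ploidy 100 it already reaches 176851 values per sample.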