samtools / hts-specs

Specifications of SAM/BAM and related high-throughput sequencing file formats
http://samtools.github.io/hts-specs/

SVTYPE description for nested values and multi-allelic variants #142

Open cyenyxe opened 8 years ago

cyenyxe commented 8 years ago

The SVTYPE field in the INFO column is currently described as:

##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">

Value should be one of DEL, INS, DUP, INV, CNV, BND. This key can be derived from the REF/ALT fields but is useful for filtering.

https://github.com/samtools/hts-specs/pull/130/commits/d2782545da1311ad615e4c6dff6fd2417f92c8a5 brings the possibility of using other types (requested in https://github.com/samtools/hts-specs/issues/89#issuecomment-112119514) but at least the following issues require discussion:

thefferon commented 8 years ago

Nested values: The ALT column accepts things like DUP:TANDEM, but this is not clear for SVTYPE.

My interpretation of the current spec is that subtypes are not allowed in SVTYPE. And I think this serves the purpose of maintaining clear categories in SVTYPE, allowing ALT to provide more detail.

Please clarify:

Supporting multiple SV alternate alleles: The values could match the order of the SV alternates.

Is this meant to refer to the possibility of having subtypes in SVTYPE above? The ALT column is allowed to have multiple values, and order matters. SVTYPE can only have one value, not a list.

Mixing SNP and SV as alternate alleles: Represent only the type of the SV, matching the order as in the previous point? Use missing values for non-SV alternates?

Are you referring only to SNPs and SVs that have the same value for POS? One can imagine having a large deletion on one allele that precludes the possibility of SNPs being reported within the deletion interval on the same allele – is there any check for this?

cyenyxe commented 8 years ago

Let me clarify with some examples (assuming subtypes are not allowed):

1 1000 A <CNV:1>,<CNV:2> SVTYPE=CNV, no issues here.

1 1000 A <INS>,<CNV> I guess that for big populations this could happen, even if rarely. Which SVTYPE would this one have? Would other INFO fields be affected too?

1 1000 A AT,<CNV> Would this have only SVTYPE=CNV? Not inserting a missing value could make matching the allele index numbers a bit more complex.

The specification doesn't restrict in any way reporting SNPs and SVs in the same POS. It could actually happen when mixing control and case populations, right? Healthy samples could have a SNP or very short INDEL, whereas the cases would have the SV.
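A minimal sketch of the "missing value for non-SV alternates" option raised above, assuming a per-allele list of top-level SV types is wanted. The name per_allele_svtypes and the use of "." as the placeholder are illustrative assumptions, not something the specification defines for SVTYPE.

```python
import re

# Matches a symbolic allele such as <CNV> or <DUP:TANDEM>; group 1 is the top-level type.
SYMBOLIC = re.compile(r"<([^:>]+)(?::[^>]*)?>")

def per_allele_svtypes(alt_column):
    """Map each ALT allele to its top-level SV type, using '.' for non-symbolic alleles."""
    types = []
    for allele in alt_column.split(","):
        match = SYMBOLIC.fullmatch(allele)
        types.append(match.group(1) if match else ".")
    return types

# The three ALT columns from the examples above.
for alt in ("<CNV:1>,<CNV:2>", "<INS>,<CNV>", "AT,<CNV>"):
    print(alt, "->", ",".join(per_allele_svtypes(alt)))
```

Keeping one entry per ALT allele means the list index lines up with the allele index used in genotypes, which is the matching concern mentioned for the third example.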

atks commented 8 years ago

With regard to the example 1 1000 A <CNV:1>,<CNV:2>:

In this case, I think the header has to define each symbolic allele separately. In extreme cases, as in the 1000G v5 VCF file, copy numbers can reach up to 124 copies, so 124 entries for the symbolic ALT allele have to be defined.

Is it possible to allow ALT symbolic types to contain a quantitative value, so that a single entry in the header would cover all possible copy numbers?

This is not just useful for CNVs, but also for Tandem Repeats.
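A minimal sketch of what a single quantitative pattern could look like on the parsing side, assuming alleles of the form <CNV:n> as in the earlier example. COPY_NUMBER_ALT and copy_number are hypothetical names, and nothing here is defined by the VCF specification.

```python
import re

# Hypothetical single pattern standing in for one header entry per copy number.
COPY_NUMBER_ALT = re.compile(r"<CNV:(\d+)>")

def copy_number(alt):
    """Return the copy number encoded in a <CNV:n> allele, or None for anything else."""
    match = COPY_NUMBER_ALT.fullmatch(alt)
    return int(match.group(1)) if match else None
```

With this, copy_number("<CNV:124>") yields the integer 124 without a dedicated header entry for that value, while a plain <CNV> or a sequence allele yields None.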

d-cameron commented 8 years ago

The problem with the spec-defined SV fields isn't limited to SVTYPE: IMPRECISE, END, CIPOS, CIEND, and CILEN all have the same problem, so multi-allelic SV variants currently cannot be unambiguously represented using the spec-defined VCF headers (see #133).

thefferon commented 8 years ago

General point of interest: One alternative solution is to list incompatible / problematic alleles in successive VCF records. The spec states:

POS - position: The reference position, with the 1st base having position 1. Positions are sorted numerically, in increasing order, within each reference sequence CHROM. It is permitted to have multiple records with the same POS. Telomeres are indicated by using positions 0 or N+1, where N is the length of the corresponding chromosome or contig. (Integer, Required)

This won't (cleanly) solve the problem of trying to express a genotype consisting of two non-reference alleles with the same POS, but it bears keeping in mind.
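A minimal sketch of the successive-records alternative, assuming only the CHROM, POS, REF and ALT columns matter. The function name split_alleles is hypothetical, and genotype columns are deliberately ignored, which is exactly the part this approach does not solve cleanly.

```python
def split_alleles(chrom, pos, ref, alt_column):
    """Return one (chrom, pos, ref, alt) record per ALT allele, all sharing the same POS."""
    return [(chrom, pos, ref, allele) for allele in alt_column.split(",")]

# A record mixing a short insertion and an SV becomes two records at POS 1000.
for record in split_alleles("1", 1000, "A", "AT,<CNV>"):
    print(*record, sep="\t")
```

Each resulting record has a single ALT, so single-valued fields such as SVTYPE can be set per record.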

pd3 commented 8 years ago

The CNV:num type may not always be practical. Imagine a site with highly variable copy number: you'd have to list all encountered cases in ALT. The ALT should only state that this is a copy number variation, and a custom field, such as FORMAT/CN, can give the actual copy number.
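A minimal sketch of reading the copy number from a per-sample CN field instead of from ALT, assuming the FORMAT column lists a CN key. The function name sample_copy_number and the example values are illustrative.

```python
def sample_copy_number(format_column, sample_column):
    """Return the integer CN value for one sample, or None if it is absent or missing."""
    keys = format_column.split(":")
    values = sample_column.split(":")
    fields = dict(zip(keys, values))
    cn = fields.get("CN", ".")
    return None if cn == "." else int(cn)

# ALT stays a plain <CNV>; the actual copy number travels with each sample.
print(sample_copy_number("GT:CN", "0/1:3"))
```

This keeps ALT at a single <CNV> entry regardless of how many distinct copy numbers occur at the site.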

cyenyxe commented 8 years ago

Unfortunately, even if inconvenient, it's something people are already using. To continue with the 1kG example, I have extracted this variant from Phase 3 v3:

7 82951 esv3611748;esv3611749 G <CNV:0>,<CNV:2>

All the CNV combinations are properly declared in the header. The specification itself does not prevent using plain numbers as an ALT ID "sub-value". So at the moment there are at least two valid ways to represent this case, which can be a problem for further file processing.

thefferon commented 8 years ago

Clarification: The VCF I have seen for 1000 Genomes copy number variants lists SVTYPE as "CNV" but ALT alleles as "<CN0>", "<CN1>", "<CN2>", etc., often listing several ALTs separated by commas. This differs from the examples above, which list ALTs as "<CNV:0>", "<CNV:2>", etc. @cyenyxe, have you seen examples of the latter in VCF? It may be important to keep the syntax right for this discussion.

atks commented 8 years ago

@thefferon - the <CN0> style is for v5 of the 1000 Genomes VCF file. The <CNV:0> style is for v3.