Open cyenyxe opened 8 years ago
Nested values: The ALT column accepts things like DUP:TANDEM, but this is not clear for SVTYPE.
My interpretation of the current spec is that subtypes are not allowed in SVTYPE. And I think this serves the purpose of maintaining clear categories in SVTYPE, allowing ALT to provide more detail.
Please clarify:
Supporting multiple SV alternate alleles: The values could match the order of the SV alternates.
Is this meant to refer to the possibility of having subtypes in SVTYPE above? The ALT column is allowed to have multiple values, and order matters. SVTYPE can only have one value, not a list.
Mixing SNP and SV as alternate alleles: Represent only the type of the SV, matching the order as in the previous point? Use missing values for non-SV alternates?
Are you referring only to SNPs and SVs that have the same value for POS? One can imagine having a large deletion on one allele that precludes the possibility of SNPs being reported within the deletion interval on the same allele – is there any check for this?
Let me clarify with some examples (assuming subtypes are not allowed):
1 1000 A <CNV:1>,<CNV:2>
SVTYPE=CNV, no issues here.
1 1000 A <INS>,<CNV>
I guess that for big populations this could happen, even if rarely. Which SVTYPE would this one have? Would other INFO fields be affected too?
1 1000 A AT,<CNV>
Would this have only SVTYPE=CNV? Not inserting a missing value could make matching the allele index numbers a bit more complex.
The specification doesn't in any way restrict reporting SNPs and SVs at the same POS. It could actually happen when mixing control and case populations, right? Healthy samples could have a SNP or very short INDEL, whereas the cases would have the SV.
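To make the missing-value idea from the examples above concrete, here is a minimal Python sketch. It assumes a per-allele SVTYPE list that follows ALT order and uses `.` for non-SV alternates, which is only the proposal under discussion and not something the current spec defines. The function names are hypothetical, and subtypes are collapsed to the top-level type on the assumption that subtypes are not allowed in SVTYPE.

```python
from typing import List


def is_symbolic(alt: str) -> bool:
    """Return True for symbolic ALT alleles such as <CNV> or <DUP:TANDEM>."""
    return alt.startswith("<") and alt.endswith(">")


def top_level_type(alt: str) -> str:
    """Strip the angle brackets and drop any subtype: <DUP:TANDEM> -> DUP."""
    return alt[1:-1].split(":")[0]


def per_allele_svtype(alt_field: str) -> List[str]:
    """Hypothetical helper: one SVTYPE value per ALT allele, in ALT order.

    Non-symbolic alleles (SNPs, short indels) get the missing value "."
    so that list indices still line up with the ALT allele indices.
    """
    return [
        top_level_type(alt) if is_symbolic(alt) else "."
        for alt in alt_field.split(",")
    ]


# The three ALT columns from the examples above.
for alt_column in ("<CNV:1>,<CNV:2>", "<INS>,<CNV>", "AT,<CNV>"):
    print(alt_column, "->", ",".join(per_allele_svtype(alt_column)))
```

Keeping the `.` placeholder means the value for allele index i is always element i-1 of the list, which is the index-matching concern raised for the `AT,<CNV>` case.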
With regards to the example
1 1000 A
In this case, I think the header has to defined
Is it possible to allow ALT symbolic types to contain a quantitative value, so that a single entry in the header would cover all possible copy numbers?
This is not just useful for CNVs, but also for Tandem Repeats.
The problem with the spec-defined SV fields isn't limited to SVTYPE: IMPRECISE, END, CIPOS, CIEND, and CILEN all have the same problem, so multi-allelic SV variants currently cannot be unambiguously represented using the spec-defined VCF headers (see #133).
General point of interest: One alternative solution is to list incompatible / problematic alleles in successive VCF records. The spec states:
POS - position: The reference position, with the 1st base having position 1. Positions are sorted numerically, in increasing order, within each reference sequence CHROM. It is permitted to have multiple records with the same POS. Telomeres are indicated by using positions 0 or N+1, where N is the length of the corresponding chromosome or contig. (Integer, Required)
This won't (cleanly) solve the problem of trying to express a genotype consisting of two non-reference alleles with the same POS, but it bears keeping in mind.
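As a rough illustration of the ordering rule quoted above, the following Python sketch checks that POS never decreases within a CHROM while still accepting successive records at the same POS. The function name and the `(chrom, pos)` tuple representation are assumptions made for this example; it does not check that records for one contig are contiguous in the file.

```python
from typing import Dict, Iterable, Tuple


def positions_sorted(records: Iterable[Tuple[str, int]]) -> bool:
    """Hypothetical check: POS is non-decreasing within each CHROM.

    Equal positions are accepted, since the spec permits multiple
    records with the same POS. Position 0 (telomere) is also accepted.
    """
    last_pos: Dict[str, int] = {}
    for chrom, pos in records:
        if pos < last_pos.get(chrom, 0):
            return False
        last_pos[chrom] = pos
    return True


# Two records at 1:1000 (e.g. a SNP and an SV split across records) are fine.
print(positions_sorted([("1", 1000), ("1", 1000), ("1", 1001)]))
# A record that goes backwards on the same CHROM is not.
print(positions_sorted([("1", 1000), ("1", 999)]))
```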
The CNV:num type may not always be practical. Imagine a site with highly variable copy number: you'd have to list every encountered case in ALT. ALT should only state that this is a copy number variation, and a custom field, such as FORMAT/CN, can give the actual copy number.
Unfortunately, even if inconvenient, it's something people are already using. To continue with the 1kG example, I have extracted this variant from Phase 3 v3:
7 82951 esv3611748;esv3611749 G <CNV:0>,<CNV:2>
All the CNV combinations are properly declared in the header. There are no limitations from the specification itself to use just numbers in an ALT ID "sub-value". So at the moment there are at least two valid ways to represent this case, which can be a problem for further file processing.
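A downstream tool that wants to handle both representations would need something like the following Python sketch, which pulls the number out of a `<CNV:n>` allele and returns `None` for a plain `<CNV>` (where the copy number would have to come from another field such as FORMAT/CN). The function name is hypothetical, and the pattern only covers the `<CNV:n>` form shown in the example above.

```python
import re
from typing import Optional

# Matches symbolic alleles of the form <CNV:n> with a non-negative integer n.
_CNV_WITH_NUMBER = re.compile(r"^<CNV:(\d+)>$")


def copy_number_from_alt(alt: str) -> Optional[int]:
    """Hypothetical helper: return n for <CNV:n>, otherwise None.

    None covers both a plain <CNV> allele and any non-CNV allele, so the
    caller has to look elsewhere (e.g. a per-sample field) for the number.
    """
    match = _CNV_WITH_NUMBER.match(alt)
    return int(match.group(1)) if match else None


# ALT column of the Phase 3 record quoted above.
for allele in "<CNV:0>,<CNV:2>".split(","):
    print(allele, copy_number_from_alt(allele))
```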
Clarification:
The VCF I have seen for 1000 Genomes copy number variants lists SVTYPE as "CNV" but ALT alleles as "
@thefferon -
The SVTYPE field in the INFO column is currently described as:
https://github.com/samtools/hts-specs/pull/130/commits/d2782545da1311ad615e4c6dff6fd2417f92c8a5 brings the possibility of using other types (requested in https://github.com/samtools/hts-specs/issues/89#issuecomment-112119514) but at least the following issues require discussion: