Closed arq5x closed 9 years ago
I have to deal with this DP issue in my review last year. It was a big headache. I had to treat the VCF generated by each SNP caller as a different format and parses it conditionally (see this code block). This means that script won't work with new SNP callers or when the existing SNP callers change their tags.
As to the proposal, I also prefer to get counts on both strands. Nonetheless, for strand-specific counts, we'd better not reuse AD
. Adding ";" will also turn AD from an array to a string, making it less efficient. How about adding BC
(base counts), which gives an array of forward and reverse counts, in the order of #refAlleleFor, #refAlleleRev, #allele1For, #allele1Rev, ...? Alternatively, we can separate BC
to BCF
and BCR
for forward/reverse counts only?
Base counts sound specific to SNPs, why not AD, ADF and ADR?
@lh3 - yep, that sounds good, but I do agree with @atks that base counts is too specific for general alleles. How about ADS
"allele depths, stranded" or SAD
"stranded allele depths"?
I agree that "BC" is not a good name. I have no preference over ADF+ADR vs ADS. I'd just like to see SNP callers all use the same tag(s) to report read depth.
I think you should keep the forward and reverse counts in separate tags so you can use Number=A
or Number=R
in the header rather than having things of length 2A
or 2R
. As another datapoint, samtools uses DPR
for depth per-allele including reference (R for Number=R
).
Good point, @mcshane. So, you propose keep AD
as is (i.e., Number=R
, comma-separated list of total depth for each allele). Plus, establiish new Number=R
tags for depths on the forward (ADF
) and reverse (ADR
) strands? I like that.
Agree AD
/ADF
/ADR
reads better than DPR
. These fields tend to blow out the size of VCF/BCFs quite significantly for large numbers of samples.
Obviously from the GATK perspective, keeping AD the same is the easiest for us. We don't currently use the forward and reverse allele depths, but users might find that informative.
So, the current proposal is to reserve either the following format keys:
##FORMAT=<ID=AD,Number=R,Type=Float,Description="Allele dosage">
##FORMAT=<ID=ADS,Number=R,Type=Float,Description="Allele dosage on fwd and rev strand">
or these
##FORMAT=<ID=AD,Number=R,Type=Float,Description="Allele dosage">
##FORMAT=<ID=ADF,Number=R,Type=Float,Description="Allele dosage on fwd strand">
##FORMAT=<ID=ADR,Number=R,Type=Float,Description="Allele dosage on rev strand">
From the two variants I prefer the first because it is more space-efficient and if a stranded version is needed, one always wants to see the depth on both strands.
EDIT:
@pd3: I'm assuming you mean "Allele dosage on rev strand"
Also, could you explain how ADS would work since we would need 2R floats for it.
Would be be enough to have
strand">
and infer the rev count from the difference? (if we want to be space efficient)
On Tue, May 19, 2015 at 9:11 AM, pd3 notifications@github.com wrote:
So, the current proposal is to reserve either the following format keys:
FORMAT=
FORMAT=
or these
FORMAT=
FORMAT=
FORMAT=
From the two variants I prefer the first because it is more space-efficient and if a stranded version is needed, one always wants to see the depth on both strands.
— Reply to this email directly or view it on GitHub https://github.com/samtools/hts-specs/issues/78#issuecomment-103479620.
I did not read the thread above properly, it seemed to suggest that @mcshane was proposing the ADS tag by referencing it from an unrelated pull request. I also missed the Number=2R concern. Since the specification has only Number=R fields, I agree it is practical to go with ADF and ADR tags so that allele trimming and sanity checking works.
The proposal would be for something like
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths">
##FORMAT=<ID=ADF,Number=R,Type=Integer,Description="Allelic depths on the forward strand">
##FORMAT=<ID=ADR,Number=R,Type=Integer,Description="Allelic depths on the reverse strand">
+1
Can we extend this proposal to include the total depths in INFO similarly to FORMAT?
##INFO=<ID=AD,Number=R,Type=Integer,Description="Total allelic depths">
##INFO=<ID=ADF,Number=R,Type=Integer,Description="Total allelic depths on the forward strand">
##INFO=<ID=ADR,Number=R,Type=Integer,Description="Total allelic depths on the reverse strand">
+1
+1 to both INFO and FORMAT.
+1 to both INFO and FORMAT.
Agreed.
+1 for FORMAT. not sure for INFO.
@atks samtools already uses two flavours of this annotation (INFO/DP4,DPR), I think it makes sense to standardize both the FORMAT and the INFO columns.
I believe there is a consensus about this, so I am closing this thread. Please reopen if not.
I am not able to implement this easily due to the fact that Number=R
is not something that can be dumped into tabular format. It also breaks filtering.
So I am reticent. But there may be a way to resolve it.
Wish I had seen this earlier. Freebayes uses QA, QR, AO, and RO for quality sums and allele depth.
Well it looks like there is strong consensus about this :)
So what about quality sums? AQ?
The current VCF specification does a very admirable job of defining the standards by which variant callers should report genotypes (GT), genotype likelihoods (GL), and overall observed depth (DP) for each sample at a given variant site.
However, in my opinion, the specification is lacking in that it fails to define a standard for reporting the sequencing depths observed for each sample and allele. This lack of specificity has allowed variant callers to each define their own convention. Consequently, downstream tools are left to write custom code that attempts to handle the rules defined by each variant caller.
For instance, GATK uses
AD
, which reports allele depths as a comma-separated list (173 for the reference and 141 for the alternate in this example:Freebayes, however uses two different tags,
RO
andAO
for reference and alternate depth, respectively.Other tools such as VARSCAN and Platypus employ yet other conventions.
Given that the inspection of allele depths are are crucial to quality control and data interpretation, I think it would be best to define a standard for this. In my opinion, the GATK convention makes most sense, as it works for single or multiple alternate alleles, so long as the reference allele is always reported first.
The standard could allow support for either reporting the total depths for each allele:
or optionally allow another delimiter (
;
?) defining the depth on each strand (positive first) for each allele:It would be great if we could make progress towards a standard here, as it would prevent hideous code such as: https://github.com/arq5x/cyvcf/blob/master/cyvcf/parser.pyx#L264-L295
Best, Aaron Quinlan USTAR Center for Genetic Discovery Department of Human Genetics and Biomedical Informatics University of Utah