samtools / hts-specs

Specifications of SAM/BAM and related high-throughput sequencing file formats
http://samtools.github.io/hts-specs/
642 stars 174 forks source link

Enforcing a standard for reporting REF and ALT allele depths #78

Closed arq5x closed 9 years ago

arq5x commented 9 years ago

The current VCF specification does a very admirable job of defining the standards by which variant callers should report genotypes (GT), genotype likelihoods (GL), and overall observed depth (DP) for each sample at a given variant site.

However, in my opinion, the specification is lacking in that it fails to define a standard for reporting the sequencing depths observed for each sample and allele. This lack of specificity has allowed variant callers to each define their own convention. Consequently, downstream tools are left to write custom code that attempts to handle the rules defined by each variant caller.

For instance, GATK uses AD, which reports allele depths as a comma-separated list (173 for the reference and 141 for the alternate in this example:

chr1    873762  .       T   G   100 .    AC=1;AF=0.50;DP=315 GT:AD:DP:GQ:PL   0/1:173,141:282:99:255,0,255

Freebayes, however uses two different tags, RO and AO for reference and alternate depth, respectively.

GT:GQ:DP:RO:QR:AO:QA:GL 0/0:0:5:5:132:0:0:0,-0.32169,-4.34301

Other tools such as VARSCAN and Platypus employ yet other conventions.

Given that the inspection of allele depths are are crucial to quality control and data interpretation, I think it would be best to define a standard for this. In my opinion, the GATK convention makes most sense, as it works for single or multiple alternate alleles, so long as the reference allele is always reported first.

The standard could allow support for either reporting the total depths for each allele:

GT:AD  0/1:173,141

or optionally allow another delimiter (; ?) defining the depth on each strand (positive first) for each allele:

GT:AD  0/1:100;73,70;71

It would be great if we could make progress towards a standard here, as it would prevent hideous code such as: https://github.com/arq5x/cyvcf/blob/master/cyvcf/parser.pyx#L264-L295

Best, Aaron Quinlan USTAR Center for Genetic Discovery Department of Human Genetics and Biomedical Informatics University of Utah

lh3 commented 9 years ago

I have to deal with this DP issue in my review last year. It was a big headache. I had to treat the VCF generated by each SNP caller as a different format and parses it conditionally (see this code block). This means that script won't work with new SNP callers or when the existing SNP callers change their tags.

As to the proposal, I also prefer to get counts on both strands. Nonetheless, for strand-specific counts, we'd better not reuse AD. Adding ";" will also turn AD from an array to a string, making it less efficient. How about adding BC (base counts), which gives an array of forward and reverse counts, in the order of #refAlleleFor, #refAlleleRev, #allele1For, #allele1Rev, ...? Alternatively, we can separate BC to BCF and BCR for forward/reverse counts only?

atks commented 9 years ago

Base counts sound specific to SNPs, why not AD, ADF and ADR?

arq5x commented 9 years ago

@lh3 - yep, that sounds good, but I do agree with @atks that base counts is too specific for general alleles. How about ADS "allele depths, stranded" or SAD "stranded allele depths"?

lh3 commented 9 years ago

I agree that "BC" is not a good name. I have no preference over ADF+ADR vs ADS. I'd just like to see SNP callers all use the same tag(s) to report read depth.

mcshane commented 9 years ago

I think you should keep the forward and reverse counts in separate tags so you can use Number=A or Number=R in the header rather than having things of length 2A or 2R. As another datapoint, samtools uses DPR for depth per-allele including reference (R for Number=R).

arq5x commented 9 years ago

Good point, @mcshane. So, you propose keep AD as is (i.e., Number=R, comma-separated list of total depth for each allele). Plus, establiish new Number=R tags for depths on the forward (ADF) and reverse (ADR) strands? I like that.

mcshane commented 9 years ago

Agree AD/ADF/ADR reads better than DPR. These fields tend to blow out the size of VCF/BCFs quite significantly for large numbers of samples.

ldgauthier commented 9 years ago

Obviously from the GATK perspective, keeping AD the same is the easiest for us. We don't currently use the forward and reverse allele depths, but users might find that informative.

pd3 commented 9 years ago

So, the current proposal is to reserve either the following format keys:

##FORMAT=<ID=AD,Number=R,Type=Float,Description="Allele dosage">
##FORMAT=<ID=ADS,Number=R,Type=Float,Description="Allele dosage on fwd and rev strand">

or these

##FORMAT=<ID=AD,Number=R,Type=Float,Description="Allele dosage">
##FORMAT=<ID=ADF,Number=R,Type=Float,Description="Allele dosage on fwd strand">
##FORMAT=<ID=ADR,Number=R,Type=Float,Description="Allele dosage on rev strand">

From the two variants I prefer the first because it is more space-efficient and if a stranded version is needed, one always wants to see the depth on both strands.

EDIT:

yfarjoun commented 9 years ago

@pd3: I'm assuming you mean "Allele dosage on rev strand"

Also, could you explain how ADS would work since we would need 2R floats for it.

Would be be enough to have

FORMAT=

FORMAT=<ID=ADF,Number=R,Type=Float,Description="Allele dosage on fwd

strand">

and infer the rev count from the difference? (if we want to be space efficient)

On Tue, May 19, 2015 at 9:11 AM, pd3 notifications@github.com wrote:

So, the current proposal is to reserve either the following format keys:

FORMAT=

FORMAT=

or these

FORMAT=

FORMAT=

FORMAT=

From the two variants I prefer the first because it is more space-efficient and if a stranded version is needed, one always wants to see the depth on both strands.

— Reply to this email directly or view it on GitHub https://github.com/samtools/hts-specs/issues/78#issuecomment-103479620.

pd3 commented 9 years ago

I did not read the thread above properly, it seemed to suggest that @mcshane was proposing the ADS tag by referencing it from an unrelated pull request. I also missed the Number=2R concern. Since the specification has only Number=R fields, I agree it is practical to go with ADF and ADR tags so that allele trimming and sanity checking works.

mcshane commented 9 years ago

The proposal would be for something like

##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths">
##FORMAT=<ID=ADF,Number=R,Type=Integer,Description="Allelic depths on the forward strand">
##FORMAT=<ID=ADR,Number=R,Type=Integer,Description="Allelic depths on the reverse strand">
ldgauthier commented 9 years ago

+1

pd3 commented 9 years ago

Can we extend this proposal to include the total depths in INFO similarly to FORMAT?

##INFO=<ID=AD,Number=R,Type=Integer,Description="Total allelic depths">
##INFO=<ID=ADF,Number=R,Type=Integer,Description="Total allelic depths on the forward strand">
##INFO=<ID=ADR,Number=R,Type=Integer,Description="Total allelic depths on the reverse strand">
eitanbanks commented 9 years ago

+1

lh3 commented 9 years ago

+1 to both INFO and FORMAT.

arq5x commented 9 years ago
+1 to both INFO and FORMAT.

Agreed.

atks commented 9 years ago

+1 for FORMAT. not sure for INFO.

pd3 commented 9 years ago

@atks samtools already uses two flavours of this annotation (INFO/DP4,DPR), I think it makes sense to standardize both the FORMAT and the INFO columns.

I believe there is a consensus about this, so I am closing this thread. Please reopen if not.

ekg commented 9 years ago

I am not able to implement this easily due to the fact that Number=R is not something that can be dumped into tabular format. It also breaks filtering.

So I am reticent. But there may be a way to resolve it.

Wish I had seen this earlier. Freebayes uses QA, QR, AO, and RO for quality sums and allele depth.

ekg commented 9 years ago

Well it looks like there is strong consensus about this :)

So what about quality sums? AQ?