varlociraptor / varlociraptor

Flexible, uncertainty-aware variant calling with parameter free filtration via FDR control.
https://varlociraptor.github.io
GNU General Public License v3.0
118 stars 16 forks source link

calculate and emit `Event` likelihoods for sites without candidate alt allele #18

Closed dlaehnemann closed 6 years ago

dlaehnemann commented 6 years ago

To get an unbiased evaluation of libprosic-based calling of single nucleotide variant Events, these should not only be called where the candidates provide a possible alternative allele, but also at sites that are listed in the candidates file, but where no alternative allele is listed.

From what I see, this would require explicitly parsing those sites in the utils.rs function collect_variants() with some useful base/nucleotide provided to the SNV type (ambiguity code for any other nucleotide than the reference one at that site?) and in model/evidence/reads.rs the function prob_read_base() would then need to handle that base/nucleotide.

Does this make sense?

johanneskoester commented 6 years ago

I wam not yet sure that I understand why that would be needed. Could you explain why?

dlaehnemann commented 6 years ago

Sure, my current use case is:

I have a candidate.bcffile that contains single nucleotide variant / genotype calls. In addition to the common calls where at least one alternative allele is present, I also have a lot of sites where no candidate alternative allele is present in the candidate.bcf, e.g.:

chr1    65446   .       C       <*>     0       .       DP=223;AC=0;AN=2       GT      0/0

As I am not only interested in genotypes at sites with an alternative allele present, but also in genotypes at sites where no alternative allele was present in the set used for candidate generation--in my case a gold standard dataset separate from the actual samples genotyped. So I would like to just run the likelihood model on these sites, as well, to get a benchmarking of the genotyping capabilities that is not biased against homozygous reference sites. Does this make more sense?

And as we have no alternative allele at those sites, we need to create one when we parse these sites from the BCF. To me, a dedicated nucleotide Type based on the IUPAC ambiguity code seems like the cleanest solution--we would just need to create the respective equality operation on this type in rust-bio, probably here: https://github.com/rust-bio/rust-bio/blob/dba57123d158e23cc74de3335efab9fbc0309f1a/src/alphabets/dna.rs

I think this should work without any great overhead in parsing and computation?!

As an alternative, we could just generate a random non-reference genotype that we then calculate the likelihoods against, but we might loose read information / Evidence if we have different deviating reads at a site -- and considering all alternative nucleotides bundled up should give a conservative likelihood for the homozygous reference genotype, which sounds preferable to me.

johanneskoester commented 6 years ago

Mhm, I see. The "clean" solution to me would be to have a fourth variant type called Ref, without any enclosed data. Would that make sense? I think the modifications would only affect samples.rs and the changes should be minimal (almost just a | Variant::Ref in all match expressions).

dlaehnemann commented 6 years ago

Ah, OK, yeah, that makes sense, as it will also work for indel data. I'll think about it a bit more and come back with some suggestions what needs to be changed.

dlaehnemann commented 6 years ago

A suggestion of how to implement this, is now in branch: https://github.com/PROSIC/libprosic/tree/calls-for-hom-ref-sites

Should I create a pull-request to track further discussion of the actual code?

johanneskoester commented 6 years ago

Sounds good!