Closed dlaehnemann closed 6 years ago
I wam not yet sure that I understand why that would be needed. Could you explain why?
Sure, my current use case is:
I have a candidate.bcf
file that contains single nucleotide variant / genotype calls. In addition to the common calls where at least one alternative allele is present, I also have a lot of sites where no candidate alternative allele is present in the candidate.bcf
, e.g.:
chr1 65446 . C <*> 0 . DP=223;AC=0;AN=2 GT 0/0
As I am not only interested in genotypes at sites with an alternative allele present, but also in genotypes at sites where no alternative allele was present in the set used for candidate generation--in my case a gold standard dataset separate from the actual samples genotyped. So I would like to just run the likelihood model on these sites, as well, to get a benchmarking of the genotyping capabilities that is not biased against homozygous reference sites. Does this make more sense?
And as we have no alternative allele at those sites, we need to create one when we parse these sites from the BCF. To me, a dedicated nucleotide Type based on the IUPAC ambiguity code seems like the cleanest solution--we would just need to create the respective equality operation on this type in rust-bio
, probably here:
https://github.com/rust-bio/rust-bio/blob/dba57123d158e23cc74de3335efab9fbc0309f1a/src/alphabets/dna.rs
I think this should work without any great overhead in parsing and computation?!
As an alternative, we could just generate a random non-reference genotype that we then calculate the likelihoods against, but we might loose read information / Evidence
if we have different deviating reads at a site -- and considering all alternative nucleotides bundled up should give a conservative likelihood for the homozygous reference genotype, which sounds preferable to me.
Mhm, I see. The "clean" solution to me would be to have a fourth variant type called Ref
, without any enclosed data. Would that make sense? I think the modifications would only affect samples.rs
and the changes should be minimal (almost just a | Variant::Ref
in all match expressions).
Ah, OK, yeah, that makes sense, as it will also work for indel data. I'll think about it a bit more and come back with some suggestions what needs to be changed.
A suggestion of how to implement this, is now in branch: https://github.com/PROSIC/libprosic/tree/calls-for-hom-ref-sites
Should I create a pull-request to track further discussion of the actual code?
Sounds good!
To get an unbiased evaluation of libprosic-based calling of single nucleotide variant
Event
s, these should not only be called where the candidates provide a possible alternative allele, but also at sites that are listed in the candidates file, but where no alternative allele is listed.From what I see, this would require explicitly parsing those sites in the
utils.rs
functioncollect_variants()
with some useful base/nucleotide provided to the SNV type (ambiguity code for any other nucleotide than the reference one at that site?) and inmodel/evidence/reads.rs
the functionprob_read_base()
would then need to handle that base/nucleotide.Does this make sense?