Open mparker2 opened 4 months ago
maybe it would be better to use ref in this case, and reserve missing for SNPs in "merisyntenic" (am I using that right? 😄) regions for accessions which are not alignable in that region?
Hm, the VCF filtering was implemented quite a while ago, don't remember all our thought processes clearly.
I think the behaviour right now is to just copy across the small variant records present for a sample, and fill any sample for which that record does not exist with .
, because we don't have the call information for it.
But yes, as we have assemblies we can be reasonably sure that as long as it aligns if there were a non-ref allele it would be called, so could impute it to have the reference genotype.
Actually think that would make sense, at least as behaviour that can be enabled via a flag. Would that be useful to you? Then I can see about implementing something, shouldn't be much effort.
The general bioinfo software community might move away from .
completely : https://gatk.broadinstitute.org/hc/en-us/articles/6012243429531-GenotypeGVCFs-and-the-death-of-the-dot
Yes I think it would be useful. Currently I am just using msyd to identify variants in core syntenic regions and then using bcftools to fill in the missing genotypes with the reference allele, which is easy enough. But it would not make sense to impute the reference allele outside core-syntenic regions, where the sequence is actually truly absent from some accessions
I think if you were to stop using "." entirely like in the post Manish linked to, then an extra format field would be required to distinguish true reference alleles from non-syntenic variants. I'd be inclined to keep the "." for the latter though 😃
haha some greats puns in there :D I think the VCF people are in general looking to change some stuff to make VCFs scale better for many samples. But TBH I would stay with the current standard until that stuff is standardized and well established, also because the way SVs are encoded might be changed.
But it would not make sense to impute the reference allele outside core-syntenic regions, where the sequence is actually truly absent from some accessions
Yes, I think being able to differentiate syntenic from non-syntenic SNPs is the main advantage of the VCF-filtering of msyd! I think in principle it can be imputed for any merisyntenic regions including the reference (in non-ref haplotypes we don't currently call small variants). Would you think it makes more sense to have one switch enabling this behaviour for all ref-syntenic regions, or a separate one for coresyn and ref-merisyn?
to me it makes more sense to have the one option but I'll let you decide! 😄 thank you both for the work on msyd, I'm finding it very useful!
Have something that works for coresyn regions up on the main branch (which I think was your usecase anyway @mparker2 ?).
For merisyntenic regions, the imputation should be possible with the same principle, but is causing some issues – also, the code could do with some cleanup since its now >1y old. Will be looking into that in the next days over on the vcf
branch, as that might involve some breaking changes I don't want to have on main.
Ah, just noticed I typed this a while ago ago but never pressed send:
Have implemented it in a way that hopefully makes sense, by copying from the reference assigned to that multisyn record for all syntenic orgs when there is no data – that means missing data should be kept if it is explicitly specified, if i am thinking this through correctly.
To enable, the --impute
flag can be passed.
The changes are up in the main branch now and seemed to work fine for me when testing – feel free to play around wtih them :D
@lrauschning a more conceptual question for you... was there a conscious decision to have samples that match the reference in a core syntenic region labelled as missing rather than as ref? For example:
In this region where all 5 genomes are syntenic but rubezhnoe1 has three SNPs, can't the other accessions be marked as having the reference allele using "0", rather than using "." which is usually used to signify insufficient information to make a call?
BW Matt