why are ref alleles marked as missing?

mparker2 commented 4 months ago

@lrauschning a more conceptual question for you... was there a conscious decision to have samples that match the reference in a core syntenic region labelled as missing rather than as ref? For example:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  ms0 rubezhnoe1  kar1    db1 tsu0
Chr1    531 CORESYN1    A   <CORESYN>   .   .   END=426900  SYN:CHR:START:END:AI    1:Chr1:5472:430740:99   1:Chr1:1970:424373:97   1:Chr1:4349:436442:97   1:Chr1:1594:431816:98   1:Chr1:3807:456551:92
Chr1    575 SNP1300 C   T   .   .   PID=1   GT:CHR:START:END    .:....:.:.  1:Chr1:2014:2014    .:....:.:.  .:....:.:.  .:....:.:.
Chr1    597 SNP1301 C   T   .   .   PID=1   GT:CHR:START:END    .:....:.:.  1:Chr1:2036:2036    .:....:.:.  .:....:.:.  .:....:.:.
Chr1    603 SNP1302 C   A   .   .   PID=1   GT:CHR:START:END    .:....:.:.  1:Chr1:2042:2042    .:....:.:.  .:....:.:.  .:....:.:.

In this region where all 5 genomes are syntenic but rubezhnoe1 has three SNPs, can't the other accessions be marked as having the reference allele using "0", rather than using "." which is usually used to signify insufficient information to make a call?

BW Matt

mparker2 commented 4 months ago

maybe it would be better to use ref in this case, and reserve missing for SNPs in "merisyntenic" (am I using that right? 😄) regions for accessions which are not alignable in that region?

lrauschning commented 4 months ago

Hm, the VCF filtering was implemented quite a while ago, don't remember all our thought processes clearly. I think the behaviour right now is to just copy across the small variant records present for a sample, and fill any sample for which that record does not exist with ., because we don't have the call information for it. But yes, as we have assemblies we can be reasonably sure that as long as it aligns if there were a non-ref allele it would be called, so could impute it to have the reference genotype.

lrauschning commented 4 months ago

Actually think that would make sense, at least as behaviour that can be enabled via a flag. Would that be useful to you? Then I can see about implementing something, shouldn't be much effort.

mnshgl0110 commented 4 months ago

The general bioinfo software community might move away from . completely : https://gatk.broadinstitute.org/hc/en-us/articles/6012243429531-GenotypeGVCFs-and-the-death-of-the-dot

mparker2 commented 4 months ago

Yes I think it would be useful. Currently I am just using msyd to identify variants in core syntenic regions and then using bcftools to fill in the missing genotypes with the reference allele, which is easy enough. But it would not make sense to impute the reference allele outside core-syntenic regions, where the sequence is actually truly absent from some accessions

mparker2 commented 4 months ago

I think if you were to stop using "." entirely like in the post Manish linked to, then an extra format field would be required to distinguish true reference alleles from non-syntenic variants. I'd be inclined to keep the "." for the latter though 😃

lrauschning commented 4 months ago

https://gatk.broadinstitute.org/hc/en-us/articles/6012243429531-GenotypeGVCFs-and-the-death-of-the-dot

haha some greats puns in there :D I think the VCF people are in general looking to change some stuff to make VCFs scale better for many samples. But TBH I would stay with the current standard until that stuff is standardized and well established, also because the way SVs are encoded might be changed.

But it would not make sense to impute the reference allele outside core-syntenic regions, where the sequence is actually truly absent from some accessions

Yes, I think being able to differentiate syntenic from non-syntenic SNPs is the main advantage of the VCF-filtering of msyd! I think in principle it can be imputed for any merisyntenic regions including the reference (in non-ref haplotypes we don't currently call small variants). Would you think it makes more sense to have one switch enabling this behaviour for all ref-syntenic regions, or a separate one for coresyn and ref-merisyn?

mparker2 commented 4 months ago

to me it makes more sense to have the one option but I'll let you decide! 😄 thank you both for the work on msyd, I'm finding it very useful!

lrauschning commented 3 months ago

Have something that works for coresyn regions up on the main branch (which I think was your usecase anyway @mparker2 ?). For merisyntenic regions, the imputation should be possible with the same principle, but is causing some issues – also, the code could do with some cleanup since its now >1y old. Will be looking into that in the next days over on the vcf branch, as that might involve some breaking changes I don't want to have on main.

lrauschning commented 2 months ago

Ah, just noticed I typed this a while ago ago but never pressed send: Have implemented it in a way that hopefully makes sense, by copying from the reference assigned to that multisyn record for all syntenic orgs when there is no data – that means missing data should be kept if it is explicitly specified, if i am thinking this through correctly. To enable, the --impute flag can be passed.

The changes are up in the main branch now and seemed to work fine for me when testing – feel free to play around wtih them :D

schneebergerlab / msyd

why are ref alleles marked as missing? #12