schneebergerlab / msyd

MIT License

HDR regions not included in coresyn calls, but still present in filtered VCF #15

Open mparker2 opened 2 months ago

mparker2 commented 2 months ago

Hi @lrauschning,

when I run msyd call in --core mode on some potato haplotypes, I get a nice PFF file of the coresyn regions and a merged VCF for the same coresyn. The SNPs and indels which do not overlap coresyn regions are nicely filtered out, which is exactly what I want.

Msyd breaks the coresyn regions on HDRs. I don't know if @mnshgl0110 would agree with this, since HDRs are considered as part of a syntenic region by syri, but it works quite nicely for me. I want to get rid of them for my current analysis. However, they are not filtered out in the VCF. This doesn't seem quite correct... imo either the PFF should include the HDR regions as coresyn (and the VCF include them), or they should be filtered out of the VCF.

What is your opinion?

lrauschning commented 2 months ago

Hiya, nice to see the VCF filtering (mostly) works! The reason HDR regions break coresyn is that they do not have corresponding SYNAL annotations, which is what (the current iteration of) msyd works on so that we can have exact alignments with basepair precision. I think the HDRs might be retained because they start right before/after a coresyn region, and we fetch all variants intersecting a multisyn (end-inclusive) for reporting in the merged VCF. If you just want the SNPs, not passing -x/--complex should filter out any VCF records with symbolic alleles, incl. HDRs. Other than that, grep -v HDR would be a quick workaround. Might be worth adding a CLI option to restrict merging to records strictly within a multisyn, though.
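To illustrate the difference between fetching records that intersect a region (end-inclusive) and keeping only records contained in it, here is a minimal sketch. It is not msyd's actual code: the `Interval` type, both helper functions, and all coordinates are made up for illustration.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    """Hypothetical closed interval; both ends are inclusive."""
    start: int
    end: int


def intersects(rec: Interval, region: Interval) -> bool:
    # End-inclusive overlap: sharing a single base is enough.
    return rec.start <= region.end and rec.end >= region.start


def contained_in(rec: Interval, region: Interval) -> bool:
    # The record must lie entirely inside the region.
    return region.start <= rec.start and rec.end <= region.end


# Made-up coordinates: a coresyn region, a SNP inside it, and an HDR
# record that starts on the last base of the region.
coresyn = Interval(1000, 2000)
snp = Interval(1500, 1500)
hdr = Interval(2000, 2600)

for name, rec in (("snp", snp), ("hdr", hdr)):
    print(name, intersects(rec, coresyn), contained_in(rec, coresyn))
```

Under these assumptions the SNP passes both checks, while the HDR record passes only the intersection check, which would explain why it ends up in the merged VCF.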

mparker2 commented 2 months ago

That makes sense. Relying on --complex is not a foolproof solution, though, because syri can write VCFs with either symbolic or full-sequence alleles (using --hdrseq).
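A small sketch of why a symbolic-allele check alone would miss such records. It only assumes the VCF convention that symbolic ALT alleles are written in angle brackets. The two example records are made up and are not real syri output, and `is_symbolic` and `has_symbolic_allele` are hypothetical helpers, not part of msyd or syri.

```python
def is_symbolic(alt: str) -> bool:
    # Symbolic ALT alleles are angle-bracketed IDs, e.g. <HDR>.
    return alt.startswith("<") and alt.endswith(">")


def has_symbolic_allele(vcf_line: str) -> bool:
    # ALT is the fifth tab-separated column; multiple alleles are comma-separated.
    fields = vcf_line.rstrip("\n").split("\t")
    return any(is_symbolic(alt) for alt in fields[4].split(","))


# Made-up records for illustration only.
symbolic_hdr = "chr01\t2000\tHDR1\tN\t<HDR>\t.\tPASS\tEND=2600"
sequence_hdr = "chr01\t2000\tHDR1\tACGTAC\tATTTGC\t.\tPASS\tEND=2005"

print(has_symbolic_allele(symbolic_hdr))  # symbolic ALT is detected
print(has_symbolic_allele(sequence_hdr))  # full-sequence ALT is not
```

A filter built on this check drops the first record but keeps the second, even though both describe an HDR.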

lrauschning commented 2 months ago

Ah, true. Then I'll look into adding a CLI option for strictly contained records. I think filtering for specific types of records is probably best left to the user, though.