sgkit-dev / sgkit

Scalable genetics toolkit
https://sgkit-dev.github.io/sgkit
Apache License 2.0

Howto: join biallelic SNPs into a single record? #1179

Open hyanwong opened 10 months ago

hyanwong commented 10 months ago

Can I add a question to the How-to guide: how do you perform the equivalent of bcftools norm --multiallelics on a VCF stored in sgkit? Is this even possible? In particular, we have VCFs in which multiallelic SNPs have been split into multiple sites all at the same position (yuck!), and it would be great to be able to get them back into a sane state without having to go through the VCF pipeline multiple times.

I don't know if this is a reasonable thing to want to do in sgkit, however. Here's the quote from the bcftools docs:

-m, --multiallelics -|+[snps|indels|both|any] split multiallelic sites into biallelic records (-) or join biallelic sites into multiallelic records (+). An optional type string can follow which controls variant types which should be split or merged together: If only SNP records should be split or merged, specify snps; if both SNPs and indels should be merged separately into two records, specify both; if SNPs and indels should be merged into a single record, specify any.

jeromekelleher commented 10 months ago

This is more than a question, I think: we would need to code this up explicitly as part of the library. It's not a trivial operation.
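
To give a feel for what the join involves, here is a minimal sketch in plain NumPy. It is not sgkit API and not the bcftools algorithm: the function name `join_biallelic` and its inputs are hypothetical, genotypes are assumed to be a `(variants, samples, ploidy)` array with 0 = REF, 1 = ALT and -1 = missing, and records are merged only when they share both position and REF. It also assumes the split records keep each allele in the same genotype slot, which is exactly the kind of detail that makes the real operation non-trivial.

```python
import numpy as np


def join_biallelic(positions, alleles, genotypes):
    """Merge biallelic records sharing (position, REF) into one record each.

    Hypothetical helper for illustration only.
    positions: 1D sequence of ints, one per record.
    alleles:   list of (ref, alt) string pairs, one per record.
    genotypes: int array (records, samples, ploidy); 0=REF, 1=ALT, -1=missing.
    """
    positions = np.asarray(positions)
    genotypes = np.asarray(genotypes)

    # Group record indices by (position, ref); dicts keep first-seen order.
    groups = {}
    for i, (pos, (ref, _alt)) in enumerate(zip(positions, alleles)):
        groups.setdefault((int(pos), ref), []).append(i)

    out_pos, out_alleles, out_gt = [], [], []
    for (pos, ref), idx in groups.items():
        merged_alleles = [ref]
        # Assumption: a call is missing only if it is missing in every
        # contributing record; otherwise it starts out as REF (0).
        merged = np.where((genotypes[idx] < 0).all(axis=0), -1, 0)
        for i in idx:
            alt = alleles[i][1]
            if alt not in merged_alleles:
                merged_alleles.append(alt)
            code = merged_alleles.index(alt)
            carries_alt = genotypes[i] == 1
            # Two different ALTs in the same genotype slot cannot be merged.
            if np.any(carries_alt & (merged > 0) & (merged != code)):
                raise ValueError(f"conflicting ALT calls at position {pos}")
            merged[carries_alt] = code
        out_pos.append(pos)
        out_alleles.append(merged_alleles)
        out_gt.append(merged)
    return np.array(out_pos), out_alleles, np.stack(out_gt)


# Two split records at position 100 (A>C and A>T) plus one ordinary site.
positions = [100, 100, 200]
alleles = [("A", "C"), ("A", "T"), ("G", "A")]
genotypes = np.array([
    [[0, 1], [1, 0], [-1, -1]],
    [[0, 0], [0, 1], [-1, -1]],
    [[1, 1], [0, 0], [0, 1]],
])
new_pos, new_alleles, new_gt = join_biallelic(positions, alleles, genotypes)
```

In this toy input the second sample carries C in one record and T in the other, so the merged site has alleles A, C, T and that sample becomes 1/2. Real data adds further complications that the sketch ignores: unphased calls whose slots do not line up across records, indels with differing REF alleles, and per-allele INFO/FORMAT fields that would need merging too.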