odelaneau / shapeit5

Segmented HAPlotype Estimation and Imputation Tool
https://odelaneau.github.io/shapeit5/
MIT License

SER measurement - merge multiallelic sites first? #66

Open JosephLalli opened 8 months ago

JosephLalli commented 8 months ago

When handling multiallelic sites, the best practice is to split them into biallelic records before phasing.

However, I'm not sure how to handle these sites when measuring switch error rate (SER). The posted tutorial seems to leave multiallelic sites as split biallelic records when measuring SER with shapeit5_switch. Is that what users should do when measuring SER in their own data sets?

SER = #switches / #hets. If a 1|2 heterozygous site is split and then erroneously phased, I'd think that is one switch error at one heterozygous site, not two errors (0|1 and 1|0) at two sites (see below for an illustration of what I mean). Thus poor performance at multiallelic sites (esp. sites with many alleles, like STRs) would artificially inflate the SER.

chr20 1000 A T,C 1/2

split ->
chr20 1000 A T 0/1
chr20 1000 A C 0/1

phase ->
chr20 1000 A T 0|1
chr20 1000 A C 1|0

merge multiallelics, preserve phasing ->
chr20 1000 A T,C 2|1
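To make the merge step above concrete, here is a minimal Python sketch. The helper names `merge_split_phased` and `is_het` are hypothetical (they are not part of SHAPEIT5 or any other tool), and the sketch assumes that all records passed in share the same chromosome, position and REF allele, and that each split record has a phased genotype of the form `0|1` or `1|0`. It only shows how the phased split records map back to one multiallelic genotype and how the number of heterozygous records changes; it says nothing about how shapeit5_switch itself counts switches.

```python
def merge_split_phased(records):
    """Merge split biallelic phased records at one site into one genotype.

    `records` is a list of (alt_allele, phased_gt) pairs, e.g. ("T", "0|1").
    Assumption: all records share the same chromosome, position and REF.
    Returns (comma-joined ALT alleles, merged phased genotype).
    """
    alts = []
    merged = [0, 0]  # allele index on haplotype 1 and haplotype 2; 0 = REF
    for alt_index, (alt, gt) in enumerate(records, start=1):
        alts.append(alt)
        for hap, allele in enumerate(gt.split("|")):
            if allele == "1":
                if merged[hap] != 0:
                    # Two ALT alleles on one haplotype cannot be merged.
                    raise ValueError(f"haplotype {hap + 1} carries two ALT alleles")
                merged[hap] = alt_index
    return ",".join(alts), "|".join(str(a) for a in merged)


def is_het(gt):
    """True if the two haplotypes of a phased genotype carry different alleles."""
    first, second = gt.split("|")
    return first != second


split_records = [("T", "0|1"), ("C", "1|0")]
alts, gt = merge_split_phased(split_records)
print(alts, gt)                                   # T,C 2|1
print(sum(is_het(g) for _, g in split_records))   # 2 het records when split
print(int(is_het(gt)))                            # 1 het record when merged
```

So the same site contributes two heterozygous records to the denominator when left split, and one when merged.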