Handling indels - Githubissues

PMuchina commented 3 weeks ago

How does QUILT handle indels? It seems indels are skipped when building the binary reference.

[2024-10-31 22:41:27] Program start [2024-10-31 22:41:27] Begin converting reference haplotypes [2024-10-31 22:41:27] Using reference information from:/usr/home/ironbank/researchdata/qgg/zexi/BSF/Peter/Chr1/QUILTv1.0.5/Chr1_reference_panel.bcf [2024-10-31 22:41:27] Using strategy of first imputing using common SNPs and then using all SNPs, with allele frequency threshold:0.001 [2024-10-31 22:41:27] Begin get sites and haplotypes from reference vcf [2024-10-31 23:07:55] End get sites and haplotypes from reference vcf [2024-10-31 23:07:55] There were 59759 skipped variants when processing the reference VCF (not bi-allelic, not a SNP, not unique position [2024-10-31 23:07:56] There are 0 common and 0 rare (0 total) variants in the left buffer region -499999 <= position < 1 [2024-10-31 23:07:56] There are 428384 common and 301 rare (428685 total) variants in the central region 1 <= position <= 4999987 [2024-10-31 23:07:56] There are 33651 common and 2 rare (33653 total) variants in the right buffer region 4999987 < position <= 5499987 [2024-10-31 23:07:56] There are 0 regions out of 14438 below minimum recombination rate, setting them to minimum rate [2024-10-31 23:07:56] There are 0 regions out of 14438 above maximum recombination rate, setting them to maximum rate [2024-10-31 23:08:06] Using nMaxDH = 114 [2024-10-31 23:08:11] Build mspbwt indices [2024-10-31 23:08:13] Done building mspbwt indices [2024-10-31 23:08:13] Save converted reference haplotypes [2024-10-31 23:08:14] Done converting reference haplotypes

PMuchina commented 3 weeks ago

The above is for QUILT2_prepare_reference.R. So I tried QUILT_prepare_reference.R, and I got the same. It seems STITCH only handles biallelic SNPs

[2024-11-04 23:57:59] Program start [2024-11-04 23:57:59] Begin converting reference haplotypes [2024-11-04 23:57:59] Using reference information from:Chr1.hap.gz and new_legend.gz [2024-11-04 23:58:35] Begin get haplotypes from reference [2024-11-04 23:58:35] Load and validate reference legend header [2024-11-04 23:58:35] Load and validate reference legend Error in validate_pos(decoy_legend, chr = chr, stop_file_name = paste0("The reference legend file ", : The reference legend file new_legend.gz ref column entry TTCCTCCATTTTCGTCATCATCAAATCACAGGCTATACACGCT in row 19 contains is not one or A, C, G or T. STITCH is only supported for bi-allelic SNPs Calls: QUILT_prepare_reference ... get_haplotypes_from_reference -> validate_reference_legend -> validate_pos Execution halted

Zilong-Li commented 3 weeks ago

Thanks @PMuchina,

Unfortunately, QUILT does not support imputing Indels at the moment. To my knowledge, Indels or SVs imputation is always challenging.

sounkou-bioinfo commented 1 week ago

Hi all, Is the challenge due to the high error rate in the reference panels for indels (this is less of the case for new variant calling pipelines) , variants representation or the algorithms requires 1bp regions ? As i understand, multi-allelic sites are also not imputed, maybe for similar reasons ?

rwdavies commented 1 week ago

I'd say a key reason is just historical / lack of time. I wrote much of STITCH (which includes the code to read in and process bams at variant sites) back in ~2012-2014 or so, when indels were less of a priority, and much less accurate. Adding bi-allelic indels to the current code base isn't conceptually hard, but would require time (say 1-2 weeks full time). I'm in industry now and don't have that kind of time. Zilong could do it but it would need to be weighed against his other responsibilities and projects. Multi-allelic variants (indels or SNPs) are slightly harder though still doable (especially in QUILT) though for STITCH they'd be a bit trickier (I think)

rwdavies / QUILT

Handling indels #45