odelaneau / shapeit5

Segmented HAPlotype Estimation and Imputation Tool
https://odelaneau.github.io/shapeit5/
MIT License

UKB WGS: possible to phase per independent block instead of per chromosome? #35

Open WeiCSong opened 1 year ago

WeiCSong commented 1 year ago

Hi, thanks for the great tool! On UKB WGS data, since concatenating all chunks per chromosome will produce an extremely large BCF, I wonder whether it is feasible to segment hg38 into 1381 independent blocks with LDetect (by this paper) and do the phasing per block. The workflow would look like:

for block in BLOCK:
    do QC and concatenate all chunks in block;
    phase_common in block;
    skip ligation step;
    phase_rare in block, no smaller chunk;
done

Assuming that 1) SHAPEIT5 supports multi-threading when analyzing one large chunk, and 2) the haplotype scaffold of an independent block would be accurate, I think this workflow will perform similarly to the tutorial. Do you think this workflow makes sense? Thanks for your help!

odelaneau commented 1 year ago

We looked at this quite a lot in the UK Biobank data. We found that the best approach was to run phase_common in 20 cM chunks and phase_rare in 5 cM chunks. Reducing the chunk size is not ideal, as it prevents capturing identity-by-descent (IBD) sharing between samples.
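To make the chunk sizes concrete, here is a minimal Python sketch of cutting a genetic map into windows of a given genetic length. It is only an illustration, not how the SHAPEIT5 authors derived their chunks: the function name `chunk_by_cm`, the in-memory map format (a list of base-pair position and cumulative cM pairs) and the toy map are all assumptions made for this example. Chunks intended for ligation would also need to overlap, which this sketch omits.

```python
from bisect import bisect_left


def chunk_by_cm(map_points, chunk_cm):
    """Split a genetic map into consecutive windows of about chunk_cm centimorgans.

    map_points: list of (bp_position, cumulative_cM), sorted by position
                (hypothetical in-memory format, assumed for this sketch).
    Returns a list of (start_bp, end_bp) tuples covering the map.
    """
    if not map_points:
        return []
    positions = [bp for bp, _ in map_points]
    cms = [cm for _, cm in map_points]
    chunks = []
    start_idx = 0
    last = len(map_points) - 1
    while start_idx < last:
        target = cms[start_idx] + chunk_cm
        # First map point at or beyond the target genetic distance,
        # clamped to the last point of the map.
        end_idx = min(bisect_left(cms, target, lo=start_idx + 1), last)
        chunks.append((positions[start_idx], positions[end_idx]))
        start_idx = end_idx
    return chunks


# Toy map with made-up numbers: 1 cM per Mb over 50 Mb.
toy_map = [(i * 1_000_000, float(i)) for i in range(51)]
common_chunks = chunk_by_cm(toy_map, 20.0)  # windows sized for phase_common
rare_chunks = chunk_by_cm(toy_map, 5.0)     # windows sized for phase_rare
```

On a real chromosome the recombination rate varies, so windows of equal genetic length will differ widely in physical length.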

Producing whole chromosome haplotypes is not really an issue as BCF/VCF file format allows for random access (bcftools / HTSlib).
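As a rough illustration of the random access mentioned above, the Python sketch below builds a `bcftools view -r` command that extracts one region from an indexed whole-chromosome BCF. The file names, the region, and the helper name `region_extract_cmd` are hypothetical, and the command is only assembled here, not executed. It assumes the input has already been indexed (for example with `bcftools index`).

```python
import shlex


def region_extract_cmd(bcf_path, region, out_path):
    """Build a bcftools command that extracts one region from an indexed BCF.

    All paths and the region string are placeholders for this sketch.
    """
    # -r uses the index to jump to the region; -Ob writes compressed BCF.
    return ["bcftools", "view", "-r", region, "-Ob", "-o", out_path, bcf_path]


cmd = region_extract_cmd(
    "chr20.phased.bcf",          # hypothetical whole-chromosome file
    "chr20:10000000-15000000",   # hypothetical region of interest
    "chr20.slice.bcf",           # hypothetical output file
)
print(shlex.join(cmd))
```

The resulting list could be passed to `subprocess.run(cmd, check=True)` on a machine where bcftools is installed.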

Also, we made phased haplotypes with SHAPEIT5 for 200,031 UK Biobank samples with WGS. This will be released very soon (this summer I guess).

WeiCSong commented 1 year ago

Thanks for the valuable suggestion! An additional question: if I use the SNP array data for the entire chromosome to build the scaffold, will it be more accurate than phasing all common SNPs per 20 cM chunk and then ligating the chunks? I am not sure which is more important, SNP density or region length.