When I use QUILT2 to impute 82 individuals, the running speed is very slow

bbdragon1 commented 4 days ago

My reference panel has 1600 individuals and 4368645 variants. When imputing from the BAM files of 82 individuals at 1x depth, I found that the imputation speed was very slow. I also set --rare_af_threshold=0.005, but the imputation speed did not change. Are there any other parameter settings that could speed up the imputation?
I am imputing pig chromosome 1, which spans 274.3 Mb, with chunk size = 5 Mb and buffer = 0.5 Mb, and the RData reference panel is already prepared. On the server, the job uses 1 node and 16 cores; the imputation run time is approximately 11 h. Here is my code:

    $gtime -v $QUILT2_PATH \
        --prepared_reference_filename="RData/QUILT_prepared_reference.${chrom}.${regionStart_buffer}.${regionEnd_buffer}.RData" \
        --bamlist="$Bamlist" \
        --chr="$chrom" \
        --method=diploid \
        --nCores="$Threads_num" \
        --regionStart="$regionStart_buffer" \
        --regionEnd="$regionEnd_buffer" \
        --buffer="$buffer" \
        --nGen=100 \
        --rare_af_threshold=0.005 \
        --outputdir="$Output_dir" \
        2> "$imputation_gtime_log"
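To make the chunking described above concrete, here is a minimal sketch that lists 5 Mb windows over a 274.3 Mb chromosome. The helper name `list_chunks` is hypothetical, and the convention of 1-based, non-overlapping windows (with the buffer handled separately via `--buffer`) is an assumption, not something stated in the post.

```shell
#!/bin/sh
# Hypothetical helper: print "start<TAB>end" for consecutive chunks.
# Arguments: chromosome length in bp, chunk size in bp.
# ASSUMPTION: 1-based, non-overlapping chunks; the buffer is not included here.
list_chunks() {
    chr_len=$1
    chunk_size=$2
    start=1
    while [ "$start" -le "$chr_len" ]; do
        end=$(( start + chunk_size - 1 ))
        if [ "$end" -gt "$chr_len" ]; then
            end=$chr_len
        fi
        printf '%d\t%d\n' "$start" "$end"
        start=$(( end + 1 ))
    done
}

# Pig chromosome 1 as described in the post: about 274.3 Mb in 5 Mb chunks.
list_chunks 274300000 5000000 | wc -l
```

With these numbers the chromosome splits into 55 chunks, so when comparing run times it matters whether a reported figure is per chunk or for the whole chromosome.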

bbdragon1 commented 4 days ago

When I ran GLIMPSE2 with similar parameters, the running time was about 2 hours.

rwdavies commented 3 days ago

GLIMPSE2 is usually 2-10X faster than QUILT2 with default parameters, depending on reference panel size, depth, etc. QUILT2 is usually at least as accurate as GLIMPSE2, and more accurate in certain situations, like very low coverage, as well as longer reads or higher heterozygosity (as often seen with non-human populations)

1600 individuals (presumably 3200 haplotypes?) is actually not a very large number of haplotypes for a reference panel (compared to what's available for humans). rare_af_threshold will partition SNPs into two categories - those with AF below that threshold and those above. It is particularly useful for WGS human panels where you might have ~90% of SNPs with AF below that threshold. Here I reckon that number is much lower (unless the dataset is overwhelmingly singletons?).

To speed things up I would focus more on the fact that there are so few reference haplotypes (comparatively speaking). I'm actually not even sure if QUILT2 offers much of an advantage over QUILT1 here. The key parameters are:

- n_seek_its = 3, which controls how many iterations are done between imputing with the small reference panel and going back to the full reference panel
- Ksubset = 600, which controls how many haps are in the small reference panel
- nGibbsSamples = 7, which controls how many Gibbs samples are performed

You could e.g. try dropping Ksubset to 200 (or even 100), if the 1600 reference haplotypes are decently differentiated (e.g. a couple subpopulations). You could also maybe drop nGibbsSamples to say 3.
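As a rough sketch of the suggestion above, the reduced settings could be collected into extra arguments for the existing command. The values 200 and 3 come from the reply, where Ksubset = 600 and nGibbsSamples = 7 are the quoted values. The variable names `KSUBSET`, `N_GIBBS` and `EXTRA_ARGS` are hypothetical, and passing these parameters as `--name=value` is an assumption based on how the other options in the original command are written.

```shell
#!/bin/sh
# Reduced settings suggested in the thread.
KSUBSET="${KSUBSET:-200}"   # could also try 100
N_GIBBS="${N_GIBBS:-3}"

# ASSUMPTION: these parameters are passed as --name=value, like the
# other options in the original command.
EXTRA_ARGS="--Ksubset=${KSUBSET} --nGibbsSamples=${N_GIBBS}"

# Dry run: only print what would be appended to the existing command.
echo "Append to the existing QUILT2 command: ${EXTRA_ARGS}"
```

Since these settings trade accuracy for speed, it is worth changing one at a time and checking accuracy after each change.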

You might also find small efficiency gains overall by running several independent jobs with fewer cores each, then merging the VCFs.
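One way to try this is sketched below: generate one command per chunk with fewer cores each, run a few of them at a time, then concatenate the per-chunk VCFs. The function name `emit_jobs`, the default placeholder values, and the file `vcf_list.txt` are all hypothetical. The script only prints the commands (a dry run), and the region coordinates in the demo are illustrative.

```shell
#!/bin/sh
# Placeholders standing in for the variables used in the original command.
: "${QUILT2_PATH:=QUILT2.R}"
: "${chrom:=chr1}"
: "${Bamlist:=bamlist.txt}"
: "${buffer:=500000}"
: "${Output_dir:=quilt2_output}"

# Read "regionStart regionEnd" pairs on stdin and print one QUILT2
# command per chunk, each using $1 cores.
emit_jobs() {
    cores_per_job=$1
    while read -r start end; do
        printf '%s --prepared_reference_filename=RData/QUILT_prepared_reference.%s.%s.%s.RData' \
            "$QUILT2_PATH" "$chrom" "$start" "$end"
        printf ' --bamlist=%s --chr=%s --method=diploid --nCores=%s' \
            "$Bamlist" "$chrom" "$cores_per_job"
        printf ' --regionStart=%s --regionEnd=%s --buffer=%s --outputdir=%s\n' \
            "$start" "$end" "$buffer" "$Output_dir"
    done
}

# Dry run for two illustrative chunks, 4 cores per job.
printf '1 5000000\n5000001 10000000\n' | emit_jobs 4

# To actually run 4 jobs at a time (4 x 4 = 16 cores), one option is:
#   ... | emit_jobs 4 | xargs -P 4 -I{} sh -c '{}'
# Then concatenate the per-chunk VCFs, listed in genomic order in a
# hypothetical file vcf_list.txt:
#   bcftools concat -f vcf_list.txt -Oz -o merged.vcf.gz
```

Whether 4 jobs of 4 cores beats 1 job of 16 cores depends on memory and I/O on the node, so it is worth timing both on a few chunks first.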

Do you have measures of accuracy, e.g. some of the 82 1X samples also sequenced at higher coverage, that you can use to confirm which method, or which parameters of a method, are more accurate?
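If such higher-coverage samples exist, one common way to compare methods or parameter settings is the squared Pearson correlation (r2) between the high-coverage genotypes and the imputed dosages. The sketch below assumes a two-column, whitespace-separated table has already been prepared, with the truth genotype coded 0/1/2 in column 1 and the imputed dosage in column 2, one row per site and sample. The function name `dosage_r2` and that table layout are hypothetical, and building the table from the VCFs is not shown.

```shell
#!/bin/sh
# Hypothetical helper: squared Pearson correlation between column 1
# (truth genotype, 0/1/2) and column 2 (imputed dosage) read from stdin.
# Prints NA if there are fewer than 2 rows or either column is constant.
dosage_r2() {
    awk '
        NF >= 2 {
            n++
            sx += $1; sy += $2
            sxx += $1 * $1; syy += $2 * $2; sxy += $1 * $2
        }
        END {
            vx = n * sxx - sx * sx
            vy = n * syy - sy * sy
            cov = n * sxy - sx * sy
            if (n < 2 || vx <= 0 || vy <= 0) { print "NA"; exit }
            printf "%.4f\n", (cov * cov) / (vx * vy)
        }
    '
}

# Toy example with made-up numbers, only to show the input format.
printf '0\t0.1\n1\t0.9\n2\t1.8\n0\t0.3\n' | dosage_r2
```

Running the same calculation on the QUILT2 and GLIMPSE2 outputs, ideally stratified by allele frequency, would show whether the faster settings cost any accuracy.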

rwdavies / QUILT

When I use QUILT2 to impute 82 individuals, the running speed is very slow #47