rwdavies / STITCH

STITCH - Sequencing To Imputation Through Constructing Haplotypes
http://www.nature.com/ng/journal/v48/n8/abs/ng.3594.html
GNU General Public License v3.0

Suggestion on region size? #2

Open hliang opened 7 years ago

hliang commented 7 years ago

Is there any suggestion on setting the size of the imputation region?

In the Nature Genetics paper (doi:10.1038/ng.3594), a 10 Mb human region was tested. But there is also an example of running STITCH on a whole chromosome (Mouse example 1 - Run whole chromosome) in the kindly provided examples.R. Will a larger region size affect accuracy, or is it just set smaller to make sure the computer will not complain?

rwdavies commented 7 years ago

If the buffer is well set, region size should not affect accuracy, and then chunking decisions are governed by RAM / CPU considerations.

In any species, if you are imputing in chunks (i.e. not whole chromosomes), there is a minimum buffer size you should probably use, governed by the recombination rate. In humans from one population (for example British), LD (r2) should drop to near 0 by ~250 kbp, so one could chunk the genome into windows and use that much on either side as a buffer, to ensure there are no edge effects in your imputation. (If you imputed in windows but had no buffer, and there was a recombination hotspot in that population near the edge of a window, then SNPs between the hotspot and the end of the window may impute poorly, since there are a limited number of SNPs in that LD block.)
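Concretely, imputing one such window with its buffer looks something like the call below (a minimal sketch; the chromosome, coordinates, paths, K and nGen are all placeholders, so check ?STITCH for the argument defaults in your version):

```r
library("STITCH")

## sketch: impute one window of chr20 with a 250 kbp buffer on either side
## (placeholder paths and parameter values)
STITCH(
    chr = "chr20",
    regionStart = 10000000,
    regionEnd = 12500000,
    buffer = 250000,        # flanking sequence used during model fitting,
                            # but not written to the output VCF
    bamlist = "bamlist.txt",
    posfile = "pos.chr20.txt",
    outputdir = "stitch_chr20_10-12.5Mb/",
    K = 10,
    nGen = 8000,            # placeholder value for humans
    nCores = 4
)
```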

In the mice used in the paper, LD decayed much more slowly (see Figure 1 of Nicod et al. in the same issue of Nature Genetics). Had we wanted to chunk the chromosomes, we would probably have needed a buffer of about 1 or even 2 Mbp.

However, for the mice, since there were only 2000 of them at ~0.15X, I could re-run genome-wide analyses on a modest server (48 AMD cores with lots of RAM) in ~2-3 days when parallelising by chromosome, so I didn't bother chunking any further than that. Had I chunked, I would have needed to use overlapping buffers, so the total run time would have increased.

For the humans, with 12000 of them at >1.5X, given all the conditions I wanted to test, it was far too slow to run on that server, so I farmed the job out to the cluster at the WTCHG. I further broke the 10 Mbp region down into 500 kbp regions (with 100 kbp buffers, I think), both because that uses less RAM and because jobs asking for fewer cores get scheduled and run faster. There is a chance imputation accuracy was adversely affected by the buffer size, but I used the same buffer for all methods in the paper, hopefully minimising that concern.
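In case it is useful, generating the window coordinates for that kind of chunking is just something like this (a sketch with made-up coordinates):

```r
## sketch: split a 10 Mbp region into 500 kbp windows, each to be run as
## its own STITCH job with a 100 kbp buffer (coordinates are illustrative)
region_start <- 20000000
region_end   <- 30000000
window       <- 500000
buffer       <- 100000

starts  <- seq(region_start, region_end - window, by = window)
windows <- data.frame(
    regionStart = starts + 1,     # 1-based, non-overlapping windows
    regionEnd   = starts + window
)
nrow(windows)   # 20 windows for the 10 Mbp region
## each row then becomes one job, e.g.
## STITCH(chr = "chr20", regionStart = windows$regionStart[i],
##        regionEnd = windows$regionEnd[i], buffer = buffer, ...)
```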

Hope that helps!

hliang commented 7 years ago

Thank you very much for this great explanation! The details on LD decay are especially helpful.

hliang commented 7 years ago

Just want to share some run time metrics from a test analysis: 12800 human WGS samples, low coverage ~0.1X, CRAM format. STITCH was run on small regions of size = 10 Mbp with buffer = 100 kbp (about 300 regions for the human genome), using K=10, nGen=8000, niterations=40. The analysis was done in two parts, pretty similar to the ones shown in STITCH's examples.R, but I also wrote some scripts to wrap them together and parallelise the jobs:

On our cluster, each compute node has 24 CPU cores and 64 GB of memory. When parallelising jobs across 50 nodes, the total run time is less than 2 days.
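Roughly, the two parts look like this for one of the ~300 regions (a simplified sketch; the paths and file names are placeholders, and the generateInputOnly / regenerateInput arguments are the ones I remember from examples.R, so please check them against your STITCH version):

```r
library("STITCH")

## Part 1: generate per-sample input only; run once per sample, with a
## cramlist containing just that sample's CRAM (placeholder paths)
STITCH(
    chr = "chr20",
    regionStart = 1, regionEnd = 10000000, buffer = 100000,
    cramlist = "cramlist.sample_0001.txt",
    reference = "hs37d5.fa",
    posfile = "pos.chr20.txt",
    outputdir = "part1/sample_0001/",
    K = 10, nGen = 8000,
    generateInputOnly = TRUE
)

## Part 2: after collecting/renaming all the per-sample input RData files
## into this run's input directory, impute using the full cramlist and
## re-use those inputs rather than re-reading the CRAMs
STITCH(
    chr = "chr20",
    regionStart = 1, regionEnd = 10000000, buffer = 100000,
    cramlist = "cramlist.all_12800_samples.txt",
    reference = "hs37d5.fa",
    posfile = "pos.chr20.txt",
    outputdir = "stitch_chr20_0-10Mb/",
    K = 10, nGen = 8000, niterations = 40,
    regenerateInput = FALSE,
    nCores = 24
)
```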

rwdavies commented 7 years ago

Neat! Sounds like a good strategy. Since this is humans, did you use reference panels? (STITCH can initialize with reference panels, but it only initializes; after 1 iteration, the updating is done as normal with only the sample data. Also, I haven't really tested this exhaustively, so I don't know if it makes a big difference, but it could allow you to use a higher K and fewer iterations for the same performance.)
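For reference, pointing STITCH at a panel looks roughly like the call below (a sketch; the reference files are IMPUTE2-format hap / legend / sample files, the paths and values are placeholders, and the argument names are as I recall them, so check ?STITCH):

```r
## sketch: a normal STITCH call plus reference panel files used to
## initialize the ancestral haplotypes (placeholder paths and values)
STITCH(
    chr = "chr20",
    bamlist = "bamlist.txt",
    posfile = "pos.chr20.txt",
    outputdir = "stitch_chr20/",
    K = 40,                       # a panel might let you push K higher
    nGen = 2000,
    niterations = 20,             # and possibly use fewer iterations
    reference_haplotype_file = "1000GP_Phase3_chr20.hap.gz",
    reference_legend_file    = "1000GP_Phase3_chr20.legend.gz",
    reference_sample_file    = "1000GP_Phase3.sample",
    reference_populations    = c("GBR", "CEU")   # optional subset of panel populations
)
```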

Are you confident you got the input/*.RData files properly sorted between parts 1 and 2 (for example, does the allele frequency match up as expected between 1000G and your VCF after part 2)? Also, what version was this? v1.3.2 and later should allow for much faster analysis of CRAM files, as it avoids a painful CRAM -> BAM conversion used pre-1.3 (although your pipeline already sounds quite fast!)
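For that allele frequency check, something like the following works (a sketch; it assumes you have already extracted per-site allele frequencies from both your STITCH VCF and the 1000G VCF into simple two-column position / frequency tables, e.g. with bcftools query):

```r
## sketch: compare imputed allele frequencies against 1000G at shared sites
## (assumes two tab-separated files with columns: position, allele frequency)
stitch_af <- read.table("stitch.chr20.af.txt", col.names = c("pos", "af"))
kg_af     <- read.table("1000G.chr20.af.txt",  col.names = c("pos", "af"))

m <- merge(stitch_af, kg_af, by = "pos", suffixes = c(".stitch", ".1000G"))
cor(m$af.stitch, m$af.1000G)   # should be close to 1 if parts 1 and 2 line up
plot(m$af.1000G, m$af.stitch, pch = ".",
     xlab = "1000G allele frequency", ylab = "STITCH estimated allele frequency")
```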

hliang commented 7 years ago

No reference panels were used. The version is 1.2.4 with slight modifications (to get around minor bugs like #1 and #3). I would like to test the latest version and reference panels when I get a chance.

In part 1, only one sample is fed into STITCH, so the input RData files are all named sample.1.input.RData. I needed to rename them by iBam (the index of the sample in the bamlist/cramlist) so that STITCH would accept them in part 2, as in the sketch below. I will check the allele frequencies to confirm things went as expected.
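The renaming step was roughly this (a sketch; the paths are placeholders matching the earlier example):

```r
## sketch: collect the part 1 inputs, renaming each sample.1.input.RData to
## sample.<iBam>.input.RData, where iBam is the sample's index in the full
## cramlist used for part 2 (placeholder paths)
crams   <- readLines("cramlist.all_12800_samples.txt")
samples <- sub("\\.cram$", "", basename(crams))

dir.create("stitch_chr20_0-10Mb/input", recursive = TRUE, showWarnings = FALSE)
for (iBam in seq_along(crams)) {
    from <- file.path("part1", samples[iBam], "input", "sample.1.input.RData")
    to   <- file.path("stitch_chr20_0-10Mb", "input",
                      sprintf("sample.%d.input.RData", iBam))
    file.rename(from, to)
}
```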

BTW, my posfile contains 8 million random SNPs sampled from 1000G.
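For context, the posfile is just the tab-separated chr / position / ref / alt format (no header) that STITCH expects; mine was built roughly like this (a sketch; the pre-extracted 1000G site list and file names are hypothetical):

```r
## sketch: build a per-chromosome posfile (tab-separated chr, pos, ref, alt,
## no header) by sampling biallelic SNPs from a pre-extracted 1000G site list
sites <- read.table("1000G.chr20.biallelic_snps.txt",
                    col.names = c("chr", "pos", "ref", "alt"))
set.seed(1)
keep <- sort(sample(nrow(sites), size = 250000))   # this chromosome's share of the ~8M
write.table(sites[keep, ], file = "pos.chr20.txt", quote = FALSE,
            sep = "\t", row.names = FALSE, col.names = FALSE)
```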