rwdavies / STITCH

STITCH - Sequencing To Imputation Through Constructing Haplotypes
http://www.nature.com/ng/journal/v48/n8/abs/ng.3594.html
GNU General Public License v3.0

Increasing memory usage with --nCores #30

Closed: jelber2 closed this issue 4 years ago

jelber2 commented 4 years ago

Hi,

I have access to 75 cores and 378 GB of RAM on my machine, but STITCH 1.6.2's memory usage grows with the value of --nCores. For example, I can run ~800 0.5x-coverage CRAM files for a ~100 Mbp chromosome with ~200,000 reference SNP sites using --nCores=75 with --K=2 and --S=1, but with --K=40 and --S=1 I can only use --nCores=30; otherwise memory usage climbs to ~100% and the machine starts swapping.

Example command that I am using

Rscript /genetics/elbers/STITCH/STITCH.R \
--reference=scaffold2.ref.fasta \
--chr=scaffold_2_arrow \
--cramlist=cramlist.txt \
--outputSNPBlockSize=100000 \
--posfile=chr.pos.txt \
--outputdir=./ \
--K=40 \
--nGen=100 \
--nCores=30 \
--S=1 > stitch.log 2>&1

Is there any way to lower the total memory usage per core/process? I ask because I would like to test out --S={4,6,8,...,40}, and presumably that would take a lot less time if I could use more CPU cores.
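(For reference, the sweep I have in mind would look something like the sketch below, using the STITCH R package interface rather than the command line; the per-run output directory names are hypothetical.)

# Minimal sketch of an S sweep, assuming the STITCH R package is installed;
# each run writes to its own (hypothetical) output directory
library("STITCH")
for (S in c(4L, 6L, 8L)) {
  STITCH(
    chr = "scaffold_2_arrow",
    reference = "scaffold2.ref.fasta",
    cramlist = "cramlist.txt",
    posfile = "chr.pos.txt",
    outputdir = sprintf("./S_%d/", S),
    K = 40, nGen = 100, nCores = 30,
    S = S
  )
}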

rwdavies commented 4 years ago

RAM usage should be proportional to K^2 * nGrids * S * nCores, plus another term with a larger constant that scales as K^2 * nGrids * S. The first term works out to about 2.3 GB for K=40, nSNPs = nGrids = 200,000, S=1, nCores=1 (one K x K x nGrids block of 8-byte doubles: 40 * 40 * 200,000 * 8 bytes). And there are 4 of those I think, so about 10 GB per core for those objects, or roughly 300 GB at nCores=30, which seems to track with your nCores limit.
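To make that arithmetic easy to rerun, here is a rough sketch (the 4-array and 8-byte constants are assumptions from the estimate above, not exact STITCH internals):

# Back-of-the-envelope estimate of RAM for the dominant per-core arrays;
# n_arrays = 4 and bytes = 8 are assumptions, not exact STITCH internals
estimate_ram_gb <- function(K, nGrids, S, nCores, n_arrays = 4, bytes = 8) {
  n_arrays * K^2 * nGrids * S * bytes * nCores / 1e9
}
estimate_ram_gb(K = 2,  nGrids = 200000, S = 1, nCores = 75)  # ~1.9 GB, fits easily
estimate_ram_gb(K = 40, nGrids = 200000, S = 1, nCores = 30)  # ~307 GB, near your 378 GB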

Thoughts:

1) You could try setting gridWindowSize to some value to run on grids instead of individual SNPs, which will decrease RAM and run time but slightly increase the error rate (see the sketch after this list). More info here: https://github.com/rwdavies/STITCH#notes-on-the-relationship-between-run-time-ram-and-performance

2) This seems like a lot of SNPs? Is this human data? If yes, I'd break the genome into smaller windows, see https://github.com/rwdavies/STITCH/issues/2. Alternatively, I would consider excluding SNPs with very low allele frequency, say MAF under 1%, as they are unlikely to impute well (though again this depends on your setup/species: it holds for wild/outbred populations, while more inbred/closed/bottlenecked populations might be fine below 1%).

3) K=40 might be too much for 800 samples at 0.5X. It would be worth exploring K=10 vs 20, 30 and 40 first, then checking out values of S.
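As a concrete illustration of point 1, here is a minimal sketch of the same run with a grid enabled, via the STITCH R package interface (the 10 kb window is a hypothetical starting point, not a recommendation; equivalently, add --gridWindowSize=10000 to your Rscript command):

# Minimal sketch: the run above, but imputing on ~10 kb grids instead of
# per-SNP (the gridWindowSize value is a hypothetical starting point)
library("STITCH")
STITCH(
  chr = "scaffold_2_arrow",
  reference = "scaffold2.ref.fasta",
  cramlist = "cramlist.txt",
  posfile = "chr.pos.txt",
  outputdir = "./",
  K = 40, nGen = 100, nCores = 30, S = 1,
  gridWindowSize = 10000
)

With a ~100 Mbp chromosome, a 10 kb window drops nGrids from ~200,000 SNPs to ~10,000 grids, shrinking the K^2 * nGrids terms above by roughly 20x.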

Hope that's useful, Robbie

jelber2 commented 4 years ago

Dear Robbie, it is actually simulated data on a bird genome (yes, it is a lot of SNPs). Thanks a lot for your suggestions!