szpiech / lassip

LASSI-Plus: A program to calculate haplotype frequency spectrum statistics
GNU General Public License v3.0

Long run times with salti #1

Closed: stsmall closed this issue 3 years ago

stsmall commented 3 years ago

Hi @szpiech, thanks for pulling all these programs together under one. I am attempting to classify selection in my species' genome. I have created the spectra file, but when I try to run salti with the input spectra it just doesn't finish. There are no errors and the process is running when I check, but there are no output files, and after 2 weeks it still hadn't finished. I started with a VCF containing two populations and followed the help to create spectra for each population. Then I ran --avg-spectra to create a null input. Top K haplotypes equals 10. I have 74 individuals (148 haplotypes). The genome size is 210 Mb. I am running on Red Hat Linux with 500 GB of RAM.

num lines/windows in spectra file : 107873

null spectra:

K 10 npop 1

K 0.202829 0.129039 0.103271 0.088702 0.0787052 0.0712654 0.0654245 0.0607006 0.0568397 0.0536012

command line: lassip --spectra X.K.lassip.hap.spectra.gz --salti --winsize 201 --winstep 51 --threads 40 --null-spec K.lassip.null.spectra.gz --out K.X

szpiech commented 3 years ago

Hi there, my first thought is that you might be running out of RAM and into swap, which would make the program grind to a crawl. With 500 GB of RAM and only a 210 Mbp genome, though, this seems somewhat unlikely. I assume that, looking at top, the RAM hasn't maxed out and the program isn't sitting at <1% CPU? I seem to recall that when I ran human chr1 (~250 Mbp) on a sample of ~100 on a similar computer with 30+ threads, it completed in about 20-30 min.

Hmm, I suppose you could try splitting your data by chromosome/contig and running each one separately (with the same null spec file). Let me know if this helps. I'm currently a bit stumped as to what's happening with your run.

stsmall commented 3 years ago

It is using 11.4G of the 500G available. All 40 cores are running at 100% CPU. This is a run with 1 chromosome (17 Mb) and 1 population. Current run time is 17 hours and it is still not finished. No errors. Maybe I set it up incorrectly? Would it be possible for me to send you the spectra files for the 17 Mb chromosome? Maybe I am missing something obvious.

stsmall commented 3 years ago

Hmm. Well, I tried the example with --lassi and it finished very quickly. I also reran the 17 Mb chromosome with --lassi and it also finished (13 seconds). When I use --salti on either dataset, it does not finish and is still running.

szpiech commented 3 years ago

Oh, you know, you may need to reduce --max-extend if your windows are fairly dense per 100kb. Maybe try something in the 10-20kbp range. Alternatively you might consider something like --winsize X --winstep 2X or some combination.

--max-extend should probably be based on number of windows instead of bps, something to add for a future update.
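
To get a rough sense of why this setting matters here, the figures reported earlier in this thread (107873 windows across a 210 Mb genome) can be turned into an average window spacing and an approximate count of neighbouring windows within a given distance. This is only a back-of-the-envelope sketch: it assumes windows are spread evenly along the genome and that the distance applies on each side of the core window, neither of which is confirmed in this thread, and the function names are made up for illustration.

```python
import math

# Figures reported earlier in this thread.
GENOME_SIZE_BP = 210_000_000
N_WINDOWS = 107_873


def mean_window_spacing_bp(genome_size_bp: int, n_windows: int) -> float:
    """Average distance between consecutive windows, assuming even spacing."""
    return genome_size_bp / n_windows


def windows_per_side(max_extend_bp: int, spacing_bp: float) -> int:
    """Approximate number of windows within max_extend_bp on one side
    of a core window (assumption: the limit applies per side)."""
    return math.floor(max_extend_bp / spacing_bp)


spacing = mean_window_spacing_bp(GENOME_SIZE_BP, N_WINDOWS)
for max_extend_bp in (100_000, 20_000, 10_000):
    per_side = windows_per_side(max_extend_bp, spacing)
    print(f"max extend {max_extend_bp:>7} bp: ~{per_side} windows per side")
```

Under these assumptions, lowering the limit from 100 kb to 10 kb shrinks the neighbourhood of each core window roughly tenfold, which is consistent with the advice to reduce it when windows are dense.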

stsmall commented 3 years ago

OK, just so I understand: --winstep should be 2 times larger than --winsize? So if --winsize 101, then --winstep 202? I tried this on the example and on my 17 Mb file: --salti --winsize 101 --winstep 202 --threads 40 --null-spec K.lassip.null.spectra.gz --out K.X --max-extend-bp 10000. It is still running at the moment. Should I expect --salti to be as quick as --lassi?

szpiech commented 3 years ago

saltiLASSI will never be as fast as LASSI; the likelihood is far more complex. My first suggestion would be to reduce --max-extend, because I currently suspect the long run times are being caused by including tons of windows. In the human data I was working on, 100kb produced good run times and scores that didn't look very inflated. If you set this parameter too high, you can get very long run times and inflated scores. Something else to consider (either separately from max extend or jointly) is making the windows sparser across the genome. If you made the step, for example, twice the window size, you'd cut the number of windows by a factor of 4 (from your current run), and that would help speed up computations too. Fiddling with these two settings should help with run time.
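
The effect of the step on the number of windows can be sketched with a little arithmetic. The snippet below is an illustration only: it assumes windows are defined as a fixed number of SNPs and advanced by a fixed number of SNPs, and the SNP count of 1,000,000 is a hypothetical placeholder, not a figure from this thread. The function name is likewise made up.

```python
def n_windows(n_snps: int, winsize: int, winstep: int) -> int:
    """Number of full sliding windows of `winsize` SNPs, advancing
    `winstep` SNPs at a time (assumed windowing scheme)."""
    if n_snps < winsize:
        return 0
    return (n_snps - winsize) // winstep + 1


N_SNPS = 1_000_000  # hypothetical SNP count, for illustration only

dense = n_windows(N_SNPS, winsize=101, winstep=51)    # overlapping windows
sparse = n_windows(N_SNPS, winsize=101, winstep=202)  # step = 2 x window size

print(f"winstep 51:  {dense} windows")
print(f"winstep 202: {sparse} windows")
print(f"reduction factor: {dense / sparse:.2f}")
```

Going from a step of 51 to a step of 202 gives roughly a fourfold reduction in windows under this scheme, since the window count scales approximately with 1/winstep.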

stsmall commented 3 years ago

OK. Thanks! I will try a few combinations and let it run. I can report back tomorrow.

stsmall commented 3 years ago

Seemed to work. I reduced --max-extend-bp to 10000. Runs took anywhere from 30 min to 4-5 hours. Thanks for your help!