Open uqmzhou8 opened 1 year ago
Hi Charles,
Sorry to hear you're running into issues. I don't have a good solution to your problem, but here are some thoughts in the hopes that they're helpful:

Building the lookup table has two main parts: computing the stationary distribution and then integrating it forward through the demography. The --store_stationary and --load_stationary flags allow you to split these two components up. In practice --store_stationary will run both bits but then save the stationary distribution, while --load_stationary will take a stored stationary distribution and use that instead of computing it. Depending on how much time is spent on each of the two steps in your case, you may or may not get meaningful savings out of this. To avoid performing the second part of the likelihood computation (i.e., integrating the stationary distribution forward) when using --store_stationary, you would need to input a constant-size demography whose size matches the most ancient size in the demography that you want to compute. You would then need to throw away the lookup table produced by that run, but you could keep the stored stationary distribution. This might help, but the best-case scenario is that computing the stationary distribution and integrating it forward are currently roughly equal in runtime, in which case you could cut your whole job into two jobs of roughly half the time each. Even then, I suspect this only takes you from ~12 days to ~6 days per job, which is probably still too long for a shared cluster.
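If you wanted to try this, the two jobs could look very roughly like the sketch below. Only --store_stationary and --load_stationary are taken directly from the help text; the other flag names and every number (sample size, Moran population size, mutation rate, demography, thread count) are illustrative placeholders that should be checked against pyrho make_table --help and replaced with your own values.

```bash
# Job 1: a constant-size demography whose size matches the most ancient size
# (here assumed to be 30000) in the demography you actually want. Keep the
# stored stationary distribution, but throw away the lookup table this run makes.
pyrho make_table -n 450 -N 600 --approx --numthreads 16 \
    --mu 1.25e-8 \
    --popsizes 30000 \
    --store_stationary stationary_n450_N600.pkl \
    --outfile throwaway_table.hdf

# Job 2: the demography you actually want, reusing the stored stationary
# distribution instead of recomputing it.
pyrho make_table -n 450 -N 600 --approx --numthreads 16 \
    --mu 1.25e-8 \
    --popsizes 20000,3000,30000 --epochtimes 1000,5000 \
    --load_stationary stationary_n450_N600.pkl \
    --outfile n450_N600_lookuptable.hdf
```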
pyrho is quite fast for smaller sample sizes, so you could take a bunch of random subsets of your 450 individuals, compute a recombination map for each, and then average the results. None of the subsampling or averaging is implemented in pyrho, so it would require a bit of scripting on your end (a rough sketch follows).
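Something like the following could serve as a starting point. It is only meant to illustrate the idea: it assumes diploid data, uses bcftools (not part of pyrho) to extract each random subset, reuses a single lookup table built for the subset size, and leaves the final position-by-position averaging of the per-subset maps to you. The pyrho optimize flag names and all of the numbers are from memory/placeholders, so check them against pyrho optimize --help.

```bash
# One lookup table for the subset size is enough, since every subset has the
# same number of haplotypes (here 50 diploids = 100 haplotypes).
pyrho make_table -n 100 -N 120 --approx --mu 1.25e-8 \
    --popsizes 20000,3000,30000 --epochtimes 1000,5000 \
    --outfile subset_lookuptable.hdf

# Draw several random subsets of the 450 individuals and estimate a map for each.
bcftools query -l all_450.vcf.gz > all_samples.txt
for i in $(seq 1 10); do
    shuf -n 50 all_samples.txt > subset_${i}.txt
    bcftools view -S subset_${i}.txt -Oz -o subset_${i}.vcf.gz all_450.vcf.gz
    pyrho optimize --vcffile subset_${i}.vcf.gz \
        --tablefile subset_lookuptable.hdf \
        --ploidy 2 \
        --outfile subset_${i}.rmap
done

# subset_1.rmap ... subset_10.rmap then need to be averaged position by
# position; that averaging is the bit of scripting mentioned above.
```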
As for adding the kind of checkpointing you asked about to pyrho itself: I don't have the bandwidth to do this, but I think it is doable in principle, and I would be happy to accept a PR.

I hope this helps!
Jeff
Hi,
Apologies, as I am relatively new to population genetics work. I am trying to compute a lookup table for n=450 and N=600, but it is taking too long to run on a high-performance computing cluster, exceeding the time limit even when using multiple threads. I am therefore trying to figure out how to construct multiple smaller tables and join them into a complete lookup table, or whether there is a way to pause, save, and continue from predefined checkpoints.
Based on pyrho make_table --help, there are two arguments that seem relevant to my case:

  -S STORE_STATIONARY, --store_stationary STORE_STATIONARY
        Name of file to save stationary distributions -- useful for
        computing many lookup tables sequentially.
  -L LOAD_STATIONARY, --load_stationary LOAD_STATIONARY
        Name of file to load stationary distributions -- useful for
        computing many lookup tables sequentially.
Are these the correct arguments for storing intermediate results at checkpoints, and would you have any examples of how I can use them to build my way up to n=450 or more? Please also suggest any other methods I could use to build a lookup table for n=450 or more while splitting the work into smaller files/jobs.
Thank you in advance!
Regards, Charles