jalhackl / introunet


running time for `intronets_simulate_training_set.snake` #15

Closed: xin-huang closed this issue 8 months ago

xin-huang commented 8 months ago

On my desktop (AMD Ryzen™ 9 5950X), `intronets_simulate_training_set.snake` took 45 min to create the five h5 files.

xin-huang commented 8 months ago

On the cluster, however, the job did not finish after 8 hours in the himem partition.

jalhackl commented 8 months ago

This is very strange and interesting. I think the main reason is that 64 cores (or even 128?) are far too many; with a lower number it is significantly faster, e.g. on your Ryzen with 16 cores / 32 threads. (I never used that many cores, so I did not notice this tremendous slow-down either. Using the himem partition for this step should certainly not be necessary.)

I think the problem is that we are simulating thousands or millions of very small (50 kb) fragments with the sstar.simulate function, in contrast to the 'usual' setting where only a few (e.g. 100) very long (e.g. 200 Mb) chromosomes are simulated. In this setting, a higher value for 'threads' in the function (which are in fact processes, as we discussed last time), i.e. investing more CPUs, no longer provides a speed-up; because of the overhead caused by multiprocessing (distributing all the small tasks, etc.), it gets slower and slower. So using that many cores (specified by 'threads' in sstar.simulate) would work well for large chromosomes, but for the small fragments it dramatically impairs performance (see the toy sketch below).
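To make that intuition concrete, here is a minimal, self-contained toy benchmark (it does not use sstar at all, just a dummy workload) that times a process pool on many tiny tasks for different worker counts; the exact numbers will vary by machine, but with such small tasks the dispatch overhead tends to dominate and adding workers quickly stops paying off:

```python
# Toy illustration of multiprocessing overhead with many tiny tasks.
# This does NOT call sstar.simulate; tiny_task is just a stand-in for
# simulating one small (e.g. 50 kb) fragment.
import time
from multiprocessing import Pool


def tiny_task(_):
    return sum(i * i for i in range(1_000))


def run(n_workers, n_tasks=100_000):
    start = time.perf_counter()
    with Pool(processes=n_workers) as pool:
        # chunksize=1 mimics dispatching every small fragment individually
        pool.map(tiny_task, range(n_tasks), chunksize=1)
    return time.perf_counter() - start


if __name__ == "__main__":
    for n_workers in (4, 16, 64):
        print(f"{n_workers:>3} workers: {run(n_workers):6.2f} s")
```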

For now, one should run the sstar.simulate function with fewer cores (e.g. 16 or even fewer, specified by the 'threads' variable); then it should be about as fast on the cluster as on your desktop PC. (The subsequent seriation part of the script can probably use the CPUs more efficiently.) A sketch of how the snake file could cap this is shown below.
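A minimal sketch of what capping this in the snake file could look like; the rule name, output path, and script name are placeholders, not the actual ones in `intronets_simulate_training_set.snake`:

```python
# Hypothetical rule sketch (names/paths are placeholders).
rule simulate_training_set:
    output:
        "results/training_set.h5"
    # Cap the worker count well below the 64/128 cores of a cluster node;
    # the called script can read snakemake.threads and pass it on to the
    # 'threads' argument of sstar.simulate.
    threads: 16
    script:
        "scripts/simulate_training_set.py"
```

This way the same value is respected by the cluster scheduler and by the simulation call, and it can easily be tuned per machine.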

In principle, this should be 'sufficiently' fast; however, further refactoring could be advantageous. One could try to start the sstar simulations in batches (rather than in a single command as it is now). As discussed last time, it is rather difficult to have multiple processes write to one h5 file (it is possible using MPI, see https://docs.h5py.org/en/stable/mpi.html, but I would prefer to avoid that). If necessary, one could instead start multiple simulate-seriate-create_h5-delete_files workflows via snakemake and then add another rule that merges their outputs; a merge sketch is given below. I think this could improve performance (scaling, in principle, with the number of parallel snakemake jobs), although one has to take into account that merging the h5 files also imposes some overhead.
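If the batched variant is tried, the merge rule itself could be kept simple with plain h5py (no MPI); a minimal sketch, assuming every partial file contains the same top-level datasets and that they can be concatenated along the first axis:

```python
# Minimal h5 merge sketch. Assumes each partial file holds the same
# top-level datasets, concatenable along axis 0; file names are examples.
import h5py


def merge_h5(in_paths, out_path):
    with h5py.File(out_path, "w") as out:
        for path in in_paths:
            with h5py.File(path, "r") as f:
                for name in f:
                    data = f[name][...]
                    if name not in out:
                        # create an extendable dataset on first encounter
                        maxshape = (None,) + data.shape[1:]
                        out.create_dataset(name, data=data, maxshape=maxshape)
                    else:
                        ds = out[name]
                        old = ds.shape[0]
                        ds.resize(old + data.shape[0], axis=0)
                        ds[old:] = data


if __name__ == "__main__":
    merge_h5(["batch_0.h5", "batch_1.h5"], "merged.h5")
```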

In any case, there remains the difficulty that the optimal distribution of resources (i.e. how many 'threads'/processes for the sstar.simulate calls, how many parallel snakemake jobs / CPUs in multiprocessing.Pool) depends on the data to be simulated (a small vs. a large number of replicates, short vs. long sequences).

xin-huang commented 8 months ago

Refactored in https://github.com/jalhackl/introunet/pull/16