Hi @mkesapra, I've had this same issue. For me the solution was to run the partitioning step with a reduced number of processors.
So instead of running one of the all-in-one workflow commands, you can run each step individually and tell only the partitioning step to use 2 or so processors (see the sketch below).
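For example, a rough sketch from memory (flag names may differ slightly between PPanGGOLiN versions, so check `ppanggolin --help`; `organisms.gbff.list` and the paths are placeholders for your own files):

```bash
# Annotate/load genomes; organisms.gbff.list is a list mapping genome names to annotation files
ppanggolin annotate --anno organisms.gbff.list --output pangenome_dir --cpu 32

# Clustering and graph building scale fine with many CPUs
ppanggolin cluster -p pangenome_dir/pangenome.h5 --cpu 32
ppanggolin graph -p pangenome_dir/pangenome.h5

# Partitioning is the RAM-hungry step, so give it only a couple of CPUs
ppanggolin partition -p pangenome_dir/pangenome.h5 --cpu 2
```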
Hi,
Indeed, the problem here is likely a RAM issue: PPanGGOLiN uses more RAM as it uses more CPUs for the partitioning step, and 128 CPUs is quite a lot, way more than what I've been using. I'd recommend either reducing the number of CPUs overall, which should still run relatively quickly (I've done 10-15k+ genomes on 16 CPUs in a day or so), or following @Jtrachsel's excellent suggestion: use the step-by-step approach and reduce the number of CPUs for the partitioning step only, as it is the only step with this RAM problem.
Adelme
Thank you @Jtrachsel and @axbazin. Running each step individually and specifying 16 CPUs for the partitioning step got me through. The only issue is that the partitioning step alone took around 3 days 5 hours on 16 CPUs for the 10k genomes. I am wondering if there is anything that could make it run faster.
Hi,
We've not tested the following parameters extensively on large numbers of genomes, but we have on smallish sets (1000 genomes max). They will impact speed, but can also impact result quality to some extent:
- `--beta 0` will deactivate spatial smoothing. This skips a heavy part of PPanGGOLiN's statistical approach (the Markov random field part), so it should be faster. It will likely have an impact on results, though when testing on smaller datasets the impact was marginal.
- `--chunk_size` (set to 300, for example). Smaller chunk sizes are faster to compute, but results may vary a bit more. The current default (500) was chosen because it generally yields extremely stable results. Lowering it a bit should still yield trustworthy results as long as it's not too low (e.g. less than 100 is possibly too low).
- `-K`. While this is usually computed automatically, the computation can be a bit heavy. You can reuse the value you got from previous runs on a subset of your genomes, for example, or if you know and expect subpopulations in your dataset you can set it to (number of subpopulations + 2). It's usually better not to do this, as it can have an important impact on your results if not chosen well.
- `--tmpdir`. It should point to a local disk rather than a shared one. PPanGGOLiN uses the default TMPDIR automatically, but on some infrastructures this may not be the best one. If the TMPDIR is not properly set, this can have a very important impact on speed. It will not change the results in any way.

If you actually experiment with those parameters on very large datasets, I'd be very curious to know your conclusions/observations, if you're willing to let me know!
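For reference, combining these options on the partition step could look something like this (paths are placeholders; double-check flag spellings with `ppanggolin partition --help` for your version):

```bash
# Speed-oriented partitioning: no spatial smoothing, smaller chunks,
# and temporary files on a fast local disk
ppanggolin partition -p pangenome_dir/pangenome.h5 --cpu 16 \
    --beta 0 --chunk_size 300 --tmpdir /local/scratch
```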
Adelme
Thank you @axbazin. I will refer to these parameters in the future if we have to run any other large set of genomes, and will let you know my observations.
Hello, I also had issues with the partition step, so I tested it on my cluster.
I followed @Jtrachsel's instructions and ran the workflow one step at a time. In my first attempt I used 60 cores (`ppanggolin all`) and it got stuck on the partition step with a bus error. Then I split the job and ran the steps one at a time with 40 cores. In the partition step I still used 40 cores, but I changed the parameters as suggested by @axbazin: `--beta 0` and `--chunk_size 300`. The first run gave me this log:
```
100%|__________| 19/19 [07:00<00:00, 22.12s/Number of number of partitions]
2022-09-12 00:40:40 partition.py:l390 INFO The number of partitions has been evaluated at 9
```
So I used 9 as the `-K` value in the partition step.
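From memory, the rerun looked roughly like this (paths simplified):

```bash
# Second run: fix K at 9 so the (heavy) estimation of the number of partitions is skipped
ppanggolin partition -p pangenome/pangenome.h5 --cpu 40 --beta 0 --chunk_size 300 -K 9
```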
The partition step alone took around 7.5 hours:
```
Submit time  : 2022-09-12T01:04:07
Start time   : 2022-09-12T01:04:07
End time     : 2022-09-12T08:30:50
Elapsed time : 07:26:43 (Timelimit=10:00:00)

Job ID: 1730126
Cluster: i5
User/Group: redacted/bj
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 40
CPU Utilized: 9-15:13:30
CPU Efficiency: 77.64% of 12-09:48:40 core-walltime
Job Wall-clock time: 07:26:43
Memory Utilized: 4.44 TB
Memory Efficiency: 492.14% of 922.85 GB
```
Thanks a lot for this info! I'm glad you managed to make it work in the end.
4.4 TB of RAM is quite massive; how many genomes did you have in this run?
Sorry, I forgot to add this info: I ran with 16.3k genomes of one species.
Hmm, I agree 4 TB seems like a lot. Maybe this value is the amount of RAM that was dedicated to that particular task, rather than the amount actually used? I don't know enough about Slurm to know how these things work 😅
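If anyone wants to double-check, I think `sacct` can report the peak memory Slurm actually sampled for the job (field names as I understand them from the Slurm docs; job ID taken from the report above):

```bash
# MaxRSS = peak resident memory sampled per job step; ReqMem = memory requested.
# Comparing the two should show whether 4.44 TB was really used or is a reporting quirk.
sacct -j 1730126 --format=JobID,JobName,ReqMem,MaxRSS,Elapsed,State
```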
Hello,
We would like to build pangenomes for nearly 10k genomes, but we have trouble running PPanGGOLiN for >5k genomes. The output file says it successfully estimates the optimal number of partitions and then gets stuck after 'Launching NEM'. There is no other error. We were able to build pangenomes for 3k genomes (~450 GB). Initially we thought it could be a memory issue, but now the run fails even with 1.5 TB.
I have attached the output files for the 3k genomes run, which went through successfully, and the 5k genomes run, which got stuck at the partitioning stage.
Job_ouputs.zip
Are there any other configurations for running PPanGGOLiN on larger datasets?
Thank you!