pangenome / pggb

the pangenome graph builder
https://doi.org/10.1038/s41592-024-02430-3
MIT License

Run times slow with 76 genomes #298

Closed. brettChapman closed this issue 9 months ago

brettChapman commented 1 year ago

Hi

Continuing the discussion from https://github.com/waveygang/wfmash/issues/171

PGGB is running slowly with 76 haplotypes, run per chromosome on assembled pseudomolecules, for genomes that are around 4-5 Gb in size.

-s 100Kbp -p 93 -k 316 poa_params="asm20" poa_length_target="700,900,1100" transclose_batch=10000000

All remaining parameters are at their defaults.
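
For context, those settings map onto a pggb command line roughly like the following (a sketch only; the input path, output directory, haplotype count -n, and thread count -t are placeholders I'm assuming, while -s/-p/-k/-P/-G/-B mirror the parameters above):

# Hypothetical per-chromosome invocation; adjust paths, -n, and -t to your setup.
pggb -i chr1H.fasta -n 76 -t 32 -o chr1H_out \
     -s 100000 -p 93 -k 316 \
     -P asm20 -G 700,900,1100 -B 10000000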

brettChapman commented 1 year ago

The current run time is approaching 50 days, possibly more.

ekg commented 1 year ago

My first question is whether the installed version of pggb was built for an older instruction set, leading to narrower vector instructions and slower processing.

We have dealt with this before. I'm not sure of the current state of the binaries available from Docker Hub and conda.
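
On Linux, one generic way to check which vector extensions the host CPU advertises (not pggb-specific, just a quick sanity check):

# List the SIMD feature flags reported by the kernel.
grep -o -E 'sse4_2|avx2|avx512[a-z]*' /proc/cpuinfo | sort -u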

brettChapman commented 1 year ago

My installed version, which I pulled from Docker Hub using Singularity, is 0.5.3 from February 10.
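
For reproducibility, pulling and version-checking the container looks roughly like this (a sketch; the registry path and tag are assumptions, as is pggb exposing a --version flag in the image you use):

# Pull the image and ask the contained pggb for its version (paths/tags assumed).
singularity pull docker://ghcr.io/pangenome/pggb:latest
singularity exec pggb_latest.sif pggb --version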

AndreaGuarracino commented 1 year ago

Docker/Singularity images should be about 30% slower than building from GitHub source (at least on our cluster).
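
If you do build from source, a typical smoothxg build follows its README (a sketch; the -j value is an assumption about available cores):

# A native build lets the compiler target the host CPU's vector units.
git clone --recursive https://github.com/pangenome/smoothxg
cd smoothxg
cmake -H. -Bbuild -DCMAKE_BUILD_TYPE=Release
cmake --build build -- -j 16
bin/smoothxg --help   # the built binary should land in bin/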

Can you also share your current (I suppose very long) PGGB .log file?

brettChapman commented 1 year ago

Surprisingly, chromosome 1 has just completed after running for 50 days. Chromosome 1 usually completes first; I expect the other chromosomes will take an additional week or two.

I can provide the log file, but it's 1 Gb in size. How can I get it to you?

brettChapman commented 1 year ago

I just gzipped the log file; it's now down to 25 MB. What's the limit for file attachments here?

AndreaGuarracino commented 1 year ago

LOL! Now it is a nice size! I think sharing it on GitHub could work. Or you could put the file temporarily on Google Drive or similar.

I would like to check whether your bottlenecks are in wfmash mapping and/or alignment, the GFA->ODGI conversion (it happens in smoothxg), the partial order alignment in smoothxg, etc.
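
One way to skim a compressed log like that for per-stage progress lines (the filename is a placeholder, and the pattern just guesses at the stage names appearing in the log):

# Pull progress lines from each pipeline stage out of the gzipped log.
zgrep -E 'wfmash|seqwish|smoothxg|odgi|gfaffix' pggb_run.log.gz | less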

brettChapman commented 1 year ago

I've attached the gzipped log file. Hopefully no issues with the attachment.

barley_pangenome_1H.fasta.4f79ff6.371d99c.2f0e65c.smooth.03-29-2023_07:39:12.log.gz

AndreaGuarracino commented 1 year ago

The log doesn't look 100% healthy, but I can see that the first round of "path fragments embedding" took ~18 days! I suppose the other two rounds took similar times. That's surprising.

zgrep 'path fragments' barley_pangenome_1H.fasta.4f79ff6.371d99c.2f0e65c.smooth.03-29-2023_07.39.12.log.gz

[smoothxg::(1-3)::smooth_and_lace] embedding 135544223 path fragments: 100.00% @ 8.57e+01/s elapsed: 18:07:07:39 remain: 00:00:00:00
[smoothxg::(2-3)::smooth_and_lace] embedding 108563793 path fragments:  1.96% @ 2.56e+01/s elapsed: 00:23:07:27 remain: 48:04:05:45
gfaffix barley_pangenome_1H_s100000_l0_p93_k316_B10000000_G700-900-1100_Pasm20/barley_pangenome_1H.fasta.4f79ff6.371d99c.2f0e65c.smooth.gfa -o barley_pangenome_1H_s100000_l0_p93_k316_B10000000_G700-900-1100_Pasm20/barley_pangenome_1H.fasta.4f79ff6.371d99c.2f0e65c.smooth.fix.gfa

AndreaGuarracino commented 1 year ago

Hi @brettChapman, sorry for the extremely long wait. I worked on smoothxg recently and I am still finalizing several hacks to improve both memory usage and runtime. In your case, the "path fragments embedding" step takes a lot of time; that step is currently single-threaded. In https://github.com/pangenome/smoothxg/pull/197 there is a version of smoothxg that parallelizes this step and also introduces several memory optimizations.

If you can also work with GitHub branches, it would be helpful if you could run the same smoothxg command line using the avoid_2_graphs_in_memory branch. With 32 threads, I hope the path fragments embedding will finish in a reasonable number of hours.
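
Switching an existing source checkout to that branch (the branch name is from the comment above; the build steps follow the smoothxg README) might look like:

cd smoothxg
git fetch origin
git checkout avoid_2_graphs_in_memory
git submodule update --init --recursive
cmake -H. -Bbuild -DCMAKE_BUILD_TYPE=Release && cmake --build build -- -j 16
# Then rerun the same smoothxg command line, adding -t 32 for 32 threads.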

AndreaGuarracino commented 9 months ago

@brettChapman were you lucky enough to try the updated smoothxg (or pggb) with fewer issues?

brettChapman commented 9 months ago

Hi @AndreaGuarracino

Yes, I've now used the latest version and found that smoothxg runs a lot faster.

Recently we've gained access to a larger cluster (at a higher cost) with SSDs and 2 TB of RAM. Our PGGB jobs run significantly faster there, cutting months off the run time. The systems we previously had access to were limited to mechanical drives and less RAM, but they were publicly funded resources.

AndreaGuarracino commented 9 months ago

Thanks for the update! Saving months of compute should also be good for the environment and global warming xD