josuebarrera / GenEra

genEra is a fast and easy-to-use command-line tool that estimates the age of the last common ancestor of protein-coding gene families.
GNU General Public License v3.0

HPC tuning #14

Closed Proginski closed 11 months ago

Proginski commented 1 year ago

Hi Josue,

I'm Paul, we met at the SMBE poster session last month. Thanks again for genEra!

I suppose HPC clusters are the most suitable platform for your tool, at least for processing entire, sizeable genomes, so here are two questions:

1 - Do you have a rough idea of genEra's memory consumption? In other words, is there any point in providing it with hundreds of GB of RAM?

2 - Regarding the potential read/write bottleneck: can we assume most of the computation happens in the temporary directory (-x argument), or are a lot of files read/written directly in the working directory? I was considering keeping my working directory on a storage space and having only the temporary directory directly on the HPC node.

I'm starting with Human, mouse, and Cie ... I'll let you know how it goes ;)

josuebarrera commented 1 year ago

Hello Paul!

Thank you for reaching out, I'm happy that we met during the SMBE!

As you say, an HPC is the most suitable environment for GenEra for the time being, given the vast amount of data that is generated before obtaining the final results. We're thinking of ways to solve this drawback, such as hosting GenEra on a public online server, but that may take some time to implement. Regarding your questions:

1 - A standard GenEra run does not consume too much RAM (e.g., < 50 GB RAM for big proteomes), and this peak in RAM consumption always happens during step 1 of the pipeline (i.e., when running DIAMOND against the database(s)), whereas the rest of the pipeline consumes < 5 GB RAM. There are two exceptions where the RAM consumption vastly increases: when running FoldSeek instead of DIAMOND (-Q and -B) and when running a protein-vs-genome search with MMseqs2 (-f). The reason is that both MMseqs2 and FoldSeek use lots of RAM to speed up their computation times, while DIAMOND is also super fast without needing that much memory. So you don't need to allocate hundreds of GB of RAM for a standard GenEra run.

2 - That is a very good question. Most of the reading/writing happens in the temporary directory. That is why we implemented the argument -x, so people can redirect all the heavy computation wherever they find appropriate (e.g., somewhere you know you can store huge files). So your idea is great, let me know if it worked out for you!
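Before pointing -x at a node-local path, it can help to confirm that the path exists and has enough free space. The following is a minimal Python sketch under stated assumptions: `pick_scratch_dir`, the `/scratch/local` path, and the size threshold are all hypothetical and not part of GenEra. Only the -x flag itself comes from this thread.

```python
import shutil
import tempfile
from pathlib import Path


def pick_scratch_dir(candidates, required_bytes):
    """Return the first existing directory with at least required_bytes free."""
    for candidate in candidates:
        path = Path(candidate)
        if not path.is_dir():
            continue  # skip paths that do not exist on this node
        if shutil.disk_usage(path).free >= required_bytes:
            return path
    return None


# Hypothetical candidates: a node-local scratch path first, then the system temp dir.
candidates = ["/scratch/local", tempfile.gettempdir()]
# 1 byte is a demo threshold only; use the space you expect your run to need.
scratch = pick_scratch_dir(candidates, required_bytes=1)

if scratch is not None:
    # Only -x comes from the thread; the other genEra arguments are left out on purpose.
    print(f"genEra ... -x {scratch}")
```

In a real job script the threshold would be set to the expected size of the intermediate files, which can be very large for big proteomes.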

And do let me know if you stumble into any issues while using the pipeline. Best of luck with your analyses!

Cheers, Josué.

LotharukpongJS commented 1 year ago

Dear Paul,

I would also like to add that we already have mouse gene age maps ("phylomaps") from the GenEra paper, and I also just uploaded the human phylomap that we produced when testing GenEra before the publication. I wrapped a few of these phylomaps up in a small R data package, https://github.com/LotharukpongJS/phylomapr, which might make your life easier :)

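# Install the data package from GitHub (requires devtools)
devtools::install_github("LotharukpongJS/phylomapr")
# Attach the package so that data() can find its datasets
library(phylomapr)
data("Homo_sapiens.PhyloMap")
data("Mus_musculus.PhyloMap")
> Homo_sapiens.PhyloMap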
# A tibble: 20,598 × 2
   Phylostratum GeneID                   
          <dbl> <chr>                    
 1            1 sp|A0A024RBG1|NUD4B_HUMAN
 2            1 sp|A0A075B6H7|KV37_HUMAN 
 3            1 sp|A0A075B6H8|KVD42_HUMAN
 4            1 sp|A0A075B6H9|LV469_HUMAN
 5            1 sp|A0A075B6I0|LV861_HUMAN
 6            1 sp|A0A075B6I1|LV460_HUMAN
 7            1 sp|A0A075B6I3|LVK55_HUMAN
 8            1 sp|A0A075B6I4|LVX54_HUMAN
 9            1 sp|A0A075B6I6|LV150_HUMAN
10            1 sp|A0A075B6I7|LV548_HUMAN
# ℹ 20,588 more rows
# ℹ Use `print(n = ...)` to see more rows
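
The GeneID values above follow the UniProt FASTA identifier layout, db|accession|entry_name, where sp marks a Swiss-Prot entry. If the phylomap has to be matched against tables keyed by accession, a minimal Python sketch for splitting the identifier could look like this (`parse_uniprot_id` is a hypothetical helper, not part of phylomapr or GenEra):

```python
def parse_uniprot_id(gene_id):
    """Split 'db|accession|entry_name' into its three fields."""
    db, accession, entry_name = gene_id.split("|")
    return {"db": db, "accession": accession, "entry_name": entry_name}


# Identifier taken from the first row of the tibble shown above.
record = parse_uniprot_id("sp|A0A024RBG1|NUD4B_HUMAN")
print(record["accession"])  # prints: A0A024RBG1
```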

Not sure what "Cie" is, but good luck with the analysis!

Best, Sodai

Proginski commented 1 year ago

Hi,

Thank you, Sodai, for the phylomaps, which I will probably use at some point!

By "and Cie" I meant "etc..." since I am currently processing quite a few genomes with genEra, in part to be comfortable with the tool. (I could share all the run stats with you if it can be of any use).

For small genomes, I do not have issues at the moment, but now that I have launched Hsap and Mmus, I realize it is going to take too much time at the current pace.

For Hsap:

Starting time of run: Tue Aug 22 15:48:46 CEST 2023
Step 1: 9606_Diamond_results.bout done in around 20 hours (630 GB)
Step 2: 9606_ncbi_lineages.csv done in around 8 hours

STARTING STEP 3: ASSIGNING AGES TO YOUR QUERY GENES WITH Erassignment

Running Erassignment using 95 threads

Step 3: 9606_gene_ages.tsv has only 1,711 / 145K genes after 4 days and 13 hours

Everything has gone well so far, with no warnings or error messages, but at this pace I will have my results in a year! For Mmus it is much the same, but a bit "faster" (~4,000 genes done). Is there a way I could speed up this last step?
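
The "a year" estimate can be checked with a short extrapolation. This is a minimal Python sketch that assumes the step 3 rate stays constant and takes "145K" as a round 145,000 genes. Both are assumptions, not measurements.

```python
from datetime import timedelta

genes_done = 1_711
genes_total = 145_000  # "145K" in the report, taken as a round figure (assumption)
elapsed = timedelta(days=4, hours=13)

elapsed_hours = elapsed.total_seconds() / 3600  # 109 hours
rate = genes_done / elapsed_hours               # genes assigned per hour
total_hours = genes_total / rate                # hours for the full gene set

print(f"{rate:.1f} genes/hour, about {total_hours / 24:.0f} days in total")
# prints: 15.7 genes/hour, about 385 days in total
```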

Paul

RocesV commented 1 year ago

Dear @Proginski,

I have been working with @josuebarrera and @LotharukpongJS for the last three months and, as you report, I also noticed that step 3 could be faster.

I am currently working on a faster implementation of step 3 that takes ~1.5 days per 50k genes to assign ages. The downside is that it needs more RAM and temporary files (210 GB of RAM for a 180 GB DIAMOND output, and as many temporary files as the total number of genes).
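
To get a feel for what these figures would mean for the human run reported earlier in the thread (145K genes, 630 GB DIAMOND output), here is a rough Python sketch. It assumes that both runtime and RAM scale linearly with input size, which is not confirmed anywhere in this thread, so treat the output as an order-of-magnitude guess only.

```python
# Figures quoted for the new step 3 implementation
days_per_50k_genes = 1.5
ram_gb, diamond_output_gb = 210, 180

# Figures from the human (Hsap) run reported earlier in the thread
hsap_genes = 145_000       # "145K", taken as a round figure (assumption)
hsap_diamond_gb = 630

# ASSUMPTION: linear scaling of runtime with gene count and of RAM with output size
est_days = hsap_genes / 50_000 * days_per_50k_genes
est_ram_gb = hsap_diamond_gb * ram_gb / diamond_output_gb

print(f"Estimated step 3 runtime: {est_days:.2f} days")
print(f"Estimated RAM: {est_ram_gb:.0f} GB")
```

Under these assumptions the runtime comes out at a few days, while the RAM requirement would exceed what many nodes offer.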

I think it should be ready as a pull request in one to two weeks. Thank you all very much for your attention! 😄

Cheers,

Víctor

Proginski commented 1 year ago

Dear Victor,

You guys have already done a great job. It is unfortunate indeed that this last step takes so many resources, since what I find brilliant in the idea of doing phylostratigraphy with DIAMOND v2 is precisely that it makes it accessible to almost everyone. I am sure you will find a way to improve it with time.

As I said, I am currently exploring genEra features because I am thinking of a possible application to my particular topic (I mentioned it in Ferrara with Josue). I will contact you guys to see if we can do something about it. And so of course I can't wait to try this new PR ;)

Best,

Paul

RocesV commented 11 months ago

Dear @Proginski ,

The new pull request has now been merged, so feel free to give it a try, and let me know if you find any issues! 😄 You can find further technical details, along with a toy example, in the pull request. I hope that this faster implementation can help in your research!

Cheers,

Víctor

josuebarrera commented 11 months ago

Dear @Proginski,

As stated by @RocesV, we just released the new version of GenEra, which adds an ultra-fast mode for step 3 and integrates infraspecies-level gene age assignments to detect recently evolved genes. I just did some tests, and fast mode seems to work perfectly fine, with results that are virtually the same as those obtained with previous versions of GenEra. You can download the latest release here. I'll close this issue thread for now, but please let us know if you stumble into any other bugs or problems using GenEra.

Cheers, Josué.