josuebarrera / GenEra

GenEra is a fast and easy-to-use command-line tool that estimates the age of the last common ancestor of protein-coding gene families.
GNU General Public License v3.0

`ncbitax2lin` memory fix #8

Closed glarue closed 1 year ago

glarue commented 1 year ago

Hi @josuebarrera,

Two things:

  1. You note in the README that you have uploaded a compressed lineages file for people having memory issues with ncbitax2lin, but I can't seem to find the link/file itself. Have I overlooked something?
  2. I was able to avoid the `ncbitax2lin` memory issue by modifying its `fmt.py` file to reduce the number of workers in the call to `concurrent.futures.ProcessPoolExecutor()`, which defaults to the number of CPUs on the system. Adding `max_workers=<n>`, where `<n>` is some lower number, does this, e.g.:
     `with concurrent.futures.ProcessPoolExecutor() as executors:` -> `with concurrent.futures.ProcessPoolExecutor(max_workers=2) as executors:`
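
As a rough illustration of the change described in item 2, here is a minimal standalone sketch of capping the size of a process pool. It is not the code from `fmt.py`: the helper names `capped_workers` and `map_in_processes` are hypothetical, and the sketch assumes that memory use grows with the number of worker processes, which is what the workaround above suggests.

```python
import concurrent.futures
import os


def capped_workers(requested, available=None):
    """Return a worker count between 1 and the number of available CPUs."""
    if available is None:
        # os.cpu_count() can return None, so fall back to a single worker
        available = os.cpu_count() or 1
    return max(1, min(requested, available))


def map_in_processes(func, items, max_workers=2):
    """Apply func to each item in a process pool limited to max_workers."""
    workers = capped_workers(max_workers)
    # Without max_workers, ProcessPoolExecutor starts one process per CPU
    with concurrent.futures.ProcessPoolExecutor(max_workers=workers) as executor:
        # executor.map returns results in the same order as the inputs
        return list(executor.map(func, items))
```

In a script, a call such as `map_in_processes(len, ["a", "bb", "ccc"])` should be placed under an `if __name__ == "__main__":` guard, since some platforms re-import the main module when starting worker processes. A lower cap trades speed for a smaller memory footprint.
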

josuebarrera commented 1 year ago

Hello @glarue!

Thanks for pointing this out; I completely forgot to upload the file. You can find it here. Note that I had to compress it with lrzip because of GitHub's file size restrictions (files should be < 25MB), so you will need to install lrzip to decompress the file before feeding it to GenEra.

Regarding the memory problem with ncbitax2lin: I cannot modify the source code of fmt.py myself, but we can open an issue so that the repository owner can implement this change.

glarue commented 1 year ago

@josuebarrera thanks!

I mentioned the ncbitax2lin modification mostly in the hope that it might help someone else as a quick fix. Ideally, a full PR for that repo would propagate the option to ncbitax2lin's command-line arguments. That isn't something I was inclined to add myself, although it would be trivial enough to do.
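
To give an idea of what propagating the option to the command line could look like, here is a hedged sketch. How ncbitax2lin actually parses its arguments is not shown in this thread, so the sketch uses `argparse` purely for illustration; the `--max-workers` flag and the helpers `build_parser` and `executor_kwargs` are hypothetical names, not part of ncbitax2lin.

```python
import argparse


def build_parser():
    """Build a hypothetical parser that exposes a worker-limit option."""
    parser = argparse.ArgumentParser(description="Hypothetical CLI sketch")
    parser.add_argument(
        "--max-workers",
        type=int,
        default=None,  # None keeps ProcessPoolExecutor's default pool size
        help="upper bound on the number of worker processes",
    )
    return parser


def executor_kwargs(argv):
    """Turn command-line arguments into ProcessPoolExecutor keyword arguments."""
    args = build_parser().parse_args(argv)
    return {"max_workers": args.max_workers}
```

The resulting dictionary could then be passed along as `concurrent.futures.ProcessPoolExecutor(**executor_kwargs(argv))`, so that omitting the flag leaves the current behaviour unchanged.
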