naobservatory / mgs-workflow


`GET_HV_DESCENDENTS` is very slow #27

Open willbradshaw opened 5 months ago

willbradshaw commented 5 months ago

Now that INDEX:DOWNLOAD_BLAST_DB reads from AWS, the biggest bottleneck in the index workflow is INDEX:MAKE_HUMAN_VIRUS_DB:GET_HV_DESCENDENTS. This relies on the NCBI script gimme_taxa.py, which takes a list of taxids and returns a list of all descendant taxids (including some not included in the taxonomy dump files obtained by INDEX:GET_NCBI_TAXONOMY:DOWNLOAD_NCBI_TAXONOMY). This script is very slow.

It seems like this shouldn't be too hard to replicate independently in a much faster way. Doing so should allow the index workflow to run in a single workday, rather than overnight.
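A rough, untested sketch of what that replication might look like, working directly from nodes.dmp in the taxonomy dump (the layout assumed here is the standard taxdump format, with fields separated by "\t|\t" and the taxid and parent taxid in the first two columns; note that this alone wouldn't recover the descendant taxids gimme_taxa.py returns that are missing from the dump files):

from collections import defaultdict, deque

def load_children(nodes_dmp_path):
    # Build a parent -> [children] map from nodes.dmp (standard taxdump format).
    children = defaultdict(list)
    with open(nodes_dmp_path) as f:
        for line in f:
            fields = line.split("\t|\t")
            taxid, parent = fields[0].strip(), fields[1].strip()
            if taxid != parent:  # the root (taxid 1) lists itself as its own parent
                children[parent].append(taxid)
    return children

def get_descendants(root_taxid, children):
    # Return all descendant taxids of root_taxid via breadth-first traversal.
    descendants = []
    queue = deque(children[root_taxid])
    while queue:
        taxid = queue.popleft()
        descendants.append(taxid)
        queue.extend(children[taxid])
    return descendants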

This is a lower priority than bottlenecks in the main workflow, since the index workflow isn't run very often. But it would still be helpful to make it faster.

mikemc commented 5 months ago

How big is the output of gimme_taxa.py? Would it help to just generate and save that during index generation, then copy it over along with the index? (Apologies if I'm not understanding the current process here and my question doesn't make sense)

Otherwise speeding this up seems doable and perhaps even fun :)

Taking a quick look at the Python script, one easy option might be to parallelize the first for loop (https://github.com/naobservatory/mgs-workflow/blob/302d2d5b87cd81b82c3dc296be71f78518c3f3e8/modules/local/getHvDescendents/main.nf#L29), though I'm guessing bigger gains can be had elsewhere.

Is this the correct source of gimme_taxa.py? Or does the version you're using come directly from NCBI?

willbradshaw commented 5 months ago

> Would it help to just generate and save that during index generation, then copy it over along with the index?

I might be misunderstanding your question, but this is part of the index workflow so it is already generated and saved during index generation.

> Is this the correct source of gimme_taxa.py?

That looks right. It's this line in the container file:

RUN pip install ncbi-genome-download

mikemc commented 5 months ago

Ok, thanks for the clarifications! Given that this is only run in the index workflow, it may not be worth optimizing the code itself to be more efficient, but we can still speed things up by parallelizing the for loop that builds the human_viruses dictionary (assuming this process is run on a multi-core instance). I'm not familiar with the functions/packages for doing this in Python, but based on my experience in R it should be straightforward.

@jeffkaufman do you have a recommended package/paradigm in python for parallelizing a for loop, where each iteration is adding a key:item pair to a dictionary?

jeffkaufman commented 5 months ago

First, an infrastructure thing: It looks to me like GET_HV_DESCENDENTS should be calling a python script, and not including a script inline. For ~5 lines or so inline is fine, but this is much longer. Having it separate would allow us to (a) have better text editor support and (b) run and test it independently.

That said, to parallelize this I would use multiprocessing. This looks like:

  1. Convert the script to not do anything on pure import (use the if __name__ == "__main__" convention), since multiprocessing re-imports the main module in the child processes.
  2. Pull the fetching code out into a function
  3. Use a multiprocessing Pool.

Something like (not run or tested):

from collections import defaultdict
from multiprocessing import Pool

# This function is run in a separate process, so don't communicate anything through
# global state modifications.
def get_descendants(human_virus_taxid):
  # call gimme taxa
  ...
  # parse results
  ...
  return human_virus_taxid, descendant_taxids

def start():
  human_viruses = load_human_viruses()  # placeholder: read the human-virus taxid list
  taxid_descendents = defaultdict(list)
  taxid_unique = set(human_viruses)

  # I had initially thought the code used a remote DB and was blocked on networking,
  # but reading the ete3 documentation it looks like it downloads the DB locally.  So
  # this should be set to the number of cores (as you want for CPU-bound processes)
  # and not something higher (what you want for slow-remote-server-bound processes).
  with Pool(processes=32) as pool:
    for human_virus_taxid, descendant_taxids in pool.map(get_descendants, human_viruses):
      taxid_unique.update(descendant_taxids)
      taxid_descendents[human_virus_taxid] = descendant_taxids

  # write output
  ...

if __name__ == "__main__":
  start()

Looking at https://github.com/kblin/ncbi-genome-download/blob/master/contrib/gimme_taxa.py, I think you could also simplify this a lot by staying in Python instead of going out to the shell. The gimme_taxa script seems to be a wrapper around ncbi.get_descendant_taxa.
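For example, each per-taxid fetch could stay in Python, something like this (untested; intermediate_nodes=True and the example taxid, 11676 for HIV-1, are just guesses at what the current script needs):

from ete3 import NCBITaxa

ncbi = NCBITaxa()  # opens (or builds) the local taxonomy SQLite database
# intermediate_nodes=True returns internal taxa as well as leaves
descendant_taxids = ncbi.get_descendant_taxa(11676, intermediate_nodes=True)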

One thing that could go wrong here is that the very first call will download the database. If you've already started multiprocessing at that stage, you might corrupt the DB, or at least download it 32x. So the script should run gimme_taxa.py or call NCBITaxa() before starting its pool.
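Roughly (again not run or tested; the example taxid list is a placeholder for whatever the script currently reads in):

from multiprocessing import Pool

from ete3 import NCBITaxa

def get_descendants(taxid):
    # Each worker opens its own NCBITaxa handle; the database already exists
    # on disk by this point, so nothing gets re-downloaded.
    ncbi = NCBITaxa()
    return taxid, ncbi.get_descendant_taxa(taxid, intermediate_nodes=True)

if __name__ == "__main__":
    NCBITaxa()  # warm-up: triggers the one-time database download before forking
    human_virus_taxids = [11676]  # placeholder example (HIV-1); read from the input file in practice
    with Pool(processes=32) as pool:
        results = pool.map(get_descendants, human_virus_taxids)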