willbradshaw opened 5 months ago
How big is the output of `gimme_taxa.py`? Would it help to just generate and save that during index generation, then copy it over along with the index? (Apologies if I'm not understanding the current process here and my question doesn't make sense.)
Otherwise speeding this up seems doable and perhaps even fun :)
Taking a quick look at the Python script, one easy option might be to parallelize the first for loop, https://github.com/naobservatory/mgs-workflow/blob/302d2d5b87cd81b82c3dc296be71f78518c3f3e8/modules/local/getHvDescendents/main.nf#L29, though I'm guessing bigger gains can be had elsewhere.
Is this the correct source of `gimme_taxa.py`? Or does the version you're using come directly from NCBI?
> Would it help to just generate and save that during index generation, then copy it over along with the index?
I might be misunderstanding your question, but this is part of the index workflow so it is already generated and saved during index generation.
> Is this the correct source of `gimme_taxa.py`?
That looks right. It's this line in the container file:
`RUN pip install ncbi-genome-download`
Ok, thanks for the clarifications! Given that this is only run in the index workflow, it may not be worth optimizing the code itself to be more efficient, but we can still speed things up by parallelizing the for loop that builds the `human_viruses` dictionary (assuming this process is being run on a multi-core instance). I'm not familiar with the functions/packages for doing this in Python, but based on my experience in R it should be straightforward.
@jeffkaufman do you have a recommended package/paradigm in Python for parallelizing a for loop where each iteration adds a key:value pair to a dictionary?
First, an infrastructure thing: it looks to me like GET_HV_DESCENDENTS should be calling a Python script, not including the script inline. For ~5 lines or so, inline is fine, but this is much longer. Having it separate would allow us to (a) have better text editor support and (b) run and test it independently.
That said, to parallelize this I would use multiprocessing. Note that the work needs to happen inside a function guarded by the `if __name__ == "__main__"` convention, since multiprocessing imports `__file__` in the children. Something like (not run or tested):
```python
from collections import defaultdict
from multiprocessing import Pool

# This function is run in a separate process, so don't communicate anything through
# global state modifications
def get_descendants(human_virus_taxid):
    # call gimme taxa
    ...
    # parse results
    ...
    return human_virus_taxid, descendant_taxids

def start():
    human_viruses = load_human_viruses()
    taxid_descendents = defaultdict(list)
    taxid_unique = set(human_viruses)

    # I had initially thought the code used a remote DB and was blocked on networking,
    # but reading the ete3 documentation it looks like it downloads the DB locally. So
    # this should be set to the number of cores (as you want for CPU-bound processes)
    # and not something higher (what you want for slow-remote-server-bound processes).
    pool = Pool(processes=32)

    for human_virus_taxid, descendant_taxids in pool.map(get_descendants, human_viruses):
        taxid_unique.update(descendant_taxids)
        taxid_descendents[human_virus_taxid] = descendant_taxids

    # write output
    ...

if __name__ == "__main__":
    start()
```
Looking at https://github.com/kblin/ncbi-genome-download/blob/master/contrib/gimme_taxa.py I think you could also simplify this a lot by staying in Python instead of going to shell. The `gimme_taxa` script seems to be a wrapper around `ncbi.get_descendant_taxa`.
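For illustration, a minimal sketch (untested) of what calling ete3 directly could look like, assuming the documented `NCBITaxa.get_descendant_taxa` behavior; the taxid 10298 (human alphaherpesvirus 1) is just a placeholder:

```python
from ete3 import NCBITaxa

ncbi = NCBITaxa()  # opens (and on first use downloads/builds) the local taxonomy DB

# intermediate_nodes=True returns internal nodes as well as leaf taxa
descendant_taxids = ncbi.get_descendant_taxa(10298, intermediate_nodes=True)
print(len(descendant_taxids), "descendants of taxid 10298")
```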
One thing that could go wrong here is that the very first call will download the database (more). If you've already started multiprocessing at that stage you might corrupt the DB, or at least download it 32x. So the script should run `gimme_taxa.py`, or call `NCBITaxa()`, before starting its pool.
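To make that ordering concrete, here's a rough sketch (untested) of warming up the local ete3 database in the parent before forking any workers; the taxid list and worker count are placeholders:

```python
from multiprocessing import Pool

from ete3 import NCBITaxa

# Placeholder input; in the real workflow this would come from the human-viruses list
HUMAN_VIRUS_TAXIDS = [10298, 11676]

def get_descendants(taxid):
    ncbi = NCBITaxa()  # workers reuse the DB the parent already built
    return taxid, ncbi.get_descendant_taxa(taxid, intermediate_nodes=True)

def start():
    NCBITaxa()  # triggers the one-time DB download/build in the parent process
    with Pool(processes=8) as pool:
        return dict(pool.map(get_descendants, HUMAN_VIRUS_TAXIDS))

if __name__ == "__main__":
    print(start())
```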
Now that INDEX:DOWNLOAD_BLAST_DB reads from AWS, the biggest bottleneck in the index workflow is INDEX:MAKE_HUMAN_VIRUS_DB:GET_HV_DESCENDENTS. This relies on the NCBI script `gimme_taxa.py`, which takes a list of taxids and returns a list of all descendant taxids (including some not included in the taxonomy dump files obtained by INDEX:GET_NCBI_TAXONOMY:DOWNLOAD_NCBI_TAXONOMY). This script is very slow.

It seems like this shouldn't be too hard to replicate independently in a much faster way. Doing so should allow the index workflow to run in a single workday, rather than overnight.
This is a lower priority than bottlenecks in the main workflow, since the index workflow isn't run very often. But it would still be helpful to make it faster.