Closed Alexander-Jorjorian closed 4 months ago
Hey @Alexander-Jorjorian,
Thanks for submitting this! Would you mind sharing where you are in the implementation process (pull request, etc.)? Additionally, are you aware of Pipeline Parallelization? We added this functionality in QIIME 2 2023.5, and it may support exactly what you're hoping to achieve (if you haven't already fully implemented this in another way).
I'm going to close this as we haven't heard back. But feel free to reopen at any point @Alexander-Jorjorian!
Improvement Description
Have find_consensus_annotation process the blast6-formatted taxonomy hits from the classifiers in parallel, in batches.
Current Behavior
Currently, this function processes the entire taxonomy hit file in a single-threaded manner and holds all of the hits in memory.
Proposed Behavior
Instead, split the input into a list of data frames and process these in batches using parallelism. This will have two benefits: it will allow us to use multiple CPUs, as we do for the BLAST functionality, and it should reduce how many hits have to be worked on in memory at once.
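To make the proposal concrete, here is a minimal sketch of the batching idea using only the Python standard library. It is not the actual find_consensus_annotation code: the helper names (`group_hits_by_query`, `make_batches`, `toy_consensus`, `consensus_in_batches`) are hypothetical, hits are simplified to `(query_id, taxonomy)` pairs, and the majority vote is only a stand-in for the real consensus logic. The one point it is meant to illustrate is that all hits for a given query have to land in the same batch, otherwise the per-query consensus would be computed on partial data.

```python
from collections import Counter, defaultdict


def group_hits_by_query(hits):
    """Group (query_id, taxonomy) pairs so each query's hits stay together."""
    grouped = defaultdict(list)
    for query_id, taxonomy in hits:
        grouped[query_id].append(taxonomy)
    return grouped


def make_batches(grouped, n_batches):
    """Distribute whole queries round-robin over at most n_batches batches."""
    batches = [dict() for _ in range(n_batches)]
    for i, (query_id, taxa) in enumerate(grouped.items()):
        batches[i % n_batches][query_id] = taxa
    # Drop empty batches (fewer queries than requested batches).
    return [batch for batch in batches if batch]


def toy_consensus(batch):
    """Stand-in for the real consensus step: most common taxonomy per query."""
    return {
        query_id: Counter(taxa).most_common(1)[0][0]
        for query_id, taxa in batch.items()
    }


def consensus_in_batches(hits, n_batches, executor):
    """Run the per-batch consensus on the given executor and merge results."""
    batches = make_batches(group_hits_by_query(hits), n_batches)
    merged = {}
    for partial in executor.map(toy_consensus, batches):
        merged.update(partial)
    return merged
```

The executor is passed in so the same code can run on a `concurrent.futures.ProcessPoolExecutor` to use multiple CPUs, or on a `ThreadPoolExecutor` for quick testing. Note that this sketch still groups the full hit list up front, so by itself it does not lower peak memory; that would additionally require reading the hits file in chunks.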
Questions
Are there any reasons why this approach is not viable?
Comments
I am currently working on implementing this functionality. The main benefit will be improved performance and stability on large, complex datasets.