shandley / hecatomb

hecatomb is a virome analysis pipeline for Illumina sequence data
MIT License

Enhancement: add whitelist to pre-processing #84

Open dhoconno opened 1 year ago

dhoconno commented 1 year ago

Thanks for this very useful workflow. To reduce runtimes, can I suggest adding a 'whitelist' rule to preprocessing? This could cut them considerably when the targets are limited (e.g., when only known human viruses are of interest).

I think the implementation could be straightforward:

Thanks for your consideration!

beardymcjohnface commented 1 year ago

Hi, this is an interesting suggestion. Do you think having an option to use a custom primary database for the viruses of interest would work? The primary searches do essentially what you're suggesting, but for all viruses, and the secondary multi-kingdom searches weed out the false positives from this reduced pool of sequences.

dhoconno commented 1 year ago

Yep, absolutely. Depending on how much database prep is needed for the primary database, I could envision situations where providing a FASTA whitelist file would be simpler and wouldn't require modifying the virus database. If the primary database is already just a FASTA file of all viruses, then specifying a custom FASTA file of, say, all human viruses would be great.
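
For illustration only, here is a minimal sketch of what a FASTA whitelist filter could look like, assuming a simple shared k-mer test. None of this is hecatomb code: the function names (parse_fasta, build_whitelist_index, filter_reads) and the parameters k and min_hits are hypothetical, and a real workflow would more likely hand this step to a dedicated aligner or k-mer tool rather than pure Python.

```python
from typing import Dict, Iterable, Iterator, Set, Tuple

# Translation table for reverse-complementing DNA (hypothetical helper).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_whitelist_index(fasta_lines: Iterable[str], k: int = 21) -> Set[str]:
    """Collect k-mers from both strands of every whitelist sequence."""
    index: Set[str] = set()
    for _, seq in parse_fasta(fasta_lines):
        index |= kmers(seq, k)
        index |= kmers(reverse_complement(seq), k)
    return index


def filter_reads(reads: Dict[str, str], index: Set[str],
                 k: int = 21, min_hits: int = 1) -> Dict[str, str]:
    """Keep only reads sharing at least min_hits distinct k-mers with the index."""
    kept = {}
    for name, seq in reads.items():
        if len(kmers(seq.upper(), k) & index) >= min_hits:
            kept[name] = seq
    return kept


if __name__ == "__main__":
    # Toy whitelist and reads; a tiny k is used only so the example is readable.
    whitelist = [">virusA", "ACGTACGTTGCA"]
    index = build_whitelist_index(whitelist, k=5)
    reads = {"r1": "CGTACGTT", "r2": "GGGGGGGG"}
    print(sorted(filter_reads(reads, index, k=5)))
```

In this sketch, reads with no k-mer in common with the whitelist are discarded before any downstream search. Whether an exact k-mer match is sensitive enough for divergent viruses is an open question that the thread does not address.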