Profiling eukaryote contigs?

timghaly commented 9 months ago

This looks like a great tool.

I'm wondering, though, how well Metabuli would perform at classifying environmental microeukaryotes, particularly because it looks like Prodigal is used to generate the databases, which is not ideal for eukaryotic gene prediction. Would Metabuli outperform MMseqs2 Taxonomy with the nr database for assigning eukaryotic taxonomy to contigs? If so, which Metabuli database would be best suited?

Thanks!

jaebeom-kim commented 9 months ago

Thank you for reaching out! Great question!

1. You are right. Prodigal was developed for prokaryotic genomes, so the ORFs it predicts for eukaryotes are not meaningful. However, even with the wrong ORFs, exact DNA 24-mer matches can still be found because each query read is translated in all possible reading frames. So I'd say Metabuli can be as good as other DNA k-mer-based tools for eukaryotes.

2. With a protein-based search like MMseqs2, reads from intergenic regions cannot be mapped to any sequence in the database. If your contigs are long enough to contain at least one protein-coding gene, that would be fine. For eukaryotes, Metabuli's advantage over MMseqs2 is its ability to use non-coding / intergenic regions. However, we don't have a pre-built eukaryote database yet. We are planning to provide an index built from NCBI's nt database. I hope it will help you.
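
To illustrate point 1, here is a toy sketch in Python. It is not Metabuli's code: the function names are hypothetical, and it assumes, for illustration only, that k-mers are taken at codon boundaries of a frame. The reference is indexed in one arbitrary frame (standing in for a "wrong" ORF), while the read is scanned in all three frames on both strands, so one of its frames lines up with the reference frame.

```python
import random

# Illustrative sketch only; names and structure are assumptions,
# not Metabuli's implementation.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase ACGT string."""
    return seq.translate(COMPLEMENT)[::-1]


def codon_aligned_kmers(seq: str, frame: int, k: int = 24) -> set:
    """DNA k-mers starting on codon boundaries of one forward frame (0, 1, or 2)."""
    return {seq[i:i + k] for i in range(frame, len(seq) - k + 1, 3)}


def six_frame_kmers(seq: str, k: int = 24) -> set:
    """Codon-aligned k-mers from all three frames on both strands."""
    kmers = set()
    for strand in (seq, reverse_complement(seq)):
        for frame in range(3):
            kmers |= codon_aligned_kmers(strand, frame, k)
    return kmers


# Toy reference: a fixed pseudo-random 120 nt sequence.
rng = random.Random(0)
reference = "".join(rng.choices("ACGT", k=120))

# Toy read: a 60 nt fragment sequenced from the opposite strand.
read = reverse_complement(reference[10:70])
read_kmers = six_frame_kmers(read)

# Whichever single frame the reference was indexed in, some frame of the
# read shares exact 24-mers with it.
shared_per_frame = [
    read_kmers & codon_aligned_kmers(reference, frame) for frame in range(3)
]
print([len(shared) for shared in shared_per_frame])
```

The same reasoning applies to intergenic sequence: no correct ORF is needed on the reference side, only that the read is scanned in every frame.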

Thank you again:)

timghaly commented 9 months ago

Okay, great. Thanks for your answer. I will give it a go after you release the indexed nt database.

Thanks for your help!

JonathonMifsud commented 9 months ago

+1 on an indexed nt database, this would be very useful!

steineggerlab / Metabuli

Profiling eukaryote contigs? #54