treangenlab / emu

How are novel species annotated? #21

Closed ashleyp1 closed 3 weeks ago

ashleyp1 commented 2 months ago

Thanks for making this great tool! I have a clarifying question I haven't been able to find an answer to. How does EMU handle novel species?

For example, I'm working with soil microbiomes that have been inoculated with a microbial consortium. I know I have a novel Chryseobacterium species that is pretty firmly in the genus Chryseobacterium but is otherwise quite different from all known species of that genus. Using a tool like sintax, my read assignments for this strain only go down to the genus level because it is not confident at the species level. However, EMU appears to always go down to the species level for read assignments. Does EMU have any way to determine the confidence for each taxonomic level? Can it say a read belongs to "genus Chryseobacterium, species unknown", or will it always assign a species?

Thanks!

kdc10 commented 1 month ago

It depends on the minimap2 alignment. If minimap2 finds an alignment it considers good enough to report, a classification is forced (with the default emu database this is at the species level, since all database entries are at the species level). If no alignment is strong enough for minimap2 to report, the sequence is left unclassified. We demonstrate this in the publication (https://www.nature.com/articles/s41592-022-01520-4), where a novel Romboutsia species is classified as a different Romboutsia species that is in the database. If your study calls for a different specificity/sensitivity tradeoff, you can alter the minimap2 parameters.
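To get a feel for how many reads fall into the "left unclassified" case, one option is to tally reads with and without an alignment in the minimap2 SAM output. The following is a minimal sketch, not part of emu: it assumes you have kept the SAM file and that it still contains records for unaligned reads (flag bit 0x4 in the SAM specification). The function name `count_mapped_unmapped` is hypothetical.

```python
from typing import Iterable, Tuple

UNMAPPED_FLAG = 0x4  # SAM flag bit meaning "segment unmapped"


def count_mapped_unmapped(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (reads with at least one alignment, reads with none).

    Hypothetical helper: counts distinct read names, so secondary or
    supplementary records for the same read are not double counted.
    """
    mapped = set()
    unmapped = set()
    for line in lines:
        # Skip header lines (start with '@') and blank lines.
        if line.startswith("@") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        read_name, flag = fields[0], int(fields[1])
        if flag & UNMAPPED_FLAG:
            unmapped.add(read_name)
        else:
            mapped.add(read_name)
    # A read counts as unmapped only if it has no aligned record at all.
    return len(mapped), len(unmapped - mapped)
```

Passing an open file handle (for example `count_mapped_unmapped(open("alignments.sam"))`, where the file name is a placeholder) gives the two counts. This says nothing about how confident each species call is; it only shows how often minimap2 reported no alignment.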

ashleyp1 commented 1 month ago

Do you have any recommendations for which minimap2 parameters to change? I'm not very familiar with it, and there seem to be a lot of options.

kdc10 commented 1 month ago

The minimal peak DP alignment score (-s) may be the most straightforward, or perhaps the Z-drop score (-z), which limits how far the running alignment score may drop. Alternatively, you could take your minimap2 output SAM file, filter it by a minimum chaining score (the s1 tag; I suspect samtools can handle this), and then run emu on the filtered SAM file. With any of these options, it will take some thought and effort to establish a suitable threshold.
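The SAM filtering idea can be sketched in plain Python, in case a scriptable version is easier to experiment with. This is only an illustration under stated assumptions: the function name `filter_sam_by_s1` is hypothetical, the threshold is something you would need to choose for your own data, and records that carry no `s1:i:` tag (such as unaligned reads) are dropped.

```python
import re
from typing import Iterable, Iterator

# minimap2 writes the chaining score as an optional integer tag, e.g. "s1:i:250".
S1_TAG = re.compile(r"^s1:i:(-?\d+)$")


def filter_sam_by_s1(lines: Iterable[str], min_s1: int) -> Iterator[str]:
    """Yield header lines plus alignment records whose s1 tag is >= min_s1.

    Hypothetical helper; min_s1 is a user-chosen threshold, not a
    recommended value.
    """
    for line in lines:
        if line.startswith("@"):
            yield line  # keep the SAM header untouched
            continue
        fields = line.rstrip("\n").split("\t")
        score = None
        # Optional tags start at the 12th column of a SAM record.
        for tag in fields[11:]:
            match = S1_TAG.match(tag)
            if match:
                score = int(match.group(1))
                break
        if score is not None and score >= min_s1:
            yield line
```

A typical use would be to read the original SAM file line by line, pass the lines through `filter_sam_by_s1` with a candidate threshold, and write the result to a new file that is then given to emu. Trying a few thresholds on reads of known origin is one way to see where the tradeoff sits.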