oschwengers / bakta

Rapid & standardized annotation of bacterial genomes, MAGs & plasmids
GNU General Public License v3.0

cmscan for rRNA identification is probably configured incorrectly and does NOT use CM-only mode #315

richardstoeckl closed 1 month ago

richardstoeckl commented 1 month ago

Hi Oliver,

I was looking through the code that identifies rRNA genes and noticed that, for the v1 release, you switched to a cmscan approach that passes the parameter --nohmmonly, with the comment "# strictly use CM models". I assume this means cmscan is supposed to run in CM-only mode.

However, in the Infernal manual there is no --nohmmonly mode; the only options similar to this are --nohmm and --hmmonly, which do exactly opposite things!

(see page 89 of the User Guide) [screenshot]

From the inline comment, I assume that you want to use the --nohmm mode, so I tested which mode cmscan chooses when the incorrect --nohmmonly option is given:

cmscan --nohmmonly --verbose -g SSU_rRNA_archaea.cm genome.fasta
# cmscan :: search sequence(s) against a CM database
# INFERNAL 1.1.5 (Sep 2023)
# Copyright (C) 2023 Howard Hughes Medical Institute.
# Freely distributed under the BSD open source license.
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# query sequence file:                   genome.fasta
# target CM database:                    SSU_rRNA_archaea.cm
# CM configuration:                      glocal
# verbose output mode:                   on
# HMM-only mode for 0 basepair models:   no
# number of worker threads:              4
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

compared to the --nohmm option:

cmscan --nohmm --verbose -g SSU_rRNA_archaea.cm genome.fasta 
# cmscan :: search sequence(s) against a CM database
# INFERNAL 1.1.5 (Sep 2023)
# Copyright (C) 2023 Howard Hughes Medical Institute.
# Freely distributed under the BSD open source license.
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# query sequence file:                   genome.fasta
# target CM database:                    SSU_rRNA_archaea.cm
# CM configuration:                      glocal
# verbose output mode:                   on
# CM-only mode:                          on [HMM filters off]
# truncated hit detection:               off [due to --nohmm]
# number of worker threads:              4
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
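
The two banners above can also be compared programmatically instead of by eye. The following is a minimal sketch in Python, assuming the banner format shown in this issue (settings printed as `# key: value` lines). The names `parse_banner` and `differing_settings` are hypothetical helpers for illustration and are not part of bakta or Infernal.

```python
import re

# Setting lines copied from the two cmscan runs shown in this issue.
NOHMMONLY_BANNER = """\
# query sequence file:                   genome.fasta
# target CM database:                    SSU_rRNA_archaea.cm
# CM configuration:                      glocal
# verbose output mode:                   on
# HMM-only mode for 0 basepair models:   no
# number of worker threads:              4
"""

NOHMM_BANNER = """\
# query sequence file:                   genome.fasta
# target CM database:                    SSU_rRNA_archaea.cm
# CM configuration:                      glocal
# verbose output mode:                   on
# CM-only mode:                          on [HMM filters off]
# truncated hit detection:               off [due to --nohmm]
# number of worker threads:              4
"""

# Assumed format: '#', a key without colons, ':', whitespace, then the value.
SETTING_LINE = re.compile(r"^#\s+([^:]+):\s+(\S.*)$")


def parse_banner(text):
    """Collect '# key: value' lines from a cmscan banner into a dict."""
    settings = {}
    for line in text.splitlines():
        match = SETTING_LINE.match(line)
        if match:
            settings[match.group(1).strip()] = match.group(2).strip()
    return settings


def differing_settings(first, second):
    """Return {key: (value_in_first, value_in_second)} for settings that differ."""
    keys = sorted(set(first) | set(second))
    return {
        key: (first.get(key), second.get(key))
        for key in keys
        if first.get(key) != second.get(key)
    }


diff = differing_settings(parse_banner(NOHMMONLY_BANNER), parse_banner(NOHMM_BANNER))
for key, (with_nohmmonly, with_nohmm) in diff.items():
    print(f"{key}: --nohmmonly={with_nohmmonly!r}, --nohmm={with_nohmm!r}")
```

A setting that is reported in only one of the two runs appears as `None` on the other side, which makes it easy to see which mode lines each option actually triggers.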

Unfortunately, it seems like cmscan does not fail with an error when the incorrect --nohmmonly option is given, but instead just uses the default mode? Therefore I think bakta has not been using the CM-only mode as you probably intended.

Thanks and best wishes, Richard

richardstoeckl commented 1 month ago

Nvm, just as I proofread everything and posted the issue, I found the --nohmmonly option...

oschwengers commented 1 month ago

Anyway, thanks a lot Richard for the effort!