oschwengers / bakta

Rapid & standardized annotation of bacterial genomes, MAGs & plasmids
GNU General Public License v3.0

Increase verbosity for invalid --proteins options #322

Open NonAggressiveHail opened 1 day ago

NonAggressiveHail commented 1 day ago

When an incorrectly formatted GenBank or FASTA file is provided via the --proteins option, no information is given on why the file is invalid.

I attempted to annotate a genome with these reference proteins, downloaded in GenBank format. When Bakta annotates CDSs, it fails with the error "ERROR: User proteins file GenBank format not valid!". I reran with the --debug option, expecting more information on how the file is invalid so that I could repair it, but nothing further is reported. Ideally, the error would state which aspects of the file are invalid so they can be repaired.

I have also tried converting the GenBank file to a FASTA file with Prokka's prokka-genbank_to_fasta_db, but using the converted file gave the error "ERROR: User proteins file Fasta format not valid!". Again, it would be helpful if this error explained why the file is invalid, or if Bakta included a utility with similar functionality that produces valid input.

NonAggressiveHail commented 22 hours ago

I am not sure whether there is also a bug at play here. For example, I tried the following FASTA file:

>WP_003116930 aes~~~aes~~~CDD:400284
MALNPDIAAYLELVGNGRSSGKSLPMHQLTVQQAREQFDQSSALMDPGLDEPLARVETLFVPARDGTPLP
ARLYSPQGLSASPPLPGVLYLHGGGYVVGSLDSHDALCASLAERAGCVVLSLAYRLAPEWRFPTAAEDAE
DAWCWLAAEAARLGIDPQRLAVAGDSVGGSLCAVLSHRLALRGEASQPRLQVLIYPVTDASRTHQSIERY
AVGHLLEKDSLEWFYQHYQRSPEDRQDPRFSPLLGVVPADLAPTLLLVAECDPLHDEGIAYAEHLRQGGA
RVELCVYPGMTHDFLRMGAIVDEADDAKDMIADALVAALAT

While running with this file I do not get the same error; however, I do get the following output:

predict & annotate CDSs...
    predicted: 5682
    discarded spurious: 0
    revised translational exceptions: 0
    detected IPSs: 5547
    found PSCs: 128
    found PSCCs: 4
    lookup annotations...
    conduct expert systems...
    amrfinder: 8
    protein sequences: 656
    user protein sequences: 0

I am surprised to see "user protein sequences: 0", when I would expect it to be 1.
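To illustrate the kind of diagnostics requested in this issue, here is a minimal Python sketch of a FASTA check that reports each problem with its line number. It is not Bakta's validation code: `diagnose_protein_fasta` is a hypothetical helper, and the default of three `~~~`-separated description fields is an assumption inferred only from the header shown in this thread, so Bakta's documentation should be consulted for the layout it actually expects.

```python
import re

# Hypothetical helper, not part of Bakta. It only illustrates verbose diagnostics.
# Accepted residue letters: the 20 standard amino acids plus B, J, O, U, X, Z and '*'.
AMINO_ACIDS = re.compile(r'^[ACDEFGHIKLMNPQRSTVWYBJOUXZ*]+$', re.IGNORECASE)


def diagnose_protein_fasta(text, field_sep='~~~', allowed_field_counts=(3,)):
    """Return a list of human-readable problems found in protein FASTA text.

    allowed_field_counts is an assumption: the header in this thread has three
    '~~~'-separated fields, so three is used as the default.
    """
    problems = []
    records = []  # each entry: [header line number, header text, sequence]

    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue
        if line.startswith('>'):
            records.append([number, line[1:], ''])
        elif not records:
            problems.append(f'line {number}: sequence data before the first ">" header')
        else:
            if not AMINO_ACIDS.match(line):
                problems.append(f'line {number}: unexpected characters in sequence')
            records[-1][2] += line

    if not records:
        problems.append('no FASTA records found')

    seen = set()
    for number, header, sequence in records:
        identifier, _, description = header.partition(' ')
        if not identifier:
            problems.append(f'line {number}: header has no identifier')
        elif identifier in seen:
            problems.append(f'line {number}: duplicate identifier "{identifier}"')
        seen.add(identifier)
        field_count = len(description.split(field_sep))
        if field_count not in allowed_field_counts:
            problems.append(
                f'line {number}: description has {field_count} "{field_sep}"-separated '
                f'field(s), expected one of {allowed_field_counts}'
            )
        if not sequence:
            problems.append(f'line {number}: record "{identifier}" has no sequence')

    return problems
```

Calling `diagnose_protein_fasta(open(path).read())` on a user proteins file and printing each returned string would give one actionable message per problem, instead of a single "format not valid" error. An empty list means that none of these particular checks failed, not that Bakta would accept the file.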