ncbi / pgap

NCBI Prokaryotic Genome Annotation Pipeline

[FEATURE REQUEST] (a) Append "Candidatus" prefix if needed automagically. (b) Separate error from general log #228

Closed aboffin closed 1 year ago

aboffin commented 1 year ago

Hi,

First of all, I would like to thank the PGAP/NCBI team for putting together a high quality microbial annotation tool. The documentation is easy to follow and the docker installation works like a charm. I used PGAP a few years ago and I am revisiting it now to annotate a few novel genera.

Is your feature request related to a problem? Please describe.

Right now, we need to provide either "Genus species" or "Genus" in the yaml file, or run the taxonomy check/ANI to overrule the given genus. I have a list of genomes from several novel genera identified by GTDB-Tk. Many of these have Candidatus status in NCBI (GTDB-Tk does not add the "Candidatus" prefix), and some of them are "recognized" genera. So even though I am confident about the genus and species of several genomes, I am not sure whether I should add the Candidatus prefix. If I omit Candidatus for a novel species, the annotation fails; if I mistakenly add Candidatus to a well-known species, the annotation also fails.

Describe the solution you'd like

While scanning the genus/species names in the yaml file, detect whether the "Candidatus" prefix is needed and, if so, prepend it to the genus name.
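A minimal Python sketch of the requested behavior, under stated assumptions: `PREFERRED_NAMES` is a hypothetical local table mapping a bare "Genus species" string to the NCBI Taxonomy preferred name, and `resolve_name` is a hypothetical helper, not part of PGAP. The table entries are illustrative and should be checked against NCBI Taxonomy before use.

```python
# Hypothetical lookup table (NOT part of PGAP): bare name -> NCBI preferred name.
# Entries are illustrative; build the real table from an NCBI Taxonomy lookup.
PREFERRED_NAMES = {
    "pelagibacter ubique": "Candidatus Pelagibacter ubique",
    "escherichia coli": "Escherichia coli",
}

PREFIX = "Candidatus "


def strip_prefix(name):
    """Collapse whitespace and drop a leading 'Candidatus ' if present."""
    name = " ".join(name.split())
    if name.lower().startswith(PREFIX.lower()):
        return name[len(PREFIX):]
    return name


def resolve_name(name, table=PREFERRED_NAMES):
    """Return the preferred name for `name`, or None if it is not in the table.

    The prefix is stripped before the lookup, so a name is corrected both
    when 'Candidatus' is missing and when it was added by mistake.
    """
    return table.get(strip_prefix(name).lower())
```

With this table, `resolve_name("Pelagibacter ubique")` adds the prefix and `resolve_name("Candidatus Escherichia coli")` removes it. A name missing from the table returns `None`, so it can be flagged for a manual check rather than guessed.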

Describe alternatives you've considered

Run the list of genomes, identify the failures, check their taxonomy manually on NCBI, add Candidatus where needed, and run them again.

Additional context

Additionally, one wonders what the purpose is of the enormous amount of garbage collected in cwltool.log. It is certainly not meant for human consumption, and if it is, the error messages are buried unhelpfully under a pile of yaml/json (?) and general log messages. Is there another tool that people need to use to parse this log file and pinpoint the errors that cause a failure? I am genuinely aghast and would like to know how people handle this. Grepping for Error/error does not help because there are thousands of ignore_all_errors lines spewed across the file. TL;DR: It would be good to report errors separately from general logging messages so that one can easily identify and resolve them.

Thanks again for the great work!

thibaudnis commented 1 year ago

Hello - Thanks for writing and sorry for the slow response. To help with the first issue, you could use https://www.ncbi.nlm.nih.gov/Taxonomy/TaxIdentifier/tax_identifier.cgi. Enter the organisms on your list, and you'll get back the corresponding NCBI Taxonomy preferred names for each. I recognize this is an additional step, so we'll think about whether we could somehow incorporate a conversion, or some sort of fuzzy matching, in the PGAP package for cases like yours. As for your second comment, yes, I recognize that cwltool.log is unwieldy and needs to be given more thought. It does contain useful information for debugging though.
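Until errors are reported separately, one workaround for the grep problem described above is a small filter that ignores the `ignore_all_errors` token before searching for error keywords. The Python sketch below is not a PGAP tool: the function name `extract_error_lines` and the keyword list are assumptions, and the keywords may need tuning for a real cwltool.log.

```python
import re
import sys

# Assumed keywords that mark a line of interest; adjust as needed.
ERROR_RE = re.compile(r"error|exception|traceback", re.IGNORECASE)
# Token that produces thousands of false positives when grepping for "error".
NOISE_TOKEN = "ignore_all_errors"


def extract_error_lines(lines):
    """Return (line_number, text) pairs for lines that mention an error.

    The noise token is removed before matching, so a line is kept only if
    it contains an error keyword somewhere other than in that token.
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        cleaned = line.replace(NOISE_TOKEN, "")
        if ERROR_RE.search(cleaned):
            hits.append((number, line.rstrip("\n")))
    return hits


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python filter_log.py cwltool.log
    with open(sys.argv[1], errors="replace") as handle:
        for number, text in extract_error_lines(handle):
            print(f"{number}: {text}")
```

Removing the token instead of skipping the whole line keeps a genuine error message that happens to share a line with `ignore_all_errors`.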

aboffin commented 1 year ago

@thibaudnis Thank you for your suggestion of the Tax Identifier tool. I think that will certainly take care of the novel genera issue. Thanks again!