Get gtdbtk to use dfast proteins as input.

LeeBergstrand commented 5 months ago

gtdbtk has a --genes parameter that lets it take the output of other gene prediction pipelines as input, rather than running prodigal itself.

This parameter causes gtdbtk to take proteins as input (https://github.com/Ecogenomics/GTDBTk/issues/571).

I'm wondering if the speed-up from skipping prodigal inside the GTDBtk rule is worth it, because --genes also skips the ANI and mash steps and immediately starts placing the found markers in the GTDB trees using pplacer. On my machine, I run out of memory using pplacer. With my dataset, the normal pipeline skips the pplacer step most of the time: if the ANI screen finds a close match (i.e., the organism is already in the tree), I think it skips the marker gene placement step, which speeds up the pipeline.

@jmtsuji Would using --genes be useful for you as an optional approach?

LeeBergstrand commented 5 months ago

To do the conversion, I had to change the batch file to point at the dfast proteins file rather than the genome file and add the --genes flag.
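
For reference, a minimal sketch of that conversion in Python. It assumes the usual GTDB-Tk batch file layout (tab-separated: FASTA path, then genome ID) and that each DFAST run wrote a protein FASTA file; the function names and the example paths are hypothetical, not part of rotary or GTDB-Tk, and depending on the GTDB-Tk version other classify_wf options may also be required.

```python
import csv


def write_protein_batchfile(proteins_by_genome, batchfile_path):
    """Write a tab-separated batch file: protein FASTA path, then genome ID.

    proteins_by_genome maps a genome ID to the path of its DFAST protein file
    (hypothetical input structure, chosen for illustration).
    """
    with open(batchfile_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        # Sort by genome ID so the batch file is reproducible between runs.
        for genome_id, protein_path in sorted(proteins_by_genome.items()):
            writer.writerow([str(protein_path), genome_id])


def build_classify_command(batchfile_path, out_dir):
    """Build (but do not run) a gtdbtk classify_wf command that uses --genes."""
    return [
        "gtdbtk", "classify_wf",
        "--batchfile", str(batchfile_path),
        "--out_dir", str(out_dir),
        "--genes",  # inputs are predicted proteins, so gene calling is skipped
    ]
```

For example, calling write_protein_batchfile({"genome_1": "dfast/genome_1/protein.faa"}, "batchfile.tsv") and then passing build_classify_command("batchfile.tsv", "gtdbtk_out") to subprocess.run would reproduce the change described above.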

I also got the following warning:

The final classification predicted may be less accurate due to the use of amino acid files instead of nucleotide files as input to the pipeline. Without nucleotides files, the ANI classification step of the workflow has been skipped and therefore no ANI matches with existing species in GTDB could be reported.

jmtsuji commented 5 months ago

@LeeBergstrand Thanks for this idea and the extra context! Just to confirm, are the key reasons for exposing the --genes flag to provide an annotation speedup (by skipping Prodigal) and to force execution of pplacer (skipping the ANI search, if a user wants to skip this, e.g., for running tests)?

On my end, aside from those possible benefits, I can see the following possible disadvantages:

- Because GTDB-Tk was designed with Prodigal-based annotations in mind, I wonder if providing annotations from other tools like DFAST might affect benchmarks a little bit. (For example, it seems like the GTDB-Tk team did not want to jump on switching from Prodigal to Pyrodigal without some testing: https://github.com/Ecogenomics/GTDBTk/issues/456 )
- Am I correct that setting --genes skips the ANI step entirely? If so, I wonder if using --genes might ultimately cost more time (by forcing pplacer) compared to just re-annotating the genome with Prodigal, for average use cases.

Weighing these advantages and disadvantages, I wonder if most users would not need to use --genes. What do you think? Are there other advantages you foresee? Or do you think this setting might be useful for end-to-end tests and benchmarking of rotary? Thanks!

LeeBergstrand commented 3 months ago

> On my end, aside from those possible benefits, I can see the following possible disadvantages:

I think, given the issues you brought up and the fact that Prodigal runs fast enough, most of the advantages of using --genes would be offset by the slowdown from skipping the ANI search. We can explore this later. Closing for now.

rotary-genomics / rotary

Get gtdbtk to use dfast proteins as input. #142