LeeBergstrand closed this 3 months ago
To do the conversion, I had to change the batch file to point at the dfast proteins file rather than the genome file and add the --genes flag.
I also got the following warning:
The final classification predicted may be less accurate due to the use of amino acid files instead of nucleotide files as input to the pipeline. Without nucleotides files, the ANI classification step of the workflow has been skipped and therefore no ANI matches with existing species in GTDB could be reported.
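The batch file change described above could be scripted roughly as follows. This is only a sketch: it assumes the two-column, tab-separated GTDB-Tk batch file layout (FASTA path, then genome ID), and the function name `rewrite_batchfile`, the `protein_dir` argument, and the `<genome ID>.faa` naming scheme are hypothetical, not taken from rotary or dfast.

```python
import csv
from pathlib import Path


def rewrite_batchfile(src, dst, protein_dir, suffix=".faa"):
    """Copy a batch file, swapping each genome path for a protein FASTA path.

    Assumption: protein files are named <protein_dir>/<genome ID><suffix>.
    Any columns after the genome ID are carried over unchanged.
    """
    protein_dir = Path(protein_dir)
    with open(src, newline="") as fin, open(dst, "w", newline="") as fout:
        reader = csv.reader(fin, delimiter="\t")
        writer = csv.writer(fout, delimiter="\t", lineterminator="\n")
        for row in reader:
            if len(row) < 2:
                continue  # skip blank or malformed lines
            genome_id = row[1]
            protein_path = (protein_dir / f"{genome_id}{suffix}").as_posix()
            writer.writerow([protein_path, genome_id, *row[2:]])
```

The rewritten batch file would then be passed to gtdbtk together with --genes, as in the comment above.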
@LeeBergstrand Thanks for this idea and the extra context! Just to confirm, are the key reasons for exposing the --genes flag to provide an annotation speedup (by skipping Prodigal) and to force execution of pplacer (skipping the ANI search, if a user wants to skip this, e.g., for running tests)?
On my end, aside from those possible benefits, I can see the following possible disadvantages:
--genes skips the ANI step entirely? If so, I wonder if using --genes might ultimately cost more time (by forcing pplacer) compared to just re-annotating the genome with Prodigal, for average use cases.
Weighing these advantages and disadvantages, I wonder if most users would not need to use --genes. What do you think? Are there other advantages you foresee? Or do you think this setting might be useful for end-to-end tests and benchmarking of rotary? Thanks!
I think, given the issues you brought up and the fact that Prodigal runs fast enough, most of the advantages of running with --genes would be offset by the slowdown from not doing the ANI search. We can explore this later. Closing for now.
gtdbtk has a --genes parameter that allows gtdbtk to use the output of gene prediction pipelines, rather than Prodigal, as input. This parameter causes gtdbtk to take proteins as input (https://github.com/Ecogenomics/GTDBTk/issues/571).
I'm wondering whether the speed-up from skipping Prodigal inside the GTDBtk rule is worth it, because --genes also skips the ANI and Mash steps and immediately starts placing the identified markers in the GTDB trees using pplacer. On my machine, pplacer runs out of memory. With my dataset, the pipeline skips the pplacer step most of the time. If the ANI screen finds a close match (i.e., the organism is already in the tree), I think it skips the marker gene insertion step, which speeds up the pipeline.
@jmtsuji Would using --genes be useful for you as an optional approach?
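One way to make --genes optional would be a small helper that builds the gtdbtk command and appends the flag only when asked to. This is a sketch, not rotary's actual rule: the function name `gtdbtk_classify_command` and its parameters are hypothetical, and any further options a given GTDB-Tk version needs (for example, those controlling the ANI screen) are assumed to be supplied through `extra_args`.

```python
def gtdbtk_classify_command(batchfile, out_dir, cpus=1, use_genes=False, extra_args=()):
    """Build the argument list for a gtdbtk classify_wf run.

    use_genes=True appends --genes, so the batch file is expected to list
    protein FASTA files instead of nucleotide genomes.
    """
    cmd = [
        "gtdbtk", "classify_wf",
        "--batchfile", str(batchfile),
        "--out_dir", str(out_dir),
        "--cpus", str(cpus),
    ]
    if use_genes:
        cmd.append("--genes")
    cmd.extend(extra_args)  # hypothetical hook for version-specific options
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`; leaving `use_genes` off by default keeps the nucleotide input and the ANI screen as the normal path.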