rotary-genomics / rotary

Assembly/annotation workflow for Nanopore-based microbial genome data containing circular DNA elements
BSD 3-Clause "New" or "Revised" License

Make annotation sub-rules optional #140

Closed jmtsuji closed 3 months ago

jmtsuji commented 5 months ago

Currently, unless the --until snakemake flag is used, rotary will run all steps in the annotation module. At the moment, these steps include:

Some of these steps are quite time (or memory) consuming or need a lot of disk space for DB download, and these factors can make testing difficult. Also, some users might just want a finished genome without all the detailed annotations.

@LeeBergstrand What would you think about adding a section to the config file where the desired annotations can be specified? Something like:

# Annotation sub-rules for rotary to run (comment out or delete lines to skip annotation types)
annotations:
- functional_annotations_DFAST
- functional_annotation_EggNOG_mapper
- taxonomy_assignment_GTDBTk
- quality_check_CheckM2 # Must be set for automated quality check to run
- read_coverage

Regarding DFAST: I think we should always provide gene/ORF calls for users, but functional annotation by DFAST is something we could make optional. I checked, and DFAST can be run with a --no_func_anno flag to skip functional annotation. So the functional_annotations_DFAST entry could be used to toggle DFAST's --no_func_anno flag.

I think it might be fairly easy to implement this optional annotation concept. Most of the options could be implemented just by adding some conditionals to the summarize_annotation rule. To implement the --no_func_anno version of DFAST, we might need to either add conditions to the run_dfast rule or make a second rule called something like run_dfast_simple. Not sure if the --no_func_anno version of DFAST would still need a DB dir... we would need to test this.
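As a rough illustration of the "conditionals in summarize_annotation" idea, here is a minimal Python sketch of a helper that picks which files a summary rule would wait for. The function name, the dictionary, and every file path are hypothetical placeholders rather than rotary's actual code or outputs; only the annotation names come from the proposed config above.

```python
# Hypothetical mapping from the proposed config entries to the files a
# summary rule would need. The paths are placeholders, not rotary's outputs.
OPTIONAL_OUTPUTS = {
    "functional_annotation_EggNOG_mapper": "annotation/eggnog/eggnog.tsv",
    "taxonomy_assignment_GTDBTk": "annotation/gtdbtk/gtdbtk.tsv",
    "quality_check_CheckM2": "annotation/checkm2/quality_report.tsv",
    "read_coverage": "annotation/coverage/coverage.tsv",
}

# Placeholder for the DFAST output, which is always requested so that
# gene/ORF calls are produced no matter what is selected.
DFAST_OUTPUT = "annotation/dfast/genome.gbk"


def summary_inputs(config):
    """Return the input files for the summary step, given a config dict."""
    selected = config.get("annotations") or []
    inputs = [DFAST_OUTPUT]
    for name, path in OPTIONAL_OUTPUTS.items():
        # Only request outputs for annotation types left in the config.
        if name in selected:
            inputs.append(path)
    return inputs
```

In a Snakemake workflow, a helper like this could plausibly back an input function, so that rules for unselected annotations are never scheduled.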

One benefit of adding optional annotation is that we would not need to worry as much about supporting --until circularize as a common use case. I am using --until circularize at the moment to bypass the annotation module.

@LeeBergstrand Let me know your thoughts about this optional annotation sub-rule idea -- thanks!

jmtsuji commented 5 months ago

> To implement the --no_func_anno version of DFAST, we might need to either add conditions to the run_dfast rule or make a second rule called something like run_dfast_simple. Not sure if the --no_func_anno version of DFAST would still need a DB dir... we would need to test this.

I just tested running DFAST in --no_func_anno mode. It does not need a DB dir and finishes in ~10 seconds with 4 threads. To me, this suggests that the functional_annotations_DFAST param above would be a helpful feature, if we decide to implement the optional annotation sub-rules as I've proposed.
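A minimal sketch of how a single run_dfast rule could toggle the flag, written as a plain Python helper that assembles the command line. The helper name and its parameters are hypothetical. Apart from --no_func_anno, which is discussed above, the DFAST options used here (--genome, --out, --cpu, --dbroot) are assumptions about DFAST's CLI and should be checked against its help output.

```python
def build_dfast_command(genome, outdir, threads, annotations, db_dir=None):
    """Assemble a DFAST command as a list of arguments.

    Gene/ORF calling always runs; functional annotation runs only when
    'functional_annotations_DFAST' is among the selected annotations.
    """
    # Assumed DFAST options: --genome, --out, --cpu (verify before use).
    cmd = ["dfast", "--genome", genome, "--out", outdir, "--cpu", str(threads)]
    if "functional_annotations_DFAST" in annotations:
        if db_dir is None:
            raise ValueError("functional annotation needs a DFAST DB directory")
        # Assumed option for pointing DFAST at its database directory.
        cmd += ["--dbroot", db_dir]
    else:
        # Skip functional annotation; per the test above, no DB dir is needed.
        cmd.append("--no_func_anno")
    return cmd
```

Keeping one rule with a conditional flag avoids duplicating run_dfast as run_dfast_simple, at the cost of making the DB directory a conditional input.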

LeeBergstrand commented 5 months ago

> @LeeBergstrand What would you think about adding a section to the config file where the desired annotations can be specified? Something like:
>
> # Annotation sub-rules for rotary to run (comment out or delete lines to skip annotation types)
> annotations:
> - functional_annotations_DFAST
> - functional_annotation_EggNOG_mapper
> - taxonomy_assignment_GTDBTk
> - quality_check_CheckM2 # Must be set for the automated quality check to run
> - read_coverage

It might be easier to specify things as a list like we did for contamination_references_ncbi_accessions:

# Select Annotations: DFAST_Func (light annotation), EggNOG (heavy annotation including KEGG), 
# GTDBTk (taxonomy), CheckM2 (genome quality), coverage (read coverage statistics)
annotations: ['DFAST_Func', 'EggNOG', 'GTDBTk', 'CheckM2', 'coverage']

Wouldn't this make it easier to add new annotations, and easier for people to understand without having to remember the rule names?
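To make the list-style option concrete, here is a small sketch of how the workflow might validate the list and turn it into on/off switches. The function and constant names are hypothetical; only the five short annotation names come from the example above.

```python
# Short annotation names taken from the example config line above.
KNOWN_ANNOTATIONS = ("DFAST_Func", "EggNOG", "GTDBTk", "CheckM2", "coverage")


def parse_annotations(config):
    """Map each known annotation name to True/False based on the config list."""
    requested = config.get("annotations") or []
    # Reject typos early instead of silently skipping an annotation.
    unknown = [name for name in requested if name not in KNOWN_ANNOTATIONS]
    if unknown:
        raise ValueError(f"Unknown annotation(s) in config: {unknown}")
    return {name: name in requested for name in KNOWN_ANNOTATIONS}
```

Adding a new annotation type would then only require extending KNOWN_ANNOTATIONS and checking the corresponding switch in the relevant rule.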

> @LeeBergstrand Let me know your thoughts about this optional annotation sub-rule idea -- thanks!

We should always provide gene/ORF calls for users and make the DFAST functional annotations optional.

jmtsuji commented 5 months ago

@LeeBergstrand Thanks for the feedback!

> It might be easier to specify things as a list like we did for contamination_references_ncbi_accessions:
>
> # Select Annotations: DFAST_Func (light annotation), EggNOG (heavy annotation including KEGG),
> # GTDBTk (taxonomy), CheckM2 (genome quality), coverage (read coverage statistics)
> annotations: ['DFAST_Func', 'EggNOG', 'GTDBTk', 'CheckM2', 'coverage']

I like this approach 👍 The simplified annotation names you suggested are also great.

> We should always provide gene/ORF calls for users and make the DFAST functional annotations optional.

OK, sounds good!

I'll plan to make a PR for this in the near-ish future. I'm working on code for the rotate/stitch modules at the moment, so I might not get to this PR right away.

LeeBergstrand commented 3 months ago

Addressed in https://github.com/rotary-genomics/rotary/pull/154