rotary-genomics / rotary

Assembly/annotation workflow for Nanopore-based microbial genome data containing circular DNA elements
BSD 3-Clause "New" or "Revised" License

Improve logging of required coverage thresholds for contigs #68

Open jmtsuji opened 11 months ago

jmtsuji commented 11 months ago

Problem description

Currently, rule summarize_contigs_by_coverage in rotary.smk filters the contigs retained from an assembly based on short/long read mapping stats, but it does not log which contigs were filtered out. After the run, the user has to manually compare the contigs listed in polish/cov_filter/filtered_contigs.list (post-filter) to the contigs in assembly/assembly_info.txt (pre-filter) to see what was dropped. This is a problem, because users need to know whether any contigs were dropped during the pipeline.
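For reference, the manual comparison described above could be scripted roughly as follows. This is only a sketch, not part of rotary: the helper names `read_contig_names` and `find_dropped_contigs` are hypothetical, and it assumes that both files list the contig name in the first whitespace-delimited column and that header lines start with `#`.

```python
from pathlib import Path


def read_contig_names(path, comment_prefix="#"):
    """Return contig names from the first column of a text file.

    Assumption: one contig per line, name in the first whitespace-delimited
    column, header/comment lines starting with `comment_prefix`.
    """
    names = []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith(comment_prefix):
            continue
        names.append(line.split()[0])
    return names


def find_dropped_contigs(pre_filter_path, post_filter_path):
    """List contigs present before filtering but absent afterwards,
    in the order they appear in the pre-filter file."""
    retained = set(read_contig_names(post_filter_path))
    return [name for name in read_contig_names(pre_filter_path)
            if name not in retained]
```

Calling `find_dropped_contigs("assembly/assembly_info.txt", "polish/cov_filter/filtered_contigs.list")` would then return the names of the dropped contigs, provided the file layouts match the assumptions above.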

Proposed solution

Add logging to summarize_contigs_by_coverage that reports:

Later on, a mapping file should also be made that links this info to the contigs in the final annotated genome (where the contig names are revised by the annotation pipeline). This would make the log file easier to interpret. (A separate issue could be opened for this in the future if it turns out not to be easy to address.)
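A minimal sketch of what the logging could look like, using Python's standard `logging` module. Everything here is an assumption for illustration: the function name `log_coverage_filter`, the input (a dict of contig name to mean coverage), and the use of a single minimum-coverage threshold. The actual filtering criteria in rotary.smk may differ.

```python
import logging


def log_coverage_filter(coverage_by_contig, min_coverage, logger=None):
    """Log the threshold applied and the fate of each contig.

    Hypothetical inputs: `coverage_by_contig` maps contig name -> mean
    coverage; `min_coverage` is the single threshold assumed here.
    Returns (retained, dropped) lists of contig names.
    """
    logger = logger or logging.getLogger("summarize_contigs_by_coverage")
    retained, dropped = [], []

    logger.info("Minimum required mean coverage: %s", min_coverage)
    for name, coverage in coverage_by_contig.items():
        passed = coverage >= min_coverage
        (retained if passed else dropped).append(name)
        logger.info("%s\tcoverage=%.2f\t%s", name, coverage,
                    "retained" if passed else "dropped")

    logger.info("Retained %d of %d contigs; dropped: %s",
                len(retained), len(coverage_by_contig),
                ", ".join(dropped) or "none")
    return retained, dropped
```

If the rule runs through a Snakemake `script:` directive, the messages could be sent to the rule's log file with something like `logging.basicConfig(filename=snakemake.log[0], level=logging.INFO)` before the function is called.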

Possible caveats

Right now, long or short read coverage is only calculated for the pre-filtered contigs if the user chooses to filter by long or short read coverage, respectively. Perhaps both should be calculated (for reference) whenever any kind of coverage-based filtration is requested.

Also, I need to check whether there is an easy way to map the names between the pre-annotation and post-annotation contigs. If not, some kind of alignment-based approach might be needed to match the contigs.
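One candidate "easy way" to try before resorting to alignment is to match contigs by a checksum of their sequence. The sketch below is an assumption-laden illustration, not a tested solution for rotary: the function names are hypothetical, and it only works if the annotation pipeline leaves each sequence unchanged (no rotation, reverse-complementing, or trimming). Contigs without an exact match map to `None`, and those would still need an alignment-based approach.

```python
import hashlib


def read_fasta(path):
    """Parse a FASTA file into {contig_name: sequence} (plain-text parser)."""
    records, name, chunks = {}, None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    records[name] = "".join(chunks)
                # Use the first word of the header as the contig name.
                name, chunks = line[1:].split()[0], []
            elif line and name is not None:
                chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records


def sequence_digest(sequence):
    """Case-insensitive checksum of a sequence."""
    return hashlib.sha256(sequence.upper().encode()).hexdigest()


def map_contig_names(pre_fasta, post_fasta):
    """Map pre-annotation names to post-annotation names by exact sequence.

    Returns {pre_name: post_name or None}. Assumes sequences are unique
    within the post-annotation file.
    """
    post_by_digest = {sequence_digest(seq): name
                      for name, seq in read_fasta(post_fasta).items()}
    return {name: post_by_digest.get(sequence_digest(seq))
            for name, seq in read_fasta(pre_fasta).items()}
```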