Problem description
Currently, the rule summarize_contigs_by_coverage in rotary.smk decides which contigs from an assembly are retained, based on short- and/or long-read mapping stats. However, it does not log what it did. After the run, the user has to manually compare the contigs listed in polish/cov_filter/filtered_contigs.list (post-filter) to the contigs in assembly/assembly_info.txt (pre-filter) to see what was dropped. This is a problem, because the user needs to know whether any contigs were dropped during the pipeline.
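For reference, the manual comparison described above can be scripted as a stopgap. This is only a sketch: it assumes both files are whitespace-delimited with the contig name in the first column and that header lines start with "#". I have not verified either assumption against the actual file formats, and the function names are hypothetical.

```python
from pathlib import Path


def read_contig_names(path):
    """Return the set of contig names found in the first column of a text file.

    Assumption: blank lines and lines starting with '#' carry no contig names.
    """
    names = set()
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        names.add(line.split()[0])
    return names


def dropped_contigs(pre_filter_path, post_filter_path):
    """Return the sorted names present before filtering but absent afterwards."""
    pre = read_contig_names(pre_filter_path)
    post = read_contig_names(post_filter_path)
    return sorted(pre - post)
```

Calling dropped_contigs("assembly/assembly_info.txt", "polish/cov_filter/filtered_contigs.list") from the run directory would then list the contigs that were filtered out.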
Proposed solution
Add logging to summarize_contigs_by_coverage that reports, for each contig:
A summary of the coverage stats (long and short read)
Basic properties of the contig (e.g., circular status, length)
The decision to keep or drop the contig
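A minimal sketch of what such a log could look like, written as a tab-separated table with one row per contig. The column names, the threshold parameters (min_long_cov, min_short_cov), and the "coverage below threshold means drop" rule are all assumptions for illustration; the actual filter logic in summarize_contigs_by_coverage may differ.

```python
import csv

# Hypothetical column layout for the log table.
COLUMNS = ["contig", "length", "circular", "long_read_cov",
           "short_read_cov", "decision", "reason"]


def evaluate_contig(contig, min_long_cov=None, min_short_cov=None):
    """Return (decision, reason) for one contig.

    A threshold of None means that filter is inactive. A coverage value of
    None (not calculated) is never treated as a failure in this sketch.
    """
    failures = []
    checks = (("long_read_cov", min_long_cov), ("short_read_cov", min_short_cov))
    for key, threshold in checks:
        value = contig.get(key)
        if threshold is not None and value is not None and value < threshold:
            failures.append(f"{key} {value} < {threshold}")
    if failures:
        return "drop", "; ".join(failures)
    return "keep", "passed all active filters"


def write_filter_log(contigs, handle, min_long_cov=None, min_short_cov=None):
    """Write one TSV row per contig to an open text handle and return the rows."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS, delimiter="\t")
    writer.writeheader()
    rows = []
    for contig in contigs:
        decision, reason = evaluate_contig(contig, min_long_cov, min_short_cov)
        row = {col: contig.get(col) for col in COLUMNS[:5]}
        row.update(decision=decision, reason=reason)
        # Missing or uncalculated values are written as "NA".
        row = {key: ("NA" if value is None else value) for key, value in row.items()}
        writer.writerow(row)
        rows.append(row)
    return rows
```

Writing a table rather than free-text log lines keeps the output easy to join with other per-contig files later, which matters for the mapping file mentioned below.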
Later on, a mapping file should also be generated that links this information to the contigs in the final annotated genome (where the annotation pipeline renames the contigs). This would make the log file easier to interpret. (This could be split into a separate issue if it turns out to be hard to address.)
Possible caveats
Right now, long- or short-read coverage is only calculated for the pre-filter contigs if the user chooses to filter by long- or short-read coverage, respectively. Perhaps both should be calculated (for reference) whenever any kind of coverage-based filtering is requested.
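The proposed behaviour could be sketched roughly as follows. The function and parameter names are hypothetical, and the availability flags are an added assumption: short-read coverage obviously cannot be calculated for a run without short reads.

```python
def coverage_types_to_compute(filter_long, filter_short,
                              has_long_reads=True, has_short_reads=True):
    """Return the set of coverage types to calculate under the proposal.

    Current behaviour (per this issue): only the type being filtered on is
    calculated. Proposed: if any coverage filter is requested, calculate every
    type for which reads are available, so the log can report both.
    """
    if not (filter_long or filter_short):
        return set()
    types = set()
    if has_long_reads:
        types.add("long")
    if has_short_reads:
        types.add("short")
    return types
```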
Also, I need to check whether there is an easy way to map names between the pre-annotation and post-annotation contigs. If not, some kind of alignment-based approach might be needed to match the contigs.
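One possible "easy way" to try first is matching contigs by sequence checksum. This is a sketch only: it assumes the annotation pipeline changes the contig names but leaves the sequences identical (apart from letter case). If the sequences are rotated, reverse-complemented, or otherwise edited, this will find no match and the alignment-based approach would still be needed. All function names here are hypothetical.

```python
import hashlib


def read_fasta(handle):
    """Yield (name, sequence) pairs from an open FASTA text handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def sequence_checksum(sequence):
    """Return a case-insensitive SHA-256 checksum of a sequence."""
    return hashlib.sha256(sequence.upper().encode("ascii")).hexdigest()


def map_contig_names(pre_handle, post_handle):
    """Map each pre-annotation name to the post-annotation names with an identical sequence.

    An empty list means no exact match was found for that contig.
    """
    post_by_checksum = {}
    for name, sequence in read_fasta(post_handle):
        post_by_checksum.setdefault(sequence_checksum(sequence), []).append(name)
    return {
        name: post_by_checksum.get(sequence_checksum(sequence), [])
        for name, sequence in read_fasta(pre_handle)
    }
```

The values are lists rather than single names so that identical sequences (for example, duplicated plasmids) show up as ambiguous instead of being silently collapsed.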