rotary-genomics / rotary

Assembly/annotation workflow for Nanopore-based microbial genome data containing circular DNA elements
BSD 3-Clause "New" or "Revised" License
2 stars 1 forks source link

Refactor quality report generation to use wildcards #181

Closed LeeBergstrand closed 2 weeks ago

LeeBergstrand commented 1 month ago

Change the generation of quality reports to a callback function that determines what FASTQC files must be generated. FASTQC files are generated one at a time, and one FASTQC job is spawned per raw or qced FASTQ file. Logic in the callback function allows for creating FASTQC files for adapter trimming and decontamination only when these steps occur. This addresses concerns brought up in https://github.com/rotary-genomics/rotary/pull/161#pullrequestreview-2087595363.

LeeBergstrand commented 1 month ago

@jmtsuji This addresses the issue where FASTQC would be run on qc stages turned off in the config (e.g., adapter trimming and contamination filtering). This is determined inside the generate_fastqc_file_list callback function.

This code needs to be tested live; I've only tested it with a dry run.

LeeBergstrand commented 1 month ago

I had to change the directory structure of the raw read folder by splitting it into short and long read sub-directories. This made it more compatible with wild cards.

@jmtsuji Note that there are two rules for running fastqc (raw and quality controlled) right now. I had trouble combining them and getting the wild cards to work. To merge the two rules, I would have to drastically change the names of the output QC files. After several attempts, I have decided to leave these rules as is.

LeeBergstrand commented 1 month ago

@jmtsuji Looking forward to your review! Any feedback would be appreciated.

LeeBergstrand commented 2 weeks ago

@LeeBergstrand This is a nice addition to allow some QC steps to be optional and still have FastQC / MultiQC work. The logic of generate_fastqc_file_list and its interactions with the previous and subsequent rules is pretty dense, so it's hard for me to gauge how smoothly the function will work without a test using real data. Because I don't see any clear issues with the code, I will go ahead and approve this. However, we should make sure to test this with a real dataset (e.g., with various QC steps toggled on/off) later, before merging to main, to confirm there are no issues. Sounds good? Thanks again for this code addition?

P.S. I think it's OK to leave the 2 separate functions for FastQC like you've currently done 👍

So FASTQC makes html and zip files. This function basically generates what FASTQC files will be produced depending on the QC stages ran for both short and long reads. It uses the wild cards to figure out if it is being invoked for short on long reads. If it is short, it creates each of the outputs for R1 and R2 files and if long only the outputs for long files. For the short files it creates additional output for the contamination and adapter trimming qc stages if they used. As indicated by global variables.

LeeBergstrand commented 2 weeks ago

@jmtsuji Will merge in and we can test later. I've done some limited testing where I remove all the contamination references.

jmtsuji commented 2 weeks ago

@LeeBergstrand Thanks for the additional background! Also, I'm glad this is working when all contamination references are removed.

Sounds good to merge for now. We can check this code in more detail via testing later on.