rotary-genomics / rotary

Assembly/annotation workflow for Nanopore-based microbial genome data containing circular DNA elements
BSD 3-Clause "New" or "Revised" License
2 stars 1 forks source link

Add QC Stats #161

Closed LeeBergstrand closed 1 month ago

LeeBergstrand commented 2 months ago

Add QC stats to Rotary.

jmtsuji commented 2 months ago

@LeeBergstrand Like I mentioned in #91 , this is a nice start toward making QC summary reports. Here are a few observations of mine after testing this code:

If you agree that this PR is a good way to address #91, then we can start to refine this PR including addressing the comments above. Thanks again!

LeeBergstrand commented 2 months ago

@LeeBergstrand Like I mentioned in #91 , this is a nice start toward making QC summary reports. Here are a few observations of mine after testing this code:

  • It would be ideal to also run FastQC on the raw reads (before input to the QC module)
  • If we want to run FastQC on intermediate QC steps like you have done in this draft code, we will need to modify the FastQC rules somehow so that each intermediate QC step is optional. Depending on how the user sets up their config file, some of the QC steps might be skipped, and I think this would produce an error in the current code. I can think of two different ways to make FastQC for intermediate steps optional:

    • Keep the current fastqc rule but add more complex logic to each input variable
    • Change the current fastqc rule so that it runs on each FastQ output file during QC, one file at a time, based on wildcards. In this case, the DAG would just figure out when to run the fastqc rule dynamically.
  • Once ready, we should add another rule that runs MultiQC. Based on my quick testing, it would be ideal to generate separate MultiQC reports for short reads and long reads.

If you agree that this PR is a good way to address #91, then we can start to refine this PR including addressing the comments above. Thanks again!

Sounds good. I'll pursue this.

LeeBergstrand commented 1 month ago

@jmtsuji This is essentially a preliminary pass of generating QC stats for Rotary.

  1. Run FASTQC on short and long reads.
  2. Run MultiQC on short and long FASTQC outputs.

Caveats

Does this look good enough to merge in for now?

@jmtsuji Do you have any comments or suggestions for improvements?

LeeBergstrand commented 1 month ago

@jmtsuji Requesting merge of https://github.com/rotary-genomics/rotary/pull/173 (assembly stats using quast) into this branch.

jmtsuji commented 1 month ago

@LeeBergstrand Thanks so much for these updates. Things are a bit crazy on my end at the moment, but I'll take a look at #161 and #173 by the end of this week. Let me know if a code review is urgent, and I can take a look sooner. Thanks!

LeeBergstrand commented 1 month ago

@LeeBergstrand Thanks so much for these updates. Things are a bit crazy on my end at the moment, but I'll take a look at #161 and #173 by the end of this week. Let me know if a code review is urgent, and I can take a look sooner. Thanks!

@jmtsuji End of the week should be fine.

jmtsuji commented 1 month ago

@jmtsuji Requesting merge of #173 (assembly stats using quast) into this branch.

@LeeBergstrand Approved -- feel free to merge this in. I can run an end-to-end test of the merged code in this branch once we're satisfied with the FastQC reports.

LeeBergstrand commented 1 month ago

@jmtsuji Great!

LeeBergstrand commented 1 month ago

My only concern is that rule run_fastq_short might error out if the user turns off one of the short read QC steps (e.g., adapter trimming) in the config file, because QC_SHORT_FILE_TYPES is hard-coded to include '_reformat_', '_adapter_trim_', '_quality_trim_'. Do you agree that this will likely be an issue? (I haven't tested the code to confirm, but I can do so if needed.)

This can be addressed by changing perform_adapter_trimming into a variable flag and then using that to turn off _adaptertrim as an input. I'm considering creating a function to address creating inputs.