rotary-genomics / rotary

Assembly/annotation workflow for Nanopore-based microbial genome data containing circular DNA elements
BSD 3-Clause "New" or "Revised" License

Report read QC stats #91

Open jmtsuji opened 9 months ago

jmtsuji commented 9 months ago

Make a rule that runs a QC program (e.g., FastQC) on raw reads and after each QC step. A separate rule might be needed for long vs. short reads if the required stats differ.

In addition, make a summary rule that integrates the QC stats from each individual QC rule into a single summary file. I imagine we should make a separate summary file for long reads vs. short reads.

LeeBergstrand commented 8 months ago

@jmtsuji As discussed in https://github.com/rotary-genomics/rotary/issues/92#issuecomment-1858534761, we may want to also compile a bunch of the rule monitoring files (e.g., {sample}_repaired_info.tsv, {sample}_circular_info.tsv) into larger TSVs so we can delete entire folders for some steps.
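A minimal sketch of that compilation step, assuming each per-sample file has the same header row and that the sample ID can be recovered by stripping a known suffix such as `_circular_info.tsv`. The function name and the column names in the usage note are hypothetical, not taken from rotary:

```python
import csv
from pathlib import Path


def combine_info_tables(info_paths, suffix, combined_path):
    """Concatenate per-sample TSVs into one TSV with a leading 'sample' column."""
    header = None
    rows = []
    for path in sorted(Path(p) for p in info_paths):
        # Recover the sample ID from a name like "{sample}_circular_info.tsv".
        sample = path.name[: -len(suffix)]
        with path.open(newline="") as handle:
            reader = csv.reader(handle, delimiter="\t")
            file_header = next(reader, None)
            if file_header is None:
                continue  # skip empty files
            if header is None:
                header = file_header
            elif file_header != header:
                raise ValueError(f"Header mismatch in {path}")
            rows.extend([sample, *row] for row in reader)
    with open(combined_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["sample", *(header or [])])
        writer.writerows(rows)
    return len(rows)
```

Called as `combine_info_tables(paths, "_circular_info.tsv", "circular_info_all.tsv")`, this would leave a single table behind, so the per-sample folder could then be deleted.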

LeeBergstrand commented 4 months ago

@jmtsuji What were you thinking of for the summary file? Like a TSV file? I have used https://multiqc.info for some summaries, but given the number of QC steps, it might get confusing.

jmtsuji commented 4 months ago

@LeeBergstrand I think my opinion has changed a little bit since I first posted this issue. Personally, I'd be happy with just collecting basic stats for the reads after each QC step -- e.g., determining the total number of reads remaining and the average read length. These basic stats could then be compiled into a TSV file at the end of the QC module (e.g., one TSV file for short reads and one TSV file for long reads). I think BBMap, and probably several other basic read manipulation tools, can collect such basic read stats. What do you think about this approach?

I'd be open to running MultiQC, too. It might simplify things to just run MultiQC on the final reads at the end of the QC module. On the other hand, I wonder if MultiQC is needed, given that we will already be running several other read QC tools beforehand that should clean up the reads. Do you think the TSV file(s) described above would be sufficient for most use cases (and users who want more detailed stats could run MultiQC themselves)? Or would it be better to include MultiQC? Let me know your thoughts -- thanks.
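For illustration, the basic-stats TSV could be sketched in plain Python as below. This is only a sketch under stated assumptions: FASTQ records are assumed to span exactly four lines each, and the function names, step names, and column names are hypothetical. A tool such as BBMap could supply the same numbers instead.

```python
import csv
import gzip
from pathlib import Path


def open_fastq(path):
    """Open a FASTQ file as text, transparently handling gzip compression."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return path.open("r")


def read_stats(fastq_path):
    """Return (read_count, mean_read_length) for a four-line-per-record FASTQ."""
    read_count = 0
    total_bases = 0
    with open_fastq(fastq_path) as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 1:  # sequence line of each record
                read_count += 1
                total_bases += len(line.strip())
    mean_length = total_bases / read_count if read_count else 0.0
    return read_count, mean_length


def write_summary(steps, summary_path):
    """Write one TSV row per (step_name, fastq_path) pair, in the order given."""
    with open(summary_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["qc_step", "read_count", "mean_read_length"])
        for step_name, fastq_path in steps:
            read_count, mean_length = read_stats(fastq_path)
            writer.writerow([step_name, read_count, f"{mean_length:.1f}"])
```

Running `write_summary` once with the long-read files and once with the short-read files would give the two separate TSVs described above.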

jmtsuji commented 4 months ago

P.S. Related to #157

LeeBergstrand commented 4 months ago

MultiQC is an aggregator. It takes several FastQC outputs and combines their plots. So, if one FastQC HTML report (I think FastQC produces both visual charts and TSV data) has a line chart showing how read quality in a single FASTQ file decreases along the read length, then a MultiQC report would plot several lines, one for each FastQC result used as input.

Workflow:

FASTQ1 --> FASTQC1 ->\
                       --> MULTIQC1 (Summary Report)
FASTQ2 --> FASTQC2 ->/ 

I mostly use MultiQC to aggregate QC results across samples. For example, in my 16S pipeline, I run FastQC on multiple sample FASTQ files before and after trimming and then run MultiQC on these results to make two multi-sample summary reports, one for before QC and one for after.
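As a rough sketch of the workflow above, the two tools can be chained from Python with `subprocess`. The helper names, directory layout, and `dry_run` switch are assumptions for illustration; only the `fastqc` options `--outdir` and `--threads` and the `multiqc` options `--outdir` and `--filename` are real command-line flags.

```python
import subprocess
from pathlib import Path


def fastqc_command(fastq_path, out_dir, threads=1):
    """Build the FastQC command for a single FASTQ file."""
    return ["fastqc", "--outdir", str(out_dir), "--threads", str(threads), str(fastq_path)]


def multiqc_command(fastqc_dir, out_dir, report_name):
    """Build the MultiQC command that aggregates every FastQC result in fastqc_dir."""
    return ["multiqc", str(fastqc_dir), "--outdir", str(out_dir), "--filename", report_name]


def run_qc_report(fastq_paths, work_dir, report_name="multiqc_report", dry_run=False):
    """Run FastQC on each FASTQ file, then one MultiQC pass over all the results."""
    fastqc_dir = Path(work_dir) / "fastqc"
    commands = [fastqc_command(path, fastqc_dir) for path in fastq_paths]
    commands.append(multiqc_command(fastqc_dir, work_dir, report_name))
    if not dry_run:
        # FastQC expects its output directory to exist already.
        fastqc_dir.mkdir(parents=True, exist_ok=True)
        for command in commands:
            subprocess.run(command, check=True)
    return commands
```

With `dry_run=True` the function only returns the command lists, which is handy for checking what would be run before FastQC and MultiQC are installed.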

@jmtsuji Questions:

  1. Would aggregating charts for each stage of the QC and putting them all on the same line plots make any sense to you?
  2. Would aggregating charts across samples make any sense to you?

You can also use MultiQC to aggregate QUAST reports from several assemblies.

LeeBergstrand commented 4 months ago

@jmtsuji I'll look into this. What info do we need from FastQC? Or do you want to skip the whole FastQC idea altogether? It provides much more data than what you describe for the TSV above.

Can you run through this workflow below and tell me what stats you like and dislike? Should we run this workflow before and after QC is complete and aggregate across samples? What are your thoughts?

FASTQ1 --> FASTQC1 ->\
                       --> MULTIQC1 (Summary Report)
FASTQ2 --> FASTQC2 ->/

LeeBergstrand commented 4 months ago

@jmtsuji Some of my initial code for this is in Draft Pull Request: https://github.com/rotary-genomics/rotary/pull/161

jmtsuji commented 4 months ago

@LeeBergstrand Ah, neat! I hadn't realized how generalizable use of MultiQC was (just took a look at the website). This tool looks handy as a general-purpose aggregator. Thanks for the clarification.

jmtsuji commented 4 months ago

Will take a closer look at this (e.g., FastQC outputs) and get back to you soon. Thanks for the draft PR!

jmtsuji commented 4 months ago

@LeeBergstrand Finally had the chance to take a look at this. I ran the code in #161 on a test sample and then ran MultiQC on the FastQC outputs. I quite like the HTML summary report from MultiQC! Seems quite informative and helpful. MultiQC also outputs TSV summaries of general read stats, which is great.

The only thing I noticed that seemed to be missing was that the average quality score of the reads was not reported in the TSV summary, although it was reported in the HTML file. I'd be willing to live with this for now.
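If the average quality score were needed in a TSV later, one option would be to compute it directly from the FASTQ file. The sketch below is an assumption-laden illustration, not what FastQC or MultiQC do internally: it assumes four-line records and Phred+33 encoding, and it takes the plain arithmetic mean of per-base Phred scores, which may differ from the value shown in the HTML report. The function name is hypothetical.

```python
def mean_base_quality(fastq_lines, phred_offset=33):
    """Arithmetic mean of per-base Phred scores over all reads (four-line records)."""
    total_quality = 0
    total_bases = 0
    for line_number, line in enumerate(fastq_lines):
        if line_number % 4 == 3:  # quality line of each record
            quality_string = line.rstrip("\r\n")
            total_quality += sum(ord(char) - phred_offset for char in quality_string)
            total_bases += len(quality_string)
    return total_quality / total_bases if total_bases else 0.0
```

The function accepts any iterable of lines, so an open file handle (plain or gzip in text mode) can be passed straight in.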

I think it would be reasonable to include all intermediate QC steps in the MultiQC summary, not just the raw and final reads. I tried making a MultiQC summary using all of the intermediate files in #161, and it still seemed quite readable. We might need to put some kind of number prefix in front of the QC steps so they are sorted in a logical order in the MultiQC plot. Also, we would need to make a separate report for short and long reads, because these often didn't summarize well in a single plot.
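One way the number prefix could work is sketched below, assuming the QC steps are known as an ordered list and that the prefixed name ends up in the file names MultiQC uses as sample labels. The function and the step names in the usage note are hypothetical.

```python
def numbered_step_names(step_names):
    """Prefix each QC step with a zero-padded index so names sort in pipeline order."""
    # Pad to at least two digits so that step 10 still sorts after step 9.
    width = max(2, len(str(len(step_names))))
    return [
        f"{index:0{width}d}_{name}"
        for index, name in enumerate(step_names, start=1)
    ]
```

For example, `numbered_step_names(["raw", "adapter_trim", "quality_filter", "final"])` gives `01_raw`, `02_adapter_trim`, `03_quality_filter`, `04_final`. The same helper could be called once for the long-read steps and once for the short-read steps, matching the two separate reports.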

Overall, I'd be OK to move forward with making MultiQC summary plots based on FastQC outputs. @LeeBergstrand are you also in favour? I'll leave more specific code comments in #161. Thanks again for getting started with QC reporting.

LeeBergstrand commented 4 months ago

@jmtsuji Sounds good. I will pursue this.

jmtsuji commented 4 months ago

@LeeBergstrand Great!