Support aggregate reporting for demultiplexed FASTQ files

s-andrews / FastQC

A quality control analysis tool for high throughput sequencing data

GNU General Public License v3.0


Support aggregate reporting for demultiplexed FASTQ files #124

Open mtomko opened 1 year ago

mtomko commented 1 year ago

Our group has long generated FastQC reports for a single lane of sequencing at a time. Our sequencing provider now provides only demultiplexed FASTQs, which means that we need to look at hundreds of FastQC reports instead of just two. We would be interested in a FastQC option that generates one aggregate report for all of the demultiplexed FASTQ files, summarizing their overall quality. This would be akin to the report produced by simply concatenating all the FASTQ files and running FastQC on the result.

I would consider implementing this myself if it would be welcome.

mtomko commented 1 year ago

Ah, my coworker has pointed out that it's possible to do this by reading from standard in:

If you want to run fastqc on a stream of data to be read from standard input then you can do this by specifying 'stdin' as the name of the file to be processed and then streaming uncompressed fastq format data to the program. For example:
zcat *fastq.gz | fastqc stdin
If you want the results from a streamed analysis sent to a file with a name other than stdin then you can add a colon and put the file name you want, for example:
zcat *fastq.gz | fastqc stdin:my_results
..would write results to my_results.html and my_results.zip.
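
For the "two reports per lane" case described above, the streaming approach could be wrapped in a small shell function. This is only a sketch: `aggregate_fastqc` is a hypothetical helper name, and the `_R1_` / `_R2_` file name patterns in the example calls are an assumption about how the demultiplexed files are named, not something FastQC requires.

```shell
# Hypothetical helper: stream several gzipped FASTQ files into one FastQC run.
# Usage: aggregate_fastqc LABEL FILE.fastq.gz [FILE.fastq.gz ...]
aggregate_fastqc() {
  local label="$1"
  shift
  # gzip -dc decompresses every file to stdout in order, like zcat;
  # FastQC then reads the combined stream via the special name "stdin".
  gzip -dc "$@" | fastqc "stdin:${label}"
}

# Example calls, assuming file names that contain _R1_ / _R2_:
#   aggregate_fastqc lane1_R1 *_R1_*.fastq.gz
#   aggregate_fastqc lane1_R2 *_R2_*.fastq.gz
```

Keeping read 1 and read 2 in separate calls avoids mixing the two read directions in a single report.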

s-andrews commented 1 year ago

You've found one option for this, which will combine the full set of results. To be honest, if you're just looking at data quality then it's pretty unlikely that you'll see a difference in quality between the different split subsets of reads, so any of the reports is likely to be representative.

The other option to consider is MultiQC (https://multiqc.info/), which you can run in a directory containing multiple reports from FastQC (and other programs); it will aggregate them into a single combined report. We use this at the end of our sequencing pipelines and it works great for this purpose.
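
A minimal sketch of that workflow, assuming both `fastqc` and `multiqc` are on the PATH. The function name `fastqc_then_multiqc` and the `qc_reports` directory in the example call are made up for illustration:

```shell
# Hypothetical helper: per-file FastQC reports, then one MultiQC summary.
# Usage: fastqc_then_multiqc OUTDIR FILE.fastq.gz [FILE.fastq.gz ...]
fastqc_then_multiqc() {
  local outdir="$1"
  shift
  # FastQC does not create its output directory, so make it first.
  mkdir -p "$outdir"
  # -o tells FastQC where to write the per-file reports.
  fastqc -o "$outdir" "$@" || return 1
  # MultiQC scans the given directory for reports; -o sets where
  # its combined report is written.
  multiqc -o "$outdir" "$outdir"
}

# Example call:
#   fastqc_then_multiqc qc_reports *.fastq.gz
```

Unlike the streamed approach, this keeps the individual reports, so a single problematic sample can still be inspected on its own.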