Generate improved report QC

saramonzon commented 4 years ago

We create this type of report for researchers so that they can see at a glance how the experiment worked:

sample	host	virus sequence	total reads	reads host	% reads host	reads virus	% reads virus	unmapped reads	% unmapped reads	mean DP coverage virus	Coverage > 5x (%)	NumVariantsTrimIVAR	% Ns without primers
201397	human	NC_045512,2	2365486	23052	0,97%	2335037	98,71%	7397	0,3127053	15289,14611	0,985386	5	2.50
201493	human	NC_045512,2	2077038	22299	1,07%	2049058	98,65%	5681	0,2735145	13671,11417	0,997024	6	0.32
201495	human	NC_045512,2	1983106	14431	0,73%	1963048	98,99%	5627	0,28374681	14110,63539	0,997893	7	0.33
201575	human	NC_045512,2	2092372	4073	0,19%	2080486	99,43%	7813	0,37340396	14861,83768	0,998763	7	0.34
201602	human	NC_045512,2	1821320	92766	5,09%	1730140	94,99%	-1586	-0,0870797	11835,88687	0,992308	6	2.27
201607	human	NC_045512,2	2531506	23880	0,94%	2499503	98,74%	8123	0,32087619	18076,30037	0,997458	7	0.49
201617	human	NC_045512,2	2232766	10799	0,48%	2212565	99,10%	9402	0,42109204	16235,09561	0,99786	3	0.32
201706	human	NC_045512,2	2642668	6370	0,24%	2628133	99,45%	8165	0,30896806	18056,17594	0,998763	6	0.24
201709	human	NC_045512,2	1690968	13871	0,82%	1673082	98,94%	4015	0,23743796	11272,06076	0,998595	5	0.64
201738	human	NC_045512,2	2723604	3457	0,13%	2708670	99,45%	11477	0,42139019	18263,53416	0,998763	2	0.32
202050	human	NC_045512,2	1366142	467391	34,21%	898349	65,76%	402	0,02942593	5552,181253	0,95268	5	10.35
202052	human	NC_045512,2	558458	238928	42,78%	167130	29,93%	152400	27,2894291	913,072902	0,946494	4	14.17
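The percentage columns in a table like this can be derived from three read counts per sample. A minimal sketch, assuming the unmapped count is simply total minus host minus virus reads (which matches the rows above); the function name and dictionary keys are hypothetical:

```python
def summarise_reads(total, host, virus):
    """Derive the percentage columns of the summary table from raw read counts."""
    if total <= 0:
        raise ValueError("total reads must be positive")
    # Assumption: unmapped = total - host - virus.
    # A negative value means host + virus reads exceed the total.
    unmapped = total - host - virus
    return {
        "pct_host": round(100 * host / total, 2),
        "pct_virus": round(100 * virus / total, 2),
        "unmapped": unmapped,
        "pct_unmapped": round(100 * unmapped / total, 2),
    }


# Counts taken from sample 201397 in the table above.
row = summarise_reads(total=2365486, host=23052, virus=2335037)
print(row)
```

Running the same function on sample 201602 gives a negative unmapped count, so a check like this could flag rows that need a second look before the table goes to researchers.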

We have also worked on some graphs for the amplicon experiment to see how homogeneous the depth of coverage is among amplicons (using bedtools coverage), but there is a lot of room for improvement here. Including them as custom content in MultiQC would be a plus!!
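One way to put a number on "how homogeneous" is the coefficient of variation of the per-amplicon mean depth. A rough sketch, assuming tab-separated input whose last column is the mean depth per amplicon (for example from `bedtools coverage -a amplicons.bed -b sample.bam -mean`); the file names, amplicon names, coordinates and depth values below are made up for illustration:

```python
import csv
import io
import statistics


def amplicon_depth_cv(tsv_text):
    """Return (mean, coefficient of variation) of per-amplicon mean depth.

    Assumes the last column of every row holds the mean depth of that amplicon.
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    depths = [float(row[-1]) for row in reader if row]
    mean = statistics.mean(depths)
    cv = statistics.pstdev(depths) / mean if mean else float("nan")
    return mean, cv


# Illustrative rows only: chrom, start, end, amplicon name, mean depth.
example = (
    "NC_045512.2\t30\t410\tamplicon_1\t1000\n"
    "NC_045512.2\t320\t726\tamplicon_2\t2000\n"
    "NC_045512.2\t642\t1028\tamplicon_3\t3000\n"
)
mean_depth, cv = amplicon_depth_cv(example)
print(f"mean depth {mean_depth:.0f}, CV {cv:.3f}")
```

A CV close to 0 would indicate even coverage across amplicons; larger values point to amplicons that drop out or dominate.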

stevekm commented 4 years ago

How do you want to make this report? I have done custom R Markdown based reporting before. Not sure if you want that or if you prefer to build it directly into MultiQC?

ewels commented 4 years ago

Nice! Looks like it should fit MultiQC very well. Let me know if you’d like any help.

drpatelh commented 4 years ago

It would be great if we can include the output from ivar trim, kraken2 and varscan mpileup2cns in the MultiQC report as a plugin/custom content. As requested, I have added some test data here @ewels.

Once we are able to add this all into MultiQC then generating a table with a subset of metrics would be great!
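For the "table with a subset of metrics" idea, MultiQC custom content can pick up a `*_mqc.tsv` file that starts with a few `#` configuration lines. A minimal sketch of writing such a file, assuming the header keys (`id`, `section_name`, `plot_type`) behave as described in the MultiQC custom content documentation; the function name, section id and column names are hypothetical:

```python
def build_mqc_table(rows, columns, section_id="summary_metrics",
                    section_name="Summary metrics"):
    """Build the text of a MultiQC custom-content table (tab-separated)."""
    lines = [
        f"# id: '{section_id}'",
        f"# section_name: '{section_name}'",
        "# plot_type: 'table'",
        "\t".join(["Sample"] + columns),
    ]
    for sample, values in rows.items():
        lines.append("\t".join([sample] + [str(values[c]) for c in columns]))
    return "\n".join(lines) + "\n"


# Values taken from the first row of the table at the top of this issue.
table_text = build_mqc_table(
    {"201397": {"total_reads": 2365486, "reads_virus": 2335037}},
    ["total_reads", "reads_virus"],
)
print(table_text)
```

Writing `table_text` to something like `summary_metrics_mqc.tsv` in the MultiQC input directory should then be enough for the table to show up, though the exact header keys are worth checking against the docs.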

apeltzer commented 4 years ago

PR for iVar to MultiQC: https://github.com/ewels/MultiQC/pull/1159
PR for VarScan2 to MultiQC (supports SNP, INDEL and CNS files): https://github.com/ewels/MultiQC/pull/1160

Feedback greatly appreciated - if any of you wants to have a look 👍

ktrns commented 4 years ago

Hi there, I am going through the current MultiQC and this is what I noticed first:

DONE: It would be nice to know what the file names correspond to in "General Statistics" (Sample, Sample_1, Sample_2, Sample_T1, Sample_T1_1, Sample_T1_2). From @drpatelh: The _T1 suffixes won't be visible in the report anymore because the pipeline will now merge the samples right at the beginning, if applicable.
DONE: The order: is it possible to better structure the report such as in "Pipeline Summary" of the README.md? Cutadapt is run for the assembly process (point 6 in README.md), yet Bowtie 2 is reported later in MultiQC although it is run for the variant calling process (point 5 in README.md)
NOT POSSIBLE: Nesting? I would think that a clear order and overview would be great, I don't know if nesting is possible with MultiQC. The feedback I got with the viralrecon MultiQC so far is that people got a bit lost.

... more later.

Thank you!!

ktrns commented 4 years ago

Hi @drpatelh, I think the new organisation of your MultiQC report is much nicer. Here are a few comments:

Per Sequence GC Content: Normal random library typically have a roughly normal distribution of GC content. -> libraries? (I am not a native English speaker, but seems like a mix of singular and plural)
DONE: VARIANTS: SAMTools (iVar): This section of the report shows SAMTools counts/statistics after primar sequence removal with iVar. -> primer not primar
Mapped reads per contig: The default is "Normalised counts" - is that a useful plot for this pipeline? We already know the number of mapped reads from previous tables and, as discussed yesterday, the pipeline works for single-chromosome genomes for now. Seems redundant to me.
Indel Distribution: Should plots like this one be skipped if there are no indels at all? I guess this applies to all Bcftools plots; they look empty in your report. Maybe empty plots could be skipped and replaced with a message saying that there are no indels (or whatever is missing).

Thanks a lot for all the work! Further suggestions will follow :-). Best Katrin

drpatelh commented 4 years ago

Based on the current version of the pipeline, the files below are collated in the MultiQC work directory. I have listed the files that we could use to create a custom table for the samples as a starting point. We would have to test and check how these are reported for both PE and SE reads. One idea I discussed briefly with @ewels was to use the parsing functionality of MultiQC modules to read the data directly into a custom Python script that we could then use to collate the data and output it as a table.

e.g. https://github.com/ewels/MultiQC/blob/master/multiqc/modules/samtools/flagstat.py
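To illustrate the kind of parsing involved, here is a standalone sketch that pulls the total and mapped read counts out of `samtools flagstat` text. It is not MultiQC's own parser and does not use its API; the regular expression, function name and example counts are assumptions for illustration:

```python
import re

# Matches lines such as "1000 + 0 in total (...)" and "950 + 0 mapped (...)".
# Lines like "... primary mapped (...)" are deliberately not matched.
FLAGSTAT_LINE = re.compile(r"^(\d+) \+ (\d+) (in total|mapped) \(", re.MULTILINE)


def parse_flagstat(text):
    """Return QC-passed 'total' and 'mapped' read counts from flagstat output."""
    counts = {}
    for passed, _failed, label in FLAGSTAT_LINE.findall(text):
        key = "total" if label == "in total" else "mapped"
        counts[key] = int(passed)
    return counts


# Made-up flagstat excerpt, only to exercise the parser.
example = """\
1000 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
950 + 0 mapped (95.00% : N/A)
"""
flagstat_counts = parse_flagstat(example)
print(flagstat_counts)
```

Calling this once per `*.flagstat` file and keying the results by sample name would give the "reads mapped to viral genome" column of a collated table.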

###############################
## PREPROCESSING METRICS
###############################

## TOTAL NUMBER OF INPUT READS
├── fastqc
│   ├── SAMPLE1_PE_1.merged_fastqc.html
│   ├── SAMPLE1_PE_1.merged_fastqc.zip

## NUMBER OF READS LEFT AFTER ADAPTER & QUALITY TRIMMING RAW FASTQ
├── fastp
│   └── log
│       ├── SAMPLE1_PE.fastp.html
│       ├── SAMPLE1_PE.fastp.json
│       ├── SAMPLE1_PE.fastp.log

###############################
## VARIANT CALLING METRICS
###############################

## NUMBER OF READS MAPPED TO VIRAL GENOME
├── bowtie2
│   ├── flagstat
│   │   ├── SAMPLE1_PE.sorted.bam.flagstat
│   └── log
│       ├── SAMPLE1_PE.bowtie2.log

## TOTAL NUMBER OF VARIANTS CALLED
## TOTAL NUMBER OF Ns in consensus
├── varscan2
│   ├── quast
│   │   └── highfreq
│   │       └── quast
│   └── variants
│       ├── highfreq
│       │   ├── SAMPLE1_PE.highfreq.varscan2.log
│       └── lowfreq
│           ├── SAMPLE1_PE.lowfreq.varscan2.log

## INSERT SIZE MEAN AND STD DEV
## COVERAGE METRICS?
## OTHERS?
├── picard
│   ├── SAMPLE1_PE.trim.CollectMultipleMetrics.alignment_summary_metrics
│   ├── SAMPLE1_PE.trim.CollectMultipleMetrics.insert_size_metrics
│   ├── SAMPLE1_PE.trim.CollectWgsMetrics.coverage_metrics

## TOTAL NUMBER OF Ns in consensus
## TOTAL NUMBER OF READS LEFT AFTER PRIMER TRIMMING
## TOTAL NUMBER OF VARIANTS CALLED WITH IVAR
├── ivar
│   ├── consensus
│   │   └── quast
│   │       └── quast
│   ├── trim
│   │   ├── flagstat
│   │   │   ├── SAMPLE1_PE.trim.sorted.bam.flagstat
│   │   └── log
│   │       ├── SAMPLE1_PE.trim.ivar.log
│   └── variants
│       ├── counts
│       │   ├── SAMPLE1_PE.variant.counts_mqc.tsv

###############################
## DE NOVO ASSEMBLY METRICS
###############################

## NUMBER OF READS LEFT AFTER PRIMER TRIMMING RAW FASTQ
├── cutadapt
│   └── log
│       ├── SAMPLE1_PE.cutadapt.log

## NUMBER OF CLASSIFIED/UNCLASSIFIED READS
├── kraken2
│   ├── SAMPLE1_PE.kraken2.report.txt

## NUMBER OF Ns in ASSEMBLY 
## OTHER ASSEMBLY METRICS
├── spades
│   ├── quast
│   │   └── quast
├── metaspades
│   ├── quast
│   │   └── quast
├── unicycler
│   ├── quast
│   │   └── quast
├── minia
│   ├── quast
│   │   └── quast

###############################
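As a sketch of how one of the listed files could feed the custom table, the fastp JSON report can supply the read counts before and after adapter and quality trimming. This assumes the `summary.before_filtering.total_reads` and `summary.after_filtering.total_reads` keys of the fastp JSON output; the function name, output keys and example numbers are hypothetical:

```python
import json


def fastp_read_counts(json_text):
    """Extract read counts before/after filtering from a fastp JSON report."""
    summary = json.loads(json_text)["summary"]
    before = summary["before_filtering"]["total_reads"]
    after = summary["after_filtering"]["total_reads"]
    pct_kept = round(100 * after / before, 2) if before else 0.0
    return {"reads_before": before, "reads_after": after, "pct_kept": pct_kept}


# Made-up minimal report containing only the keys used above.
example_json = json.dumps({
    "summary": {
        "before_filtering": {"total_reads": 1000},
        "after_filtering": {"total_reads": 900},
    }
})
fastp_counts = fastp_read_counts(example_json)
print(fastp_counts)
```

The same pattern (one small function per tool output, each returning a flat dictionary) would make it straightforward to merge the per-sample metrics into a single row, and to test PE and SE inputs separately.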

drpatelh commented 4 years ago

Will be mostly fixed in https://github.com/nf-core/viralrecon/pull/102

nf-core / viralrecon

Generate improved report QC #9