maxplanck-ie / snakepipes

Customizable workflows based on snakemake and python for the analysis of NGS data
http://snakepipes.readthedocs.io
MIT License

WGBS QC Report: wrong mapping rate and PCR duplication rate #941

Closed: hanrong498 closed this issue 5 months ago

hanrong498 commented 9 months ago

Hi!

I think I might have found a small bug in the WGBS QC Report.

The Mapping rate section reports a mapping rate different from the General Statistics in MultiQC.

WGBS QC report: [screenshot, 2023-10-20 16:54]

MultiQC: [screenshot, 2023-10-20 17:08]

=========================================

I took a look at the .flagstat file:

[hhu@jed all_processed]$ cat QC_metrics/211029_1_FZ_WGBS_HOIK_p35_rep1_S1_L001.flagstat 
391689933 + 164983226 in total (QC-passed reads + QC-failed reads)
391638322 + 164983202 primary
51611 + 24 secondary
0 + 0 supplementary
68380322 + 21154712 duplicates
68380322 + 21154712 primary duplicates
381775564 + 164980582 mapped (97.47% : 100.00%)
381723953 + 164980558 primary mapped (97.47% : 100.00%)
391638322 + 164983202 paired in sequencing
195819161 + 82491601 read1
195819161 + 82491601 read2
363065656 + 0 properly paired (92.70% : 0.00%)
375781506 + 164977914 with itself and mate mapped
5942447 + 2644 singletons (1.52% : 0.00%)
9843812 + 11690 with mate mapped to a different chr
7785225 + 4909 with mate mapped to a different chr (mapQ>=5)

From the code used to calculate the mapping rate in the WGBS pipeline, it seems that it actually extracts the information from the 5th line, which corresponds to `68380322 + 21154712 duplicates`, rather than from the mapped-reads line (line 7).

if (length(snakemake@input[["fstat"]] > 0)) {
    fl = lapply(snakemake@input[["fstat"]], function(x) read.table(x, header=FALSE, sep="\t", quote="", as.is=TRUE))
    totals = sapply(fl, function(x) {
        v = strsplit(x$V1[1], " ")[[1]]
        as.numeric(v[1]) + as.numeric(v[3])
    })
    mapped = sapply(fl, function(x) {
        v = strsplit(x$V1[5], " ")[[1]]
        as.numeric(v[1]) + as.numeric(v[3])
    })
    filtered = totals - sapply(fl, function(x) {
        v = strsplit(x$V1[5], " ")[[1]]
        v2 = strsplit(x$V1[5], " ")[[1]]
        as.numeric(v[1])
    })
    d = data.frame(Total=totals,
                   Mapped=mapped,
                   MappedPercent=100*mapped/totals,
                   Filtered=filtered,
                   FilteredPercent=100*filtered/totals)
    rownames(d) = gsub(".flagstat", "", sapply(snakemake@input[["fstat"]], basename))
    pander(d, style='simple', split.table=Inf)
} else {
    message('No flagstat files found!')
}
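Since the flagstat layout gained extra `primary ...` rows in newer samtools releases (which would shift any hard-coded line index), a more robust approach is to look rows up by their label instead of their position. Here is a minimal sketch of that idea in Python (an illustration only, not the pipeline's actual R code; `flagstat_counts` and the sample data are hypothetical):

```python
# Sketch: parse samtools flagstat rows by label instead of a fixed line
# index, so extra rows (e.g. the "primary" rows in newer samtools
# releases) cannot shift which metric is read.

def flagstat_counts(lines, label):
    """Return (qc_passed, qc_failed) for the first row matching `label`."""
    for line in lines:
        # Rows look like: "381775564 + 164980582 mapped (97.47% : 100.00%)"
        passed, _, failed, rest = line.split(maxsplit=3)
        if rest == label or rest.startswith(label + " ("):
            return int(passed), int(failed)
    raise ValueError(f"no flagstat row labelled {label!r}")

# Abbreviated example taken from the flagstat output above.
example = [
    "391689933 + 164983226 in total (QC-passed reads + QC-failed reads)",
    "391638322 + 164983202 primary",
    "51611 + 24 secondary",
    "0 + 0 supplementary",
    "68380322 + 21154712 duplicates",
    "68380322 + 21154712 primary duplicates",
    "381775564 + 164980582 mapped (97.47% : 100.00%)",
]

total = sum(flagstat_counts(example, "in total"))
mapped = sum(flagstat_counts(example, "mapped"))
print(f"mapping rate: {100 * mapped / total:.2f}%")
```

With label-based lookup, the "duplicates" and "mapped" rows are found correctly whether or not the `primary` rows are present.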

Similarly, the PCR duplication rate code extracts the information from the 4th line, which corresponds to `0 + 0 supplementary`, rather than from the duplicates line (line 5).

if (length(snakemake@input[["fstat"]] > 0)) {
    fl = lapply(snakemake@input[["fstat"]], function(x) read.table(x, header=FALSE, sep="\t", quote="", as.is=TRUE))
    totals = sapply(fl, function(x) {
        v = strsplit(x$V1[1], " ")[[1]]
        as.numeric(v[1]) + as.numeric(v[3])
    })
    dupes = sapply(fl, function(x) {
        v = strsplit(x$V1[4], " ")[[1]]
        as.numeric(v[1]) + as.numeric(v[3])
    })
    d = data.frame(Duplicates=dupes,
                   DuplicationRate=100 * dupes/totals)
    rownames(d) = gsub(".flagstat", "", sapply(snakemake@input[["fstat"]], basename))
    pander(d, style='simple')
} else {
    message('No information on PCR duplicate removal found. If you started the pipeline from bam files, this is expected outcome. Otherwise, an error might have occurred.')
}
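The same label-based idea fixes the duplication rate. A compact variant is to parse the whole flagstat file into a dictionary keyed by row label, then pick out "duplicates" by name. Again a Python sketch for illustration only (`parse_flagstat` and the regex are assumptions, not snakePipes code):

```python
import re

# Sketch: read all flagstat metrics into a dict keyed by row label, so the
# duplication rate comes from the "duplicates" row regardless of position.

# Matches "<passed> + <failed> <label>" with an optional trailing "(...)".
ROW = re.compile(r"^(\d+) \+ (\d+) ([a-z][a-z0-9 ]*?)(?: \(.*)?$")

def parse_flagstat(text):
    metrics = {}
    for line in text.strip().splitlines():
        m = ROW.match(line.strip())
        if m:
            # Keep the first occurrence of each label.
            metrics.setdefault(m.group(3), (int(m.group(1)), int(m.group(2))))
    return metrics

# Abbreviated example taken from the flagstat output above.
example = """\
391689933 + 164983226 in total (QC-passed reads + QC-failed reads)
0 + 0 supplementary
68380322 + 21154712 duplicates
68380322 + 21154712 primary duplicates
381775564 + 164980582 mapped (97.47% : 100.00%)
"""

metrics = parse_flagstat(example)
total = sum(metrics["in total"])
dupes = sum(metrics["duplicates"])
print(f"duplication rate: {100 * dupes / total:.2f}%")
```

Reading `metrics["duplicates"]` here yields the duplicates row, not the supplementary row that the fixed 4th-line index picks up.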

This is potentially the cause of issue #898, where the mapping rate was reported to be very low.

Best, Hanrong

katsikora commented 9 months ago

Dear Hanrong,

thank you very much for sharing your insight. Your points look valid to me. Although I wasn't able to reproduce the low mapping rates from #898 on either our test data or the user's, I can take another look in light of your suggestion.

Best wishes,

Katarzyna

katsikora commented 5 months ago

The fix is now available in snakePipes 2.8.0, 2.8.1.