Use MultiQC reports to generate tables in automated fashion

hputnam commented 4 years ago

Trimmed

Bismark

shellywanamaker commented 4 years ago

trimmed multiqc report files are here:

After the MD5 file hashes are checked, MultiQC needs to be run on:

Mcap RRBS alignments: https://gannet.fish.washington.edu/seashell/bu-mox/scrubbed/031520-TG-bs/Mcap_tg/dedup/
Mcap WGBS_MBD alignments: https://gannet.fish.washington.edu/seashell/bu-mox/scrubbed/031520-TG-bs/Mcap_tg/nodedup/
Pact RRBS alignments: https://gannet.fish.washington.edu/seashell/bu-mox/scrubbed/031520-TG-bs/Pact_tg/dedup/
Pact WGBS_MBD alignments: https://gannet.fish.washington.edu/seashell/bu-mox/scrubbed/031520-TG-bs/Pact_tg/dedup/
Pact-C1 RRBS alignments: https://gannet.fish.washington.edu/seashell/bu-mox/scrubbed/032320-Pact-C1/Pact_C1/nodedup/
Pact-C1 WGBS_MBD alignments: https://gannet.fish.washington.edu/seashell/bu-mox/scrubbed/032320-Pact-C1/Pact_C1/dedup/
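The MD5 check mentioned above could be scripted along these lines. This is only a minimal sketch: it assumes the checksums are available as `md5sum`-style text (`<hash>  <filename>` per line), and the helper names `md5_of_file`, `parse_checksums`, and `failed_files` are made up for illustration, not taken from the repo.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_checksums(text):
    """Parse md5sum-style lines ('<hash>  <filename>') into {filename: hash}."""
    expected = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        checksum, name = line.split(maxsplit=1)
        # md5sum marks binary-mode entries with a leading '*'; drop it.
        expected[name.lstrip("*")] = checksum.lower()
    return expected


def failed_files(directory, expected):
    """Return the names that are missing or whose MD5 does not match."""
    failures = []
    for name, checksum in sorted(expected.items()):
        path = Path(directory) / name
        if not path.is_file() or md5_of_file(path) != checksum:
            failures.append(name)
    return failures
```

If `failed_files(...)` returns an empty list for a directory, that directory would be ready for MultiQC.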

shellywanamaker commented 4 years ago

MultiQC:

script here: https://github.com/hputnam/Meth_Compare/blob/master/scripts/MethCompare_MultiQC.ipynb
output directory: https://gannet.fish.washington.edu/metacarcinus//FROGER_meth_compare/20200413
html report: https://gannet.fish.washington.edu/metacarcinus//FROGER_meth_compare/20200413/multiqc_report.html

working on generating tables in R

shellywanamaker commented 4 years ago

created the following draft tables using this R markdown: https://github.com/hputnam/Meth_Compare/blob/master/analyses/FormatMultiQC/FormatMultiQC.Rmd

So far I've just merged tables from the MultiQC output so there's a lot of info (e.g. % CpG methylation before/after methylation extractor).

@hputnam @mgavery @yaaminiv @sr320 give the tables a look and we can decide what columns to keep or remove.

yaaminiv commented 4 years ago

The trimming information is supplemental, right? I don't see any reason to leave out any of the columns in that file.

@shellytrigg For the Pact/Mcap alignments, do the duplicate columns (e.g. aligned_reads.x and aligned_reads.y) map to the paired files? I think we collapsed the paired files in the big mapping stats table, which makes it easier to interpret if we're including the mapping information as an in-text table.

Could you include a "no genomic sequence" column? And does "discarded reads" mean duplicated reads?

I know the strand alignment information is in the MultiQC report, but I don't know enough about sequencing methods to know whether or not that would be an important thing to report. I don't think we need the unmethylated CpG/CHG/CHH information.

hputnam commented 4 years ago

We need the unmethylated CpG/CHG/CHH information to calculate lambda conversion efficiency. It would be great if we could do this in the same script and table. For now, I have done it in a standalone Excel file: https://github.com/hputnam/Meth_Compare/blob/master/metadata/lambda_conversion.xlsx

hputnam commented 4 years ago

It would also be interesting at some point to compare the conversion efficiency calculated from the lambda spike-in with an estimate based only on non-CpG methylation, sensu Liew et al. 2018, to think about the rationale for including lambda spikes. "Lambda DNA, which can be spiked-in to estimate the combined nonconversion and mis-sequencing rate during the bisulfite treatments, was not used in our sequencing runs. However, as we observed that the rates of non-CpG methylation (CHG and CHH, where H = non-G base) were at 0.1% in all samples (data S1), the combination of nonconversion and mis-sequencing would be—at worst—0.1%, if we assumed that CHG and CHH methylation does not occur in S. pistillata."
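One way to read the non-CpG approach in the quote is sketched below: treat every methylated CHG/CHH call as nonconversion or mis-sequencing, compute that percentage per sample, and take the largest value as the worst case. This is an assumption about the method, not the calculation from Liew et al., and the sample names and counts are invented for illustration.

```python
def non_cpg_methylation_percent(meth_chg, unmeth_chg, meth_chh, unmeth_chh):
    """Percent of CHG + CHH calls reported as methylated in one sample."""
    total = meth_chg + unmeth_chg + meth_chh + unmeth_chh
    if total == 0:
        raise ValueError("no CHG/CHH calls for this sample")
    return 100.0 * (meth_chg + meth_chh) / total


# Hypothetical counts: (meth CHG, unmeth CHG, meth CHH, unmeth CHH).
samples = {
    "sample_1": (10, 9990, 30, 29970),
    "sample_2": (5, 9995, 15, 29985),
}

per_sample = {
    name: non_cpg_methylation_percent(*counts)
    for name, counts in samples.items()
}

# If CHG/CHH methylation is assumed not to occur, the largest per-sample
# value is an upper bound on nonconversion plus mis-sequencing.
worst_case = max(per_sample.values())
```

Comparing `100 - worst_case` with the lambda-based conversion efficiency for the same libraries would show how far the two estimates diverge.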

shellywanamaker commented 4 years ago

@yaaminiv :

1) The duplicate column names (e.g. aligned_reads.x and aligned_reads.y) come from the Bismark alignment report and from the reports produced after deduplication and methylation extraction. The numbers aren't always identical (for example, because reads are removed during deduplication), so I could make the column names more specific.

2) The "no genomic sequence" column appears to be named "discarded reads" in the Bismark alignment report. Take a look at the Bismark section of the MultiQC HTML report, which shows "no genomic sequence" in the alignment scores plot. The numbers match the discarded reads numbers in the Bismark alignment report.

@hputnam I can look into calculating lambda conversion in R based off your xlsx. For your second point on comparing lambda conversion vs. using non-CpG methylation rates, is there a calculation that was done? Or are they just using the max % methylation of CHG or CHH in any sample?

hputnam commented 4 years ago

Here is what they supplied for their conversion calculation (see the Data S1 tab): https://advances.sciencemag.org/highwire/filestream/204665/field_highwire_adjunct_files/0/aar8028_DataS1_to_S9.xlsx

yaaminiv commented 4 years ago

@shellytrigg Thanks for clarifying! I think making the column names more specific would be helpful.
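The column-name issue discussed above could be handled at merge time by tagging shared columns with their source report. This is a rough sketch in plain Python, not the code in FormatMultiQC.Rmd: the function name `merge_by_sample`, the tags `align` and `dedup`, and the numbers are all made up for illustration.

```python
def merge_by_sample(left, right, left_tag, right_tag):
    """Join two {sample: {column: value}} tables on sample name.

    Columns present in both tables get a source-specific suffix
    (e.g. aligned_reads_align / aligned_reads_dedup) instead of the
    generic .x / .y suffixes that R's merge() adds by default.
    """
    left_cols = {col for row in left.values() for col in row}
    right_cols = {col for row in right.values() for col in row}
    shared = left_cols & right_cols

    merged = {}
    for sample in sorted(left.keys() & right.keys()):  # inner join on sample
        row = {}
        for col, value in left[sample].items():
            row[f"{col}_{left_tag}" if col in shared else col] = value
        for col, value in right[sample].items():
            row[f"{col}_{right_tag}" if col in shared else col] = value
        merged[sample] = row
    return merged


# Made-up numbers, for illustration only.
alignment = {"sample_1": {"aligned_reads": 1000, "discarded_reads": 5}}
dedup = {"sample_1": {"aligned_reads": 1000, "dup_reads": 120}}

table = merge_by_sample(alignment, dedup, "align", "dedup")
print(table["sample_1"])
```

Naming the suffix after the report keeps both values in the table while making it clear which processing step each one came from.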

shellywanamaker commented 4 years ago

@hputnam here is the lambda stats table generated by the same R markdown (https://github.com/hputnam/Meth_Compare/blob/master/analyses/FormatMultiQC/FormatMultiQC.Rmd) using the MultiQC output from the lambda alignments.

The stats table includes conversion efficiency calculated based on your Excel file (https://github.com/hputnam/Meth_Compare/blob/master/metadata/lambda_conversion.xlsx). These calculations only include CHH and CHG data from the lambda alignments (see lines 82-98 of the R markdown file).
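For readers without the spreadsheet open, the calculation can be sketched roughly as below. Since the lambda spike-in is unmethylated, any cytosine still called methylated counts as unconverted. The exact formula in lambda_conversion.xlsx is not reproduced here; pooling CHG and CHH counts this way is an assumption, and the function name and numbers are illustrative.

```python
def conversion_efficiency(meth_chg, unmeth_chg, meth_chh, unmeth_chh):
    """Percent of lambda CHG + CHH cytosines called unmethylated (converted)."""
    total = meth_chg + unmeth_chg + meth_chh + unmeth_chh
    if total == 0:
        raise ValueError("no CHG/CHH calls in the lambda alignment")
    return 100.0 * (unmeth_chg + unmeth_chh) / total


# Illustrative counts only: 3970 of 4000 calls are unmethylated.
efficiency = conversion_efficiency(
    meth_chg=10, unmeth_chg=990, meth_chh=20, unmeth_chh=2980
)
print(f"{efficiency:.2f}%")  # 99.25%
```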

I'm going to open a new issue for comparing conversion efficiency calculated with lambda vs. conversion efficiency estimated from % CHH and CHG methylation.

shellywanamaker commented 4 years ago

Lastly, here are the tables with cleaned-up columns:

hputnam / Meth_Compare

Use MultiQC reports to generate tables in automated fashion #37