laurahspencer / DuMOAR

0 stars 0 forks source link

MBD-BS: align and call meth status using Bismark #15

Closed laurahspencer closed 10 months ago

laurahspencer commented 1 year ago

Aligned trimmed MBD-BS data to Dungeness crab genome using Bismark, called meth status, generated reports. Used the slurm script bismark.sh.

Very weird results - average of ~6% alignment rate! Something went wrong, must investigate.

sr320 commented 1 year ago

Kind of looks like you are aligning to cod genome?

laurahspencer commented 1 year ago

Haha- you are right. Wow how did I do that. Thanks for the catch!

kristamnichols commented 1 year ago

That would explain it!

laurahspencer commented 1 year ago

Bismark is still running (with the right genome), but here's a sneak peak at the results:

Alignment rate:

CH01_06_S1_R1_val_1_bismark_bt2_PE_report.txt:Mapping efficiency:       50.3%
CH01_14_S2_R1_val_1_bismark_bt2_PE_report.txt:Mapping efficiency:       46.6%
CH01_22_S3_R1_val_1_bismark_bt2_PE_report.txt:Mapping efficiency:       50.9%
CH01_38_S4_R1_val_1_bismark_bt2_PE_report.txt:Mapping efficiency:       46.7%
CH03_04_S5_R1_val_1_bismark_bt2_PE_report.txt:Mapping efficiency:       31.3%
CH03_15_S6_R1_val_1_bismark_bt2_PE_report.txt:Mapping efficiency:       30.7%
CH03_33_S7_R1_val_1_bismark_bt2_PE_report.txt:Mapping efficiency:       50.5%
CH05_01_S8_R1_val_1_bismark_bt2_PE_report.txt:Mapping efficiency:       40.3%
CH05_06_S9_R1_val_1_bismark_bt2_PE_report.txt:Mapping efficiency:       29.6%
CH05_21_S10_R1_val_1_bismark_bt2_PE_report.txt:Mapping efficiency:      50.6%

CpG methylation rate - weird that it's all over the board:

CH01_06_S1_R1_val_1_bismark_bt2_PE_report.txt:C methylated in CpG context:      75.2%
CH01_14_S2_R1_val_1_bismark_bt2_PE_report.txt:C methylated in CpG context:      69.7%
CH01_22_S3_R1_val_1_bismark_bt2_PE_report.txt:C methylated in CpG context:      68.1%
CH01_38_S4_R1_val_1_bismark_bt2_PE_report.txt:C methylated in CpG context:      69.6%
CH03_04_S5_R1_val_1_bismark_bt2_PE_report.txt:C methylated in CpG context:      11.9%
CH03_15_S6_R1_val_1_bismark_bt2_PE_report.txt:C methylated in CpG context:      21.1%
CH03_33_S7_R1_val_1_bismark_bt2_PE_report.txt:C methylated in CpG context:      74.4%
CH05_01_S8_R1_val_1_bismark_bt2_PE_report.txt:C methylated in CpG context:      53.9%
CH05_06_S9_R1_val_1_bismark_bt2_PE_report.txt:C methylated in CpG context:      7.3%
CH05_21_S10_R1_val_1_bismark_bt2_PE_report.txt:C methylated in CpG context:     65.8%

CHG methylation rate is consistent

CH01_06_S1_R1_val_1_bismark_bt2_PE_report.txt:C methylated in CHG context:      1.1%
CH01_14_S2_R1_val_1_bismark_bt2_PE_report.txt:C methylated in CHG context:      1.4%
CH01_22_S3_R1_val_1_bismark_bt2_PE_report.txt:C methylated in CHG context:      1.0%
CH01_38_S4_R1_val_1_bismark_bt2_PE_report.txt:C methylated in CHG context:      1.4%
CH03_04_S5_R1_val_1_bismark_bt2_PE_report.txt:C methylated in CHG context:      1.9%
CH03_15_S6_R1_val_1_bismark_bt2_PE_report.txt:C methylated in CHG context:      1.2%
CH03_33_S7_R1_val_1_bismark_bt2_PE_report.txt:C methylated in CHG context:      1.0%
CH05_01_S8_R1_val_1_bismark_bt2_PE_report.txt:C methylated in CHG context:      1.6%
CH05_06_S9_R1_val_1_bismark_bt2_PE_report.txt:C methylated in CHG context:      1.1%
CH05_21_S10_R1_val_1_bismark_bt2_PE_report.txt:C methylated in CHG context:     1.1%

But CHH methylation varies a ton

CH01_06_S1_R1_val_1_bismark_bt2_PE_report.txt:C methylated in CHH context:      5.0%
CH01_14_S2_R1_val_1_bismark_bt2_PE_report.txt:C methylated in CHH context:      8.5%
CH01_22_S3_R1_val_1_bismark_bt2_PE_report.txt:C methylated in CHH context:      5.8%
CH01_38_S4_R1_val_1_bismark_bt2_PE_report.txt:C methylated in CHH context:      8.8%
CH03_04_S5_R1_val_1_bismark_bt2_PE_report.txt:C methylated in CHH context:      29.5%
CH03_15_S6_R1_val_1_bismark_bt2_PE_report.txt:C methylated in CHH context:      14.1%
CH03_33_S7_R1_val_1_bismark_bt2_PE_report.txt:C methylated in CHH context:      4.2%
CH05_01_S8_R1_val_1_bismark_bt2_PE_report.txt:C methylated in CHH context:      20.3%
CH05_06_S9_R1_val_1_bismark_bt2_PE_report.txt:C methylated in CHH context:      11.8%
CH05_21_S10_R1_val_1_bismark_bt2_PE_report.txt:C methylated in CHH context:     8.9%
kubu4 commented 1 year ago

Not sure how to explain variation, but looking at FastQC (MultiQC) reports, I noticed this:

Screencap of FastQC Per Sequence Base GC Content of crab MBD-seq with blue arrow pointing at unexpected, secondary peak


In other MBD-seq data we've run (e.g. Olympia oysters), we don't see that extra peak (red lines are MBD-seq; grey lines are whole-genome BSseq):

Screencap of FastQC Per Sequence Base GC Content of Olympia oyster MDB-seq data

mgavery commented 1 year ago

I also remember seeing this - maybe not this extreme - in the MiSeq data we were using for QC. There was definitely a correlation between mapping rate and % methylation.

sr320 commented 1 year ago

To confirm these are all MBD and not WGBS? If all MBD we should probably remove some from downstream analysis..

kristamnichols commented 1 year ago

I assume these were only the 20 animals that we did MBD-seq on, with the libraries prepared by Sam, and did not include the single animal that we did WGBS? We should link out here to the preliminary MBD-seq analysis from the MiSeq data to have those records here?

On Wed, Nov 2, 2022 at 5:17 PM Steven Roberts @.***> wrote:

To confirm these are all MBD and not WGBS? If all MBD we should probably remove some from downstream analysis..

— Reply to this email directly, view it on GitHub https://github.com/laurahspencer/DuMOAR/issues/15#issuecomment-1301514328, or unsubscribe https://github.com/notifications/unsubscribe-auth/A3RZJOHY4RMFL5RMQP3OIULWGMABPANCNFSM6AAAAAARTQF5EI . You are receiving this because you commented.Message ID: @.***>

-- ................................................................... Krista M. Nichols, PhD Program Manager, Genetics and Evolution Conservation Biology Division Northwest Fisheries Science Center NOAA, National Marine Fisheries Service 2725 Montlake Blvd E Seattle, WA 98112 206.302.2470 (Google Voice)

kristamnichols commented 1 year ago

What is the extra peak?

kubu4 commented 1 year ago

For reference, MiSeq data looked like this (red are the 13 samples with warnings from FastQC; grey are samples which were considered "passing" by FastQC). The peaks appear to be less prominent and in fewer samples...

Screencap of FastQC Per Sequence GC Content of MiSeq run of M.magister libraries. Red lines represent samples which generated FastQC warnings; grey lines represent those samples which "passed" FastQC

Here's link to MiSeq MultiQC report:

https://gannet.fish.washington.edu/Atumefaciens/20201211_mmag_fastqc_multiqc_mbd-bsseq_miseq/multiqc_report.html

kubu4 commented 1 year ago

What is the extra peak?

No idea.

kubu4 commented 1 year ago

Also, here's another weird thing I noticed. It's minor (I'm assuming), but haven't seen this in samples before. There's a "blip" in all of the samples at the same location in each read. Due to the shorter read length of the MiSeq data, the MiSeq run doesn't show this anomaly.

Screencap of FastQC Per Base Sequence Content. Blue arrow points to strange anomaly occurring in each sample, in the same location.

kubu4 commented 1 year ago

What is the extra peak?

Looking into this a bit more, it seems like it could be satellite DNA.

kubu4 commented 1 year ago

Another thing I noticed (when looking back at the post-trimming FastQC results), is that it looks like the first 10bp didn't end up getting trimmed. Odd.

laurahspencer commented 1 year ago

I assume these were only the 20 animals that we did MBD-seq on, with the libraries prepared by Sam, and did not include the single animal that we did WGBS?

Yes, I believe the data I'm working with is only the 20 MBD-seq libraries, which I grabbed from the Sedna here: /share/nwfsc/knichols/Dungeness/MBDseq_UOdata. That directory contains

I wrote up the source of the data I'm working with here: data/raw-rnaseq

We should link out here to the preliminary MBD-seq analysis from the MiSeq data to have those records here?

Here is a notebook summarizing the MiSeq analysis - MBDBS MiSeq Analysis. Here is the MultiQC report on the trimmed MiSeq data. The Bismark Summary Report from MiSeq aligned to the new Dungeness crab genome shows how some samples had lower coverage and lower CpG methylation - which I look into further in my RMarkdown notebook - check out the correlation plots (specifically the one where I plot "meth ~ total reads that are unique".

laurahspencer commented 1 year ago

Regarding the variable % CpG methylation and alignment rates:

I've quickly compared the samples with very low CpG meth/coverage in the MiSeq analysis to those in this new MBD-BS-Seq run, and they are the same:

% CpG methylated, MiSeq run:

$ grep "C methylated in CpG context:" *PE_report.txt CH01-06_S1_L001_R1_001_val_1_bismark_bt2_PE_report.txt:C methylated in CpG context: 76.2% CH01-14_S2_L001_R1_001_val_1_bismark_bt2_PE_report.txt:C methylated in CpG context: 70.9% CH01-22_S3_L001_R1_001_val_1_bismark_bt2_PE_report.txt:C methylated in CpG context: 69.1% CH01-38_S4_L001_R1_001_val_1_bismark_bt2_PE_report.txt:C methylated in CpG context: 71.3% CH03-04_S5_L001_R1_001_val_1_bismark_bt2_PE_report.txt:C methylated in CpG context: 12.6% CH03-15_S6_L001_R1_001_val_1_bismark_bt2_PE_report.txt:C methylated in CpG context: 21.8% CH03-33_S7_L001_R1_001_val_1_bismark_bt2_PE_report.txt:C methylated in CpG context: 75.5% CH05-01_S8_L001_R1_001_val_1_bismark_bt2_PE_report.txt:C methylated in CpG context: 56.7% CH05-06_S9_L001_R1_001_val_1_bismark_bt2_PE_report.txt:C methylated in CpG context: 7.4% CH05-21_S10_L001_R1_001_val_1_bismark_bt2_PE_report.txt:C methylated in CpG context: 67.5% CH05-24_S11_L001_R1_001_val_1_bismark_bt2_PE_report.txt:C methylated in CpG context: 75.0% CH07-06_S12_L001_R1_001_val_1_bismark_bt2_PE_report.txt:C methylated in CpG context: 67.5% CH07-11_S13_L001_R1_001_val_1_bismark_bt2_PE_report.txt:C methylated in CpG context: 8.4% CH07-24_S14_L001_R1_001_val_1_bismark_bt2_PE_report.txt:C methylated in CpG context: 7.4% CH09-02_S15_L001_R1_001_val_1_bismark_bt2_PE_report.txt:C methylated in CpG context: 71.9% CH09-13_S16_L001_R1_001_val_1_bismark_bt2_PE_report.txt:C methylated in CpG context: 53.0% CH09-28_S17_L001_R1_001_val_1_bismark_bt2_PE_report.txt:C methylated in CpG context: 70.3% CH10-01_S18_L001_R1_001_val_1_bismark_bt2_PE_report.txt:C methylated in CpG context: 73.3% CH10-08_S19_L001_R1_001_val_1_bismark_bt2_PE_report.txt:C methylated in CpG context: 36.0% CH10-11_S20_L001_R1_001_val_1_bismark_bt2_PE_report.txt:C methylated in CpG context: 71.0%

% CpG methylated, MBD-BS-Seq run

(base) [lspencer@sedna aligned]$ grep "C methylated in CpG context:" *.txt CH01_06_S1_R1_val_1_bismark_bt2_PE_report.txt:C methylated in CpG context: 75.2% CH01_14_S2_R1_val_1_bismark_bt2_PE_report.txt:C methylated in CpG context: 69.7% CH01_22_S3_R1_val_1_bismark_bt2_PE_report.txt:C methylated in CpG context: 68.1% CH01_38_S4_R1_val_1_bismark_bt2_PE_report.txt:C methylated in CpG context: 69.6% CH03_04_S5_R1_val_1_bismark_bt2_PE_report.txt:C methylated in CpG context: 11.9% CH03_15_S6_R1_val_1_bismark_bt2_PE_report.txt:C methylated in CpG context: 21.1% CH03_33_S7_R1_val_1_bismark_bt2_PE_report.txt:C methylated in CpG context: 74.4% CH05_01_S8_R1_val_1_bismark_bt2_PE_report.txt:C methylated in CpG context: 53.9% CH05_06_S9_R1_val_1_bismark_bt2_PE_report.txt:C methylated in CpG context: 7.3% CH05_21_S10_R1_val_1_bismark_bt2_PE_report.txt:C methylated in CpG context: 65.8% CH05_24_S11_R1_val_1_bismark_bt2_PE_report.txt:C methylated in CpG context: 73.4% CH07_06_S12_R1_val_1_bismark_bt2_PE_report.txt:C methylated in CpG context: 66.3% CH07_11_S13_R1_val_1_bismark_bt2_PE_report.txt:C methylated in CpG context: 8.1% CH07_24_S14_R1_val_1_bismark_bt2_PE_report.txt:C methylated in CpG context: 7.2% CH09_02_S15_R1_val_1_bismark_bt2_PE_report.txt:C methylated in CpG context: 70.0% CH09_13_S16_R1_val_1_bismark_bt2_PE_report.txt:C methylated in CpG context: 50.9% CH09_28_S17_R1_val_1_bismark_bt2_PE_report.txt:C methylated in CpG context: 69.4% CH10_01_S18_R1_val_1_bismark_bt2_PE_report.txt:C methylated in CpG context: 72.1% CH10_08_S19_R1_val_1_bismark_bt2_PE_report.txt:C methylated in CpG context: 33.3% CH10_11_S20_R1_val_1_bismark_bt2_PE_report.txt:C methylated in CpG context: 69.1%

laurahspencer commented 1 year ago

Another thing I noticed (when looking back at the https://github.com/laurahspencer/DuMOAR/issues/15#issuecomment-1301533710), is that it looks like the first 10bp didn't end up getting trimmed. Odd.

The length distributions peak at 80bp and 130bp, which suggests that we did in fact trim 10bp from each end of the 100bp and 150bp reads.

image

laurahspencer commented 1 year ago

Actually- they peak at 82bp and 131bp!

kubu4 commented 1 year ago

The length distributions peak at 80bp and 130bp, which suggests that we did in fact trim 10bp from each end of the 100bp and 150bp reads.

It's definitely clear that 20bp was trimmed from each read, however, the first 10bp have the same appearance as in the raw data when looking at the Per Base Sequence Content. That's the part I found to be odd - trimming clearly occurred, but doesn't seem like trimming happened at the 5' ends of reads.

laurahspencer commented 1 year ago

MultiQC report summarizing Bismark report with data aligned to Dungeness crab genome (NOTE: probably will re-run this after trimming raw data prior to concatenating).

laurahspencer commented 1 year ago

UPDATE:

I retrimmed the data before concatenating. Check out the MultiQC reports of raw data - 4416/ has 100bp reads, 4417/ has 150bp reads - and of trimmed data 4416/, 4417/.

As Sam pointed out, the first 10bp still look weird after trimming (see screen shots of one file before & after trimming below). I moved forward anyway, but can certainly go back and hard-trim another 10bp from the 5' ends.

Before trimming:

image

After trimming:

image

I aligned & called methylation status using Bismark on the new set of trimmed data. See the Bismark summary report & MultiQC report. Overall the mapping efficiency is not great - averages 44%. Mapping efficiency and %CpG meth. still are odd for the same samples. Check out my R notebook where I look at coverage, methylation rates, and correlations between %meth & other alignment + library prep stats. Let me know if anything jumps out to you. For me, it's that the # uniquely aligned reads correlates with %CpG meth:

image

And it's very weird that the # of unaligned reads due to no genomic sequence correlates really well with % CpG methylation (despite there only being a couple hundred of those sequences).

image

If we filter for 15x coverage and plot the %meth ~ # aligned sequences, the samples segregate well into those with high, medium, and low methylation rates:

image

Is it possible that we had bisulfite conversion issues? The % CHG meth. is consistently low, but the % CHH is variable and very high in some samples. Can someone confirm that the MBD-BS libraries were not lambda-spiked? Am I correct in presuming that we can't use the mitochondrial genome to estimate conversion efficiency, since we used MBD and thus enriched for methylated regions?

I could do that additional trimming to see if it improves alignment rates and coverage of some samples, any other suggestions?

mgavery commented 1 year ago

The libraries were not lambda spiked.

If you want to evaluate bisulfite conversion efficiency, then I think you have to take this data to next step and look at methylation per locus (using methylKit or something similar). Here is what I would be interested in seeing:

  1. Is there an issue with bisulfite conversion? If a given CG is always reporting higher methylation (less conversion) or lower methylation (maybe overconversion?) then you will see that on a per locus basis. If the outlier samples are outliers here, then it might mean bisulfite conversion is an issue.

alternatively, I wonder if..

  1. Is there an issue with MBD enrichment? If there was an issue with MBD enrichment then more of your reads would have unmethylated CGs (overall methylation would be lower), and maybe these unmethylated reads wouldn't map as well because a higher proportion would be intergenic (based on methylation being mostly limited to genes in bivalves). If the outlier samples look similar to other samples on a per-locus basis, then it might point to issues with enrichment.

If it turns out to be option 2, then you wouldn't have to remove samples, I imagine you'd just have fewer CG in common across all samples to evaluate

sr320 commented 1 year ago

@laurahspencer my alignment rates are a bit different

[sr320@mox1 031223-sc-17]$ cat *txt | grep "Mapping"
Mapping efficiency: 26.9% 
Mapping efficiency: 26.1% 
Mapping efficiency: 27.7% 
Mapping efficiency: 26.9% 
Mapping efficiency: 20.4% 
Mapping efficiency: 19.5% 
Mapping efficiency: 26.4% 
Mapping efficiency: 24.2% 
Mapping efficiency: 18.6% 
Mapping efficiency: 27.6% 
Mapping efficiency: 29.0% 
Mapping efficiency: 26.5% 
Mapping efficiency: 18.6% 
Mapping efficiency: 21.3% 
Mapping efficiency: 27.9% 
Mapping efficiency: 29.2% 
Mapping efficiency: 27.1% 
Mapping efficiency: 28.0% 
Mapping efficiency: 23.2% 
Mapping efficiency: 28.3% 

[sr320@mox1 031223-scont-17]$ cat *txt | grep "Mapping"
Mapping efficiency: 28.4% 
Mapping efficiency: 27.4% 
Mapping efficiency: 29.3% 
Mapping efficiency: 27.9% 
Mapping efficiency: 20.3% 
Mapping efficiency: 19.5% 
Mapping efficiency: 27.8% 
Mapping efficiency: 25.2% 
Mapping efficiency: 18.5% 
Mapping efficiency: 28.7% 
Mapping efficiency: 30.4% 
Mapping efficiency: 27.4% 
Mapping efficiency: 18.5% 
Mapping efficiency: 20.9% 
Mapping efficiency: 29.1% 
Mapping efficiency: 29.9% 
Mapping efficiency: 28.5% 
Mapping efficiency: 29.2% 
Mapping efficiency: 24.0% 
Mapping efficiency: 29.2% 

my code

find ${reads_dir}*_R1_001_val_1.fq.gz \
| xargs basename -s _R1_001_val_1.fq.gz | xargs -I{} ${bismark_dir}/bismark \
--path_to_bowtie ${bowtie2_dir} \
-genome ${genome_folder} \
-u 40000 \
-p 40 \
-score_min L,0,-0.6 \
-1 ${reads_dir}{}_R1_001_val_1.fq.gz \
-2 ${reads_dir}{}_R2_001_val_2.fq.gz \

suspect I am doing something wrong but curious what your -score_min

laurahspencer commented 1 year ago

My code, from this script

find ${IN}*_R1.fastq \
| xargs basename -s _R1.fastq | xargs -I{} bismark \
--bowtie2 \
-genome /home/lspencer/references/dungeness/ \
-p 4 \
-score_min L,0,-0.6 \
--non_directional \
-1 ${IN}{}_R1.fastq \
-2 ${IN}{}_R2.fastq \
--output_dir ${OUT}
laurahspencer commented 1 year ago

I aligned to the unfiltered genome version - i.e. not just the 49 large scaffolds, could that have affected things?