nanoporetech / modkit

A bioinformatics tool for working with modified bases
https://nanoporetech.com/
Other
141 stars 8 forks source link

What positions are included in bedMethyl files? Should there be only C in a 5mC_5hmC model bedMethyl file? #238

Closed Adinivich closed 3 months ago

Adinivich commented 3 months ago

Hello all,

We performed Nanopore sequencing on a non-model species of green algae to see the relationship between methylation status and local mutation rate.

Basecalling was done with Dorado v0.7.2 and we used the following models for base modification: dna_r10.4.1_e8.2_400bps_sup@v5.0.0_5mCG_5hmCG@v1 dna_r10.4.1_e8.2_400bps_sup@v5.0.0_6mA@v1 dna_r10.4.1_e8.2_400bps_sup@v5.0.0_5mC_5hmC@v1 dna_r10.4.1_e8.2_400bps_sup@v5.0.0_4mC_5mC@v1

I transformed the output BAM files to bedMethyl files with the default parameters (I keep the models separate): modkit pileup $file $path/${filename%.sorted.bam}.pileup.bed --log-filepath pileup.log

This gave pileup files that seem reasonable at first glance, but there is something that surprises me.

I expected to have in the BED file only positions that correspond to the base whos modification I'm looking for - for example, only positions that are a C for the 4mC_5mC model, and only positions that are an A for the 6mA model, etc. However, when I look at my list of mutations, I find many mutations G > ... in the bedMethyl file from the 5mCG_5hmCG model, and even some T > ... and A > ... mutations. I expected this file to contain only positions that are a C in the genome, not G, T or A. I have the same problem for my other bedMethyl files (models 6mA, 4mC_5mC and 5mC_5hmC).

Can somebody tell me what is happening?

Thank you!

Troubleshooting so far:

ArtRand commented 3 months ago

Hello @Adinivich,

For the 5mCG_5hmCG model you can pass the --cpg flag along with a reference using the --ref option. This will narrow the resulting table to only reference CpG sites. For 6mA, 5mC_5hmC, and 4mC_5mC, you can similarly use --motif C 0 (or --motif A 0), narrow the records to reference cytosines or adenines, respectively.

Adinivich commented 3 months ago

Hello ArtRand,

Awesome, thank you! I will try that :)

Just out of curiosity, how does modkit pileup choose the sites it includes? There is a very large difference in size between bedMethyl files from the same sample but using different models

Adinivich commented 3 months ago

Hello @ArtRand,

I tried a pileup with the --cpg flag and the --motif flag but the results are not much better: I still have positions that are an G and A in the ref in my 5mCG_5hmCG file, and all 4 nucleotides in my 6mA file.

Would you mind taking a look at my commands?

modkit pileup $path_to_bam/Pc-5mCG_5hmCG.merged.sorted.bam $path_to_out/Pc-5mCG_5hmCG.merged.pileup.bed --ref $path_to_ref/prasinodermaref.fasta --cpg --log-filepath pileupCG_merged.log
modkit pileup $path_to_bam/Pc-6mA.merged.sorted.bam $path_to_out/Pc-6mA.merged.pileup.bed --ref $path_to_ref/prasinodermaref.fasta --motif A 0 --log-filepath pileup6mA_merged.log
modkit pileup $path_to_bam/Pc-5mC_5hmC.merged.sorted.bam $path_to_out/Pc-5mC_5hmC.merged.pileup.bed --ref $path_to_ref/prasinodermaref.fasta --motif C 0 --log-filepath pileup5hmC_merged.log
modkit pileup $path_to_bam/Pc-4mC_5mC.merged.sorted.bam $path_to_out/Pc-4mC_5mC.merged.pileup.bed --ref $path_to_ref/prasinodermaref.fasta --motif C 0 --log-filepath pileup4mC_merged.log

Otherwise, do you think the alignment could be wrong?

Thanks in advance!

ArtRand commented 3 months ago

Hello @Adinivich,

That's strange. Could you send me a sample of the CG data (Pc-5mCG_5hmCG.merged.sorted.bam) and the reference (prasinodermaref.fasta) ? Ideally just a sample of reads from a region of the genome where you're observing the error. You can attach it here or email them to me art.rand[at]nanoporetech.com

Adinivich commented 3 months ago

Hello Arthur,

Thanks for the quick answer!

I'm sorry, the BAM file extract is rather large so I will send it to you by WeTransfer in a separate mail (I imagine there are better options but I don't know them).

I extracted a region from Chr3 from Pc-5mCG_5hmCG.merged.sorted.bam. samtools view -b Pc-5mCG_5hmCG.merged.sorted.bam "Chr3:1000000-2000000" > Prasi_Chr3.bam

In this region, I found 4 point mutations, 2 of which are on a C in the ref, 1 on a T and 1 on a G:

Chrom Position Coverage Meth% Ref Chr3 1054810 52 0.00 G Chr3 1494473 43 4.65 T Chr3 1660373 64 0.00 C Chr3 1671377 51 0.00 C

You can see the T is even found to be methylated a bit.

I also extracted all positions on Chr3 in Pc-5mCG_5hmCG.merged.pileup.bed:

grep -w 'Chr3' Pc-5mCG_5hmCG.merged.pileup.bed > Prasi_Chr3.bed

I do not have a huge list of mutations, only 43, but very roughly about 1/3 of mutations is not found in the files I would expect them to be, so I worry that the C model files contain a lot of positions that are not C and that will confuse my data.

My goals with the methylation data are:

So I need to be sure that my files only contain methylation status of a given context, otherwise my average methylation values etc will be off.

Thanks a lot for your help!!!

[Uploading Prasi_Chr3.bed.gz…]() [Uploading prasinodermaref.fasta.txt…]()

Adinivich commented 3 months ago

Hi A,

I looked in detail at the positions where I found single nucleotide mutations in my experiment. Of course it is not only these positions that matter but they represent a list of well spread out positions over the genome for which I already have information on the reference genome etc. I will share my observations with you, maybe you see something that I missed.

I located my mutation positions in the BED files that were created using the --motif or --cpg flag:

modkit pileup $path_to_bam/Pc-5mCG_5hmCG.merged.sorted.bam $path_to_out/Pc-5mCG_5hmCG.merged.pileup.bed --ref $path_to_ref/prasinodermaref.fasta --cpg --log-filepath pileupCG_merged.log
modkit pileup $path_to_bam/Pc-6mA.merged.sorted.bam $path_to_out/Pc-6mA.merged.pileup.bed --ref $path_to_ref/prasinodermaref.fasta --motif A 0 --log-filepath pileup6mA_merged.log
modkit pileup $path_to_bam/Pc-5mC_5hmC.merged.sorted.bam $path_to_out/Pc-5mC_5hmC.merged.pileup.bed --ref $path_to_ref/prasinodermaref.fasta --motif C 0 --log-filepath pileup5hmC_merged.log
modkit pileup $path_to_bam/Pc-4mC_5mC.merged.sorted.bam $path_to_out/Pc-4mC_5mC.merged.pileup.bed --ref $path_to_ref/prasinodermaref.fasta --motif C 0 --log-filepath pileup4mC_merged.log

For each position, I noted among other things the reference base, the models where the position is found, the intervals -1 to +1 and -3 to +3, and the closest position that would be eligible for the model in which the position is found in my eyes. I sent the excel file with the information by mail.

These are my observations for the moment for my list of 43 positions:

NB I did check some of the positions in IGV with the BAM files just to be sure, the aligned reads support the ref base.

Thanks again!

Adinivich commented 3 months ago

Thank you Art for helping me make sense of my results :)

For those who have the same questions as me, my confusion came from the fact that Modkit is 0-based and the other programs I use (like IGV) are 1-based. I looked at the position in column 2 while I should have taken column 3 to locate positions of interest in my bedMethyl file.

Also, the --cpg and --motif flags really helped to get clean results (only A and T in the 5mA model, only C and G in the 5mC models etc.)

Thanks again!