Not all genomic position represented

OceaneMion commented 2 months ago

Hi again, I would like to know, in modkit pileup output file, as well as in bedgraph, not all genomic position are represented For example : contig_103 13 14 0.18691589 107 contig_103 19 20 0.08163265 98 contig_103 37 38 0.10377359 106 contig_103 49 50 0.21238938 113

I know that only CpG modification are represented in my case, but what if I want to plot the probability distribution along my contigs, will I need to specify the lenght and add the nucleotide that are not C myself ? How will you plot the probability for a specific contig here ? Do you have some example code ? Thanks in advance for your awnsers

ArtRand commented 2 months ago

Hello @OceaneMion,

What I would do is first generate the bedMethyl with modkit pileup ${modBAM} ${pileup_bed} --motif CG 0 --ref ${ref}. This file has a lot of information in it and I like to use them as another "core dataset" for analysis, like the modBAM, and I'll derive datasets from this one. Next, convert the bedMethyl into a bedGraph for each modification code:

# in this case making the 'm' bedgraph, but you might want to make one for `h` or `a` as well
$ awk -v OFS="\t" '($4=="m"){$1=$1;print $1, $2, $3, $11}' pileup.bed > m_pileup.bedgraph

Then convert the bedgraph to bigWig with kenttools (linux folder here) using bedGraphToBigWig:

$ bedGraphToBigWig m_pileup.bedgraph ${reference_sizes} m_pileup.bw

The m_pileup.bw bigWig file will allow you to visualize exactly what you're looking for in a genome browser such as IGV or JBrowse.

However, if you want to "fill the gaps" in a bedMethyl file and create a bedgraph with all the positions here is a shell script that will do that.

You will need:

${modbam}: path to the modBAM of reads with an associated index
${ref}: reference FASTA file
${ref_sizes}: a tab-separated table of the contig sizes (seebedtools complement` for specifics).

#! /bin/bash

set -eu

${modkit} pileup ${modbam} pileup.bed --motif CG 0 --ref ./CGI_ladder_3.6kb_ref.fa

bedtools complement -i ./pileup.bed -g ${ref_sizes} | awk -v OFS="\t" '{$1=$1; print $1, $2, $3, ".", 0, 0.0}' > pileup_complement.bed

awk -v OFS="\t" '{$1=$1; print $1, $2, $3, $4, $5, $11}' ./pileup.bed > tmp_pileup.bed

cat ./pileup_complement.bed ./tmp_pileup.bed | bedtools sort > pileup_complemented.bed

I'd highly recommend just going the genome browser route.

OceaneMion commented 2 months ago

Hi thanks for your awnser, I wanted a bash script because I will need to correlate the methylation frequency with transposable elements positions. The code is just what I needed thanks ! But instead I would I've prefer to have all the position and not a span for example here is the output I have

contig_1 0 2 . 0 0 contig_1 2 3 h 213 2.82 contig_1 2 3 m 213 93.43 contig_1 3 4 h 198 23.74 contig_1 3 4 m 198 72.22 contig_1 4 5 h 233 3.43 contig_1 4 5 m 233 92.27 contig_1 5 6 h 200 24.50 contig_1 5 6 m 200 73.50 contig_1 6 47 . 0 0

you see that for the first line I have from 0 to 2 but I would have prefered to have 0 to 1 then 1 to 2, do you know how I can achieve that ? Again thanks for you help.

OceaneMion commented 2 months ago

Nevermind I manage to found how to do it ! Again thanks a lot for your help !

ArtRand commented 2 months ago

Awesome. Feel free to re-open this issue if you have any more questions.

nanoporetech / modkit

Not all genomic position represented #170