nanoporetech / modkit

A bioinformatics tool for working with modified bases
https://nanoporetech.com/
Other
136 stars 7 forks source link

CpG Calling with Reference #196

Closed SophieValk closed 3 months ago

SophieValk commented 3 months ago

Am I understanding it correctly that when you call CpGs using modekit pileup with the -ref flag, only the CpGs that are present in the reference genome will get called? And any site that is a CpG in you .bam files but not in the reference genome will not be included in the modkit pileup .bedMethyl output file?

Thanks in advance, Sophie

ArtRand commented 3 months ago

Hello @SophieValk,

If you use a command such as this

$ modkit pileup ${mod_bam} ${pileup_bedmethyl} --cpg --ref ${ref}

The bedmethyl records will only be for positions that are CG in the reference FASTA file. You are correct that if a read in the mod-BAM has a CpG methylation call that is not in the reference sequence (i.e. not aligned) it will not be present in the output. If you want to look at unaligned CpG calls you can first filter the mod-BAM to only unaligned records:

$ samtools -f 4 ${modbam} > unaligned.bam

then look at the calls with modkit extract

$ modkit extract unaligned.bam raw_probabilities.tsv --read-calls read_calls.tsv

If you want the positions that were CpG in the read but not CpG in the reference (reference CH), you can use dorado with the CpG modification model and then use

$ modkit pileup ${mod_bam} ${pileup_ch_bedmethyl} --motif CH 0 --ref ${ref}

I'm going to add functionality that will allow you to subset the modification calls in a set of reads to specific read sequences in the next release, right now you'll have to use the CpG modification model.

Hope this helps, happy to answer any more questions you have.

SophieValk commented 3 months ago

Thanks @ArtRand for your clear and quick reply. Sophie