Closed SophieValk closed 3 months ago
Hello @SophieValk,
If you use a command such as this
$ modkit pileup ${mod_bam} ${pileup_bedmethyl} --cpg --ref ${ref}
The bedmethyl records will only be for positions that are CG
in the reference FASTA file. You are correct that if a read in the mod-BAM has a CpG methylation call that is not in the reference sequence (i.e. not aligned) it will not be present in the output. If you want to look at unaligned CpG calls you can first filter the mod-BAM to only unaligned records:
$ samtools -f 4 ${modbam} > unaligned.bam
then look at the calls with modkit extract
$ modkit extract unaligned.bam raw_probabilities.tsv --read-calls read_calls.tsv
If you want the positions that were CpG in the read but not CpG in the reference (reference CH), you can use dorado
with the CpG modification model and then use
$ modkit pileup ${mod_bam} ${pileup_ch_bedmethyl} --motif CH 0 --ref ${ref}
I'm going to add functionality that will allow you to subset the modification calls in a set of reads to specific read sequences in the next release, right now you'll have to use the CpG modification model.
Hope this helps, happy to answer any more questions you have.
Thanks @ArtRand for your clear and quick reply. Sophie
Am I understanding it correctly that when you call CpGs using modekit pileup with the -ref flag, only the CpGs that are present in the reference genome will get called? And any site that is a CpG in you .bam files but not in the reference genome will not be included in the modkit pileup .bedMethyl output file?
Thanks in advance, Sophie