Open SimonChen1997 opened 1 month ago
I also looked into the pileup bed file before subsetting using this command:
awk -F"\t" '$10 != ($12 + $13) {print $0}' ecoli_a_2_pileup_C0.bed | head
It showed:
NC_002695.2 1 2 m 29 - 1 2 255,0,0 29 0.00 0 28 1 0 0 0 0
NC_002695.2 23 24 m 31 - 23 24 255,0,0 31 0.00 0 29 2 0 0 0 0
NC_002695.2 25 26 m 32 + 25 26 255,0,0 32 0.00 0 31 1 0 0 0 0
NC_002695.2 31 32 m 31 - 31 32 255,0,0 31 0.00 0 30 1 0 0 0 0
NC_002695.2 33 34 m 31 + 33 34 255,0,0 31 3.23 1 28 2 0 1 0 0
NC_002695.2 33 34 21839 31 + 33 34 255,0,0 31 6.45 2 28 1 0 1 0 0
NC_002695.2 39 40 m 29 - 39 40 255,0,0 29 0.00 0 26 3 0 2 0 0
NC_002695.2 53 54 m 28 - 53 54 255,0,0 28 0.00 0 27 1 0 3 0 0
NC_002695.2 72 73 m 31 + 72 73 255,0,0 31 0.00 0 28 3 0 2 0 0
NC_002695.2 79 80 m 29 - 79 80 255,0,0 29 0.00 0 28 1 1 1 0 0
However, when I used the complete pileup bed files (no subsetting), the dmr function worked well without errors.
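For context on the awk check above: in the bedMethyl column layout described in the Modkit documentation, column 10 is the valid coverage, column 12 the count for this modification, column 13 the canonical count, and column 14 the count for other modifications, with valid coverage being the sum of the last three. Rows where $10 != $12 + $13 are therefore positions where at least one read carries the other modification. A minimal sketch of that consistency check, assuming this layout; the function name check_valid_cov is hypothetical, and awk's default whitespace splitting is used so it works on tab- or space-delimited rows:

```shell
# Hypothetical helper: report rows whose valid coverage (col 10) differs from
# N_mod (col 12) + N_canonical (col 13) + N_other_mod (col 14).
# Assumes the modkit bedMethyl column layout; no output means consistent rows.
check_valid_cov() {
    awk '$10 != ($12 + $13 + $14) { print "inconsistent:", $0 }' "$1"
}

# Two rows copied from the output above (position 33, codes m and 21839).
demo_bed=$(mktemp)
cat > "$demo_bed" <<'EOF'
NC_002695.2 33 34 m 31 + 33 34 255,0,0 31 3.23 1 28 2 0 1 0 0
NC_002695.2 33 34 21839 31 + 33 34 255,0,0 31 6.45 2 28 1 0 1 0 0
EOF
check_valid_cov "$demo_bed"
rm -f "$demo_bed"
```

Both sample rows satisfy the identity (1 + 28 + 2 and 2 + 28 + 1 both equal 31), so the check prints nothing for them.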
Hello @SimonChen1997,
You cannot filter the bedMethyl as you have done prior to using modkit dmr. The algorithm needs the records for each base modification (4mC and 5mC in this case) to be present in the bedMethyl; this is what those log lines are trying to tell you (albeit cryptically).
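One quick way to see whether a bedMethyl still contains records for every modification code is to count the records per code in column 4. This is only a sketch: the function name count_mod_codes is hypothetical, and it assumes the code sits in column 4 as in the rows shown earlier in the thread.

```shell
# Hypothetical helper: count bedMethyl records per modification code (col 4).
count_mod_codes() {
    awk '{ n[$4]++ } END { for (c in n) print c, n[c] }' "$1" | LC_ALL=C sort
}

# Three rows copied from the pileup output earlier in the thread.
demo_bed=$(mktemp)
cat > "$demo_bed" <<'EOF'
NC_002695.2 31 32 m 31 - 31 32 255,0,0 31 0.00 0 30 1 0 0 0 0
NC_002695.2 33 34 m 31 + 33 34 255,0,0 31 3.23 1 28 2 0 1 0 0
NC_002695.2 33 34 21839 31 + 33 34 255,0,0 31 6.45 2 28 1 0 1 0 0
EOF
count_mod_codes "$demo_bed"
rm -f "$demo_bed"
```

An unfiltered pileup from a 4mC_5mC model should list both m and 21839; a file that has been subset to a single code would list only one.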
Hi ArtRand,
Thanks for your reply. But is there any way to do only 4mC DMR?
I would recommend using the --ignore m option to pileup in order to remove the 5mC calls from the pileup.
Thanks for the reply. --ignore m worked. However, I ran into another issue. I re-basecalled m6A, m4C, and m5C at the same time, and then ran these commands:
modkit adjust-mods $minimap2/ecoli_${SLURM_ARRAY_TASK_ID}.sorted.bam stdout --ignore m | modkit adjust-mods stdin $minimap2_m6a/ecoli_${SLURM_ARRAY_TASK_ID}_m6a.bam --ignore 21839
samtools view -bhS $minimap2_m6a/ecoli_${SLURM_ARRAY_TASK_ID}_m6a.bam | \
samtools sort -T $minimap2_m6a/ecoli_${SLURM_ARRAY_TASK_ID}_m6a.sorted -o $minimap2_m6a/ecoli_${SLURM_ARRAY_TASK_ID}_m6a.sorted.bam
samtools index $minimap2_m6a/ecoli_${SLURM_ARRAY_TASK_ID}_m6a.sorted.bam
modkit pileup $minimap2_m6a/ecoli_${SLURM_ARRAY_TASK_ID}_m6a.sorted.bam $modkit_pileup/ecoli_${SLURM_ARRAY_TASK_ID}_pileup_m6a.bed --motif A 0 --ref $ref
awk -F "\t" '{print $4}' ecoli_${SLURM_ARRAY_TASK_ID}_pileup_m6a.bed | sort | uniq
It showed these two values in all files:
a
C
What does the C mean here?
The bed file looks like this:
NC_002695.2 124 125 a 21 - 124 125 255,0,0 21 0.00 0 21 0 0 0 0 0
NC_002695.2 125 126 C 1 + 125 126 255,0,0 1 0.00 0 1 0 0 0 22 0
NC_002695.2 125 126 a 22 + 125 126 255,0,0 22 0.00 0 22 0 0 0 1 0
Hello @SimonChen1997,
Sorry for taking so long to respond to this thread.
The short answer is that C means "any cytosine modification". It shows up because at position 125 you have a single read with an A>C mismatch. The reason it still shows up after you've --ignore'd the m5C and m4C calls is a little more nuanced; it helps to break it down step by step:
modkit adjust-mods $minimap2/ecoli_${SLURM_ARRAY_TASK_ID}.sorted.bam stdout --ignore m
This step removes the m5C probabilities (using the algorithm here) leaving you with unmodified and 4mC probabilities.
modkit adjust-mods stdin $minimap2_m6a/ecoli_${SLURM_ARRAY_TASK_ID}_m6a.bam --ignore 21839
The second step removes the m4C probabilities, leaving you with only unmodified probabilities. All of these probabilities will be 1.0, since all of the probability mass has been moved into the unmodified class.
If you were to run modkit summary on this modBAM, you would probably see something like this:
# bases A,C
# total_reads_used 10042
# count_reads_C 9982
# count_reads_A 10000
# pass_threshold_A 0.6855469
# pass_threshold_C 1
base code pass_count pass_frac all_count all_frac
C - 7402043 1 7402043 1
C C 0 0 0 0
A - 9368729 0.91072136 10038271 0.8788549
A a 918423 0.089278646 1383718 0.1211451
What this is saying is that you have "any-cytosine modification" information, but all of the calls (pass_frac or all_frac) are unmodified (-).
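To spot this situation without reading the table by eye, the summary can be scanned for modification codes that have no passing calls. A rough sketch, assuming the six-column table layout shown above (base, code, pass_count, pass_frac, all_count, all_frac); the function name empty_mod_codes is hypothetical:

```shell
# Hypothetical helper: list (base, code) pairs in `modkit summary` output
# whose pass_count (col 3) is zero. Comment lines, the header row, and the
# unmodified class ("-") are skipped.
empty_mod_codes() {
    awk '!/^#/ && $1 != "base" && $2 != "-" && $3 == 0 { print $1, $2 }' "$1"
}

# The table rows from the summary shown above.
demo_summary=$(mktemp)
cat > "$demo_summary" <<'EOF'
# bases A,C
base code pass_count pass_frac all_count all_frac
C - 7402043 1 7402043 1
C C 0 0 0 0
A - 9368729 0.91072136 10038271 0.8788549
A a 918423 0.089278646 1383718 0.1211451
EOF
empty_mod_codes "$demo_summary"
rm -f "$demo_summary"
```

On the table above this reports only the C / C row, i.e. the cytosine information is present in the reads but carries no modified calls.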
I think what you want is to fully remove the cytosine modification information from the reads and/or only keep the bedMethyl records for modifications that apply to the reference base (i.e. ignore mismatches). The next version of Modkit will have more "tag manipulation" functionality, so you could remove the cytosine modification information if you want. I generally recommend against this, however, since I like to think of modBAMs as an immutable data source, and I'd rather filter or select the data I want from them during a transformation instead of copying data around. If you want a bedMethyl that only has records corresponding to reference adenine bases (based on your use of --motif A 0), then I would filter the output of pileup with awk '$4 == "a"'. Also, if you're going to use these data in dmr, that command is smart enough to only use base modifications that modify the primary sequence base that the user specifies, so you don't need to do this step ahead of time.
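The awk filter suggested above can be wrapped in a small function so the code to keep is a parameter. This is a sketch only; the name keep_mod_code is hypothetical, and awk's default whitespace splitting is assumed to be adequate for the bedMethyl rows.

```shell
# Hypothetical helper: keep only bedMethyl rows whose modification code
# (col 4) equals the requested code, e.g. "a" for 6mA.
keep_mod_code() {
    awk -v code="$1" '$4 == code' "$2"
}

# The three rows from the pileup shown earlier in the thread.
demo_bed=$(mktemp)
cat > "$demo_bed" <<'EOF'
NC_002695.2 124 125 a 21 - 124 125 255,0,0 21 0.00 0 21 0 0 0 0 0
NC_002695.2 125 126 C 1 + 125 126 255,0,0 1 0.00 0 1 0 0 0 22 0
NC_002695.2 125 126 a 22 + 125 126 255,0,0 22 0.00 0 22 0 0 0 1 0
EOF
keep_mod_code a "$demo_bed"
rm -f "$demo_bed"
```

This drops the C row at position 125 and keeps the two a rows.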
Also, another option for your dmr work above is to parse the output and look for regions where only the m4C levels change.
Hi,
I ran the dna_r10.4.1_e8.2_400bps_sup@v5.0.0_4mC_5mC@v2 modified-base model from Dorado. I then used modkit to generate a bed file including 4mC and 5mC positions using this command:
and then I used awk to subset the bed file to contain only 4mC positions:
Finally, I used the dmr function to find differentially methylated positions (4mC):
However, it showed:
and the log file showed:
Is there a better way to subset the pileup bed file to 4mC and run dmr?
Cheers, Ziming