nanoporetech / modkit

A bioinformatics tool for working with modified bases
https://nanoporetech.com/
Other
143 stars 8 forks source link

modkit dmr pair of pileup bed file subset to 4mC #279

Open SimonChen1997 opened 1 month ago

SimonChen1997 commented 1 month ago

Hi,

I ran the dna_r10.4.1_e8.2_400bps_sup@v5.0.0_4mC_5mC@v2 modified model from Dorado. And I use modkit to generate the bed file including 4mC and 5mC positions using this command:

modkit pileup $minimap2_primary/ecoli_a_${SLURM_ARRAY_TASK_ID}_primary.sorted.bam $modkit_pileup_primary/ecoli_a_${SLURM_ARRAY_TASK_ID}_pileup_C0.bed --motif C 0 --ref $ref

and then I use awk to subset the bed file to only contain 4mC positions:

awk -F"\t" '$4=="21839" {print $0}' $file > $subset_m4C/${file}_5mC.bed

Finally, I use the dmr function to see the differential methylated position (4mC):

modkit dmr pair \
-a $pileup_m4C/ecoli_a_2_4mC.bed.gz \
-a $pileup_m4C/ecoli_a_3_4mC.bed.gz \
-b $pileup_m4C/ecoli_b_2_4mC.bed.gz \
-b $pileup_m4C/ecoli_b_3_4mC.bed.gz \
--ref $ref \
--out-path $modkit_dmr_file_m4C \
--header \
--base C \
-t 4

However, it showed:

finished, processed 0 sites successfully, 0 failed

and the log file showed:

[src/dmr/subcommands.rs::328][2024-10-16 09:55:03][DEBUG] using primary sequence base C
[src/dmr/subcommands.rs::406][2024-10-16 09:55:03][INFO] reading reference FASTA at "/scratch/project/genoepic_rumen/e_coli_reference/ncbi_ref/GCF_000008865.2_ASM886v2_genomic.fna"
[src/dmr/subcommands.rs::421][2024-10-16 09:55:03][INFO] running single-site analysis
[src/dmr/single_site.rs::76][2024-10-16 09:55:03][INFO] using default prior, Beta(α: 0.55, β: 0.55)
[src/dmr/single_site.rs::751][2024-10-16 09:55:03][INFO] estimating max coverages from data
[src/dmr/single_site.rs::795][2024-10-16 09:55:05][INFO] sampled 1073328 a records and 1075105 b records, calculating max coverages for 95th percentile
[src/dmr/single_site.rs::820][2024-10-16 09:55:05][INFO] calculated max coverage for a: 55 and b: 56
[src/dmr/single_site.rs::140][2024-10-16 09:55:05][INFO] running with replicates and matched samples
[src/dmr/single_site.rs::211][2024-10-16 09:55:07][DEBUG] batch failed, failed to aggregate counts, invalid data, valid coverage (21) is not equal to the sum of canonical and modified counts (20), chrom: NC_002695.2 starting at 1105536
[src/dmr/single_site.rs::211][2024-10-16 09:55:07][DEBUG] batch failed, failed to aggregate counts, invalid data, valid coverage (21) is not equal to the sum of canonical and modified counts (19), chrom: NC_002695.2 starting at 806329
[src/dmr/single_site.rs::211][2024-10-16 09:55:07][DEBUG] batch failed, failed to aggregate counts, invalid data, valid coverage (20) is not equal to the sum of canonical and modified counts (17), chrom: NC_002695.2 starting at 251648
[src/dmr/single_site.rs::211][2024-10-16 09:55:07][DEBUG] batch failed, failed to aggregate counts, invalid data, valid coverage (29) is not equal to the sum of canonical and modified counts (1), chrom: NC_002695.2 starting at 976284
[src/dmr/single_site.rs::211][2024-10-16 09:55:08][DEBUG] batch failed, failed to aggregate counts, invalid data, valid coverage (33) is not equal to the sum of canonical and modified counts (32), chrom: NC_002695.2 starting at 514816
[src/dmr/single_site.rs::211][2024-10-16 09:55:08][DEBUG] batch failed, failed to aggregate counts, invalid data, valid coverage (53) is not equal to the sum of canonical and modified counts (52), chrom: NC_002695.2 starting at 186917

Is there any better way to subset the pileup bed file to 4mC and do dmr?

Cheers, Ziming

SimonChen1997 commented 1 month ago

I also looked into the pileup bed file before subseting using this command:

awk -F"\t" '$10 != ($12 + $13) {print $0}' ecoli_a_2_pileup_C0.bed | head

while it showed:

NC_002695.2 1   2   m   29  -   1   2   255,0,0 29  0.00    0   28  1   0   0   0   0
NC_002695.2 23  24  m   31  -   23  24  255,0,0 31  0.00    0   29  2   0   0   0   0
NC_002695.2 25  26  m   32  +   25  26  255,0,0 32  0.00    0   31  1   0   0   0   0
NC_002695.2 31  32  m   31  -   31  32  255,0,0 31  0.00    0   30  1   0   0   0   0
NC_002695.2 33  34  m   31  +   33  34  255,0,0 31  3.23    1   28  2   0   1   0   0
NC_002695.2 33  34  21839   31  +   33  34  255,0,0 31  6.45    2   28  1   0   1   0   0
NC_002695.2 39  40  m   29  -   39  40  255,0,0 29  0.00    0   26  3   0   2   0   0
NC_002695.2 53  54  m   28  -   53  54  255,0,0 28  0.00    0   27  1   0   3   0   0
NC_002695.2 72  73  m   31  +   72  73  255,0,0 31  0.00    0   28  3   0   2   0   0
NC_002695.2 79  80  m   29  -   79  80  255,0,0 29  0.00    0   28  1   1   1   0   0

However, when I used these complete pileup bed files (no subseting), the dmr function worked well without errors.

ArtRand commented 1 month ago

Hello @SimonChen1997,

You cannot filter the bedMethyl as you have done prior to using modkit dmr. You need to have the records for each base modification (4mC and 5mC in this case) in the bedMethyl for the algorithm to work, this is what those log lines are trying to tell you (albeit cryptically).

SimonChen1997 commented 1 month ago

You cannot filter the bedMethyl as you have done prior to using modkit dmr. You need to have the records for each base modification (4mC and 5mC in this case) in the bedMethyl for the algorithm to work, this is what those log lines are trying to tell you (albeit cryptically).

Hi ArtRand,

Thanks for your reply. But is there any way to do only 4mC DMR?

marcus1487 commented 1 month ago

I would recommend using the --ignore m option to pileup in order to remove the 5mC calls from the pileup.

SimonChen1997 commented 3 weeks ago

I would recommend using the --ignore m option to pileup in order to remove the 5mC calls from the pileup.

thanks for reply. --ignore m worked. However, I had another issue. I rebasecalled the m6A, m4C, and m5C at the same time, and then I run these codes:

modkit adjust-mods $minimap2/ecoli_${SLURM_ARRAY_TASK_ID}.sorted.bam stdout --ignore m | modkit adjust-mods stdin $minimap2_m6a/ecoli_${SLURM_ARRAY_TASK_ID}_m6a.bam --ignore 21839

samtools view -bhS $minimap2_m6a/ecoli_${SLURM_ARRAY_TASK_ID}_m6a.bam | \
samtools sort -T $minimap2_m6a/ecoli_${SLURM_ARRAY_TASK_ID}_m6a.sorted -o $minimap2_m6a/ecoli_${SLURM_ARRAY_TASK_ID}_m6a.sorted.bam

samtools index $minimap2_m6a/ecoli_${SLURM_ARRAY_TASK_ID}_m6a.sorted.bam

modkit pileup $minimap2_m6a/ecoli_${SLURM_ARRAY_TASK_ID}_m6a.sorted.bam $modkit_pileup/ecoli_${SLURM_ARRAY_TASK_ID}_pileup_m6a.bed --motif A 0 --ref $ref

awk -F "\t" '{print $4}' ecoli_${SLURM_ARRAY_TASK_ID}_pileup_m6a.bed | sort | uniq

it showed these two values in all files:

a
C

what does the C mean here?

the bed file is like this:

NC_002695.2     124     125     a       21      -       124     125     255,0,0 21      0.00    0       21      0       0       0       0       0
NC_002695.2     125     126     C       1       +       125     126     255,0,0 1       0.00    0       1       0       0       0       22      0
NC_002695.2     125     126     a       22      +       125     126     255,0,0 22      0.00    0       22      0       0       0       1       0
ArtRand commented 2 weeks ago

Hello @SimonChen1997,

Sorry for taking so long to respond to this thread.

The short answer is the C means "any cytosine modification". The reason this shows up is at position 125 you have a single read with an A>C mismatch. The reason this shows up after you've --ignored the m5C and m4C calls is a little more nuanced, it helps to break it down step by step:

modkit adjust-mods $minimap2/ecoli_${SLURM_ARRAY_TASK_ID}.sorted.bam stdout --ignore m

This step removes the m5C probabilities (using the algorithm here) leaving you with unmodified and 4mC probabilities.

modkit adjust-mods stdin $minimap2_m6a/ecoli_${SLURM_ARRAY_TASK_ID}_m6a.bam --ignore 21839

The second step removes the m4C probabilities, leaving you with only unmodified probabilities. All of these probabilities will be 1.0, since all of the probability mass has been moved into the unmodified class.

If you were to run modkit summary on this modBAM you will probably see something like this:

# bases             A,C
# total_reads_used  10042
# count_reads_C     9982
# count_reads_A     10000
# pass_threshold_A  0.6855469
# pass_threshold_C  1
 base  code  pass_count  pass_frac    all_count  all_frac
 C     -     7402043     1            7402043    1
 C     C     0           0            0          0
 A     -     9368729     0.91072136   10038271   0.8788549
 A     a     918423      0.089278646  1383718    0.1211451

What this is saying is you have "any-cytosine modification" information, however all of the calls (pass_frac or all_frac) are unmodified (-).

I think what you want is to fully remove the cytosine modification information from the reads and/or only keep the bedMethyl records for modifications that apply to the reference base (i.e. ignore mismatches). The next version of Modkit will have more "tag manipulation" functionality so you could remove the cytosine modification information if you want. I generally recommend against this, however, since I like to think of modBAMs as an immutable data source and I'd rather filter or select the data I want from them during a transformation instead of copying data around. If you want a bedMethyl that only has records corresponding to reference adenine bases (based on your use of --motif A 0) then I would filter the output of pileup with awk ($4=='a'). Also if you're going to use these data in dmr, that command is smart enough to only use base modifications that modify the primary sequence base that the use specifies - so you don't need to do this step ahead of time.

Also, another option for your dmr work above is to parse the output and look for regions where only the m4C levels change.