Closed vetmohit89 closed 2 months ago
@vetmohit89 Could you tell me the version of modkit and dorado you're using? It would be good to check that the modified base tags are present in the BAM (just as a sanity check). Maybe take a quick look at the BAM on a genome viewer? Also for your modkit
command, you shouldn't use --cpg
since that will narrow to CpG motifs. You've run modification calling at DRACH motifs, so I would use --motif DRACH 2
instead. Lastly, could you attach a few lines of the debug log file with the whole error?
modkit --version mod_kit 0.2.5
dorado --version 0.5.0+0d932c0
I am not sure if this is right way to check MD tag using IGV. Here is snapshot of one reads from bam file generated using
Hello @vetmohit89,
It looks to me like your reads are missing modified base information. There should be MM
, ML
, and optionally MN
tags present. For example:
I'd recommend re-basecalling a small subset of reads with the latest dorado build and re-checking with modkit summary
and inspecting the reads on IGV. Coloring by base modification also helps. Let me know how it goes.
Hello @ArtRand,
I updated dorado version:
dorado -v 0.5.3+d9af343
modkit --version mod_kit 0.2.5
but still getting similar error upon checking with modkit summary.
failed to get modbase info for record cc949777-e06c-491a-b8de-d355d4dc5380, Skipped: AUX data not found Here is IGV snapshot ..
Is it something to do with converting fast5 to pod5? I have originally collected RNA004 data in fast5 format and then converted to pod5.
Hello @vetmohit89,
I tried to reproduce the problem with some internal data.
Using dorado-0.5.3+d9af343.
${dorado} basecaller rna004_130bps_sup@v3.0.1 ${pod5} \
--modified-bases m6A_DRACH \
--max-reads 10000 \
--reference ${ref} > ${outdir}/basecalls_m6A.bam
When plotting on IGV I can see modified bases and the read information has the tags:
Try the command I have above? Could you also tell me the compute infrastructure you're on? It might help diagnose the problem.
I am keep getting similar error. I have generated a small test dataset and saved in the box. Please help me fixing this issue.
https://uab.box.com/s/g3g3nlg53jko2xqtwv89rzrnkkh2v6x2
Thank you Mohit
I am using samtools version 1.12. Could this be affecting the MD tags? Thank you Mohit
Hello @vetmohit89,
I basecalled the pod5s you put on BOX with the following command:
$ dorado basecaller ${basecaller} ${pod5} \
--modified-bases m6A_DRACH \
--reference ${ref} > user_basecalls_m6A.bam
Then I filtered to only primary and mapped alignments:
$ samtools view -bh -F 260 user_basecalls_m6A.bam > user_basecalls_m6A_primary_mapped.bam
Then I used modkit extract
to get the base modification probabilities for each read:
$ modkit extract user_basecalls_m6A_primary_mapped.bam extract.tsv --force --log-filepath extract.log --read-calls calls.tsv
The final log line from modkit was:
> processed 16365 reads, 302008 rows, skipped ~13236 reads, failed ~0 reads
Checking the extract:
# discard the header line
$ awk 'NR>1 {print $1}' extract.tsv| uniq | wc -l
# returns: 16365
Inspecting extact.log there are 37 lines like this:
[src/mod_bam.rs::84][2024-02-23 14:34:53][DEBUG] record 7e614a1f-2563-4da9-8813-062e769aebfd has no base modification information, skipping
So I took a look at those manually:
import pysam
missing = {
"19cd773a-b046-4968-8d9b-a7b1df273f6f",
"02932648-c35a-4e08-b647-a5ecd49311bd",
"0f0366b0-eb45-4a09-9f05-b2dc54951afa",
"48042651-5d3c-4ec2-bd26-ce077341f555",
"a887f8a6-651e-41f9-893b-b8fded7a21ef",
"1047e753-f590-4dee-879b-33bf58fb513d",
"b7c781e2-d7c3-41df-a8b9-28412fd52fb1",
"2cf78adc-07c7-42f6-bd0b-f7755a28647a",
"3e85deda-6f48-437e-ba9b-4518930555f6",
"4ee05dba-14c2-4688-abaa-f26fd25b5a81",
"64b240c3-4928-4c31-acee-9ecf3f75bed3",
"9904a03e-affc-4758-925c-d194db3d366d",
"2775d283-5106-4487-829f-a840362ebb96",
"af8e1339-8c9e-4bc2-94a2-d31ff9b0d6cb",
"43599f1c-7353-4df1-b3dd-4468428b4c6d",
"cca0a6d9-fc23-4cc2-bfc0-ef7ee6fb08ca",
"d6620532-4a36-4cc6-8eae-eb88b70b391f",
"175f735e-253c-44d2-bc13-d60b5ce10c43",
"991050c3-d49a-479f-9a5c-60e146100a5f",
"a2e551b3-ae80-4cbe-bf1e-6718091bdb0f",
"b0e0b9f1-3511-41e7-9884-769bff34ea8b",
"1c17ea5b-412d-4f28-92eb-e602f8de4b99",
"326634cd-099a-4de7-a759-4d5f2bca46bc",
"48cf565b-4df8-437b-8b00-db56fdd8eafa",
"6d301cda-407d-488f-a778-f0613ec2eb7e",
"7fe29c81-e22c-4947-bc5f-c9884bfe213b",
"8e308008-efb8-4a5a-b142-483a66f42c6b",
"c8cc93e2-8408-4054-82c9-d02fe7c5de2a",
"ce5d4012-77ef-498e-a111-ec84b617e181",
"d34e66c8-27cb-4201-aa92-3e145f70b3eb",
"01ceb673-2c5e-4b60-a602-32ad0703865d",
"0eef5f88-a67d-4fc1-80f1-79ed6d471c82",
"25db84f0-d0dc-4024-b3e3-95df592e7f7c",
"651edd40-4aed-42a3-9dbd-75a5f0577a8e",
"7e614a1f-2563-4da9-8813-062e769aebfd",
"bcd82df3-4003-4a90-a689-48766e925af8",
"fc053432-d32f-4eb2-96f1-580e22356352",
}
with pysam.AlignmentFile("user_basecalls_m6A_primary_mapped.bam", "rb") as bam:
records = [
r for r in bam if r.query_name in missing
]
for record in records:
try:
MM = record.get_tag("MM")
print(f"{record.query_name}: {MM}")
except KeyError:
print(f"{record.query_name}, no MM tag")
The output indeed shows that these records do not have any base modification probabilities:
19cd773a-b046-4968-8d9b-a7b1df273f6f: A+a?;
02932648-c35a-4e08-b647-a5ecd49311bd: A+a?;
0f0366b0-eb45-4a09-9f05-b2dc54951afa: A+a?;
48042651-5d3c-4ec2-bd26-ce077341f555: A+a?;
a887f8a6-651e-41f9-893b-b8fded7a21ef: A+a?;
1047e753-f590-4dee-879b-33bf58fb513d: A+a?;
1047e753-f590-4dee-879b-33bf58fb513d, no MM tag
b7c781e2-d7c3-41df-a8b9-28412fd52fb1: A+a?;
2cf78adc-07c7-42f6-bd0b-f7755a28647a: A+a?;
3e85deda-6f48-437e-ba9b-4518930555f6: A+a?;
4ee05dba-14c2-4688-abaa-f26fd25b5a81: A+a?;
64b240c3-4928-4c31-acee-9ecf3f75bed3: A+a?;
9904a03e-affc-4758-925c-d194db3d366d: A+a?;
2775d283-5106-4487-829f-a840362ebb96: A+a?;
af8e1339-8c9e-4bc2-94a2-d31ff9b0d6cb: A+a?;
43599f1c-7353-4df1-b3dd-4468428b4c6d: A+a?;
cca0a6d9-fc23-4cc2-bfc0-ef7ee6fb08ca: A+a?;
d6620532-4a36-4cc6-8eae-eb88b70b391f: A+a?;
d6620532-4a36-4cc6-8eae-eb88b70b391f, no MM tag
175f735e-253c-44d2-bc13-d60b5ce10c43: A+a?;
991050c3-d49a-479f-9a5c-60e146100a5f: A+a?;
a2e551b3-ae80-4cbe-bf1e-6718091bdb0f: A+a?;
b0e0b9f1-3511-41e7-9884-769bff34ea8b: A+a?;
1c17ea5b-412d-4f28-92eb-e602f8de4b99: A+a?;
326634cd-099a-4de7-a759-4d5f2bca46bc: A+a?;
48cf565b-4df8-437b-8b00-db56fdd8eafa: A+a?;
6d301cda-407d-488f-a778-f0613ec2eb7e: A+a?;
7fe29c81-e22c-4947-bc5f-c9884bfe213b: A+a?;
8e308008-efb8-4a5a-b142-483a66f42c6b: A+a?;
c8cc93e2-8408-4054-82c9-d02fe7c5de2a: A+a?;
ce5d4012-77ef-498e-a111-ec84b617e181: A+a?;
d34e66c8-27cb-4201-aa92-3e145f70b3eb: A+a?;
01ceb673-2c5e-4b60-a602-32ad0703865d: A+a?;
0eef5f88-a67d-4fc1-80f1-79ed6d471c82: A+a?;
25db84f0-d0dc-4024-b3e3-95df592e7f7c: A+a?;
651edd40-4aed-42a3-9dbd-75a5f0577a8e: A+a?;
7e614a1f-2563-4da9-8813-062e769aebfd: A+a?;
bcd82df3-4003-4a90-a689-48766e925af8: A+a?;
fc053432-d32f-4eb2-96f1-580e22356352: A+a?;
Then I used the -Y
flag so that non-primary alignments will have valid MM/ML/MN tags:
$ dorado basecaller ${basecaller} ${pod5} \
--modified-bases m6A_DRACH \
-Y \
--reference ${ref} > ${outdir}/user_basecalls_m6A_Y.bam
Followed by modkit extract
:
$ modkit extract user_basecalls_m6A_Y.bam extract_Y.tsv --log-filepath extract_Y.log --force --read-calls calls_Y.tsv --allow-non-primary
The final log line shows:
[src/extract/subcommand.rs::907][2024-02-23 14:46:26][INFO] processed 42900 reads, 473826 rows, skipped ~300 reads, failed ~0 reads
Indeed if you count the log lines stating the record is skipped:
$ grep "skipping" extract_Y.log| wc -l
# returns 300
So basically, everything is making sense from my side. Keep in mind that without the -Y
flag, non-primary alignments cannot be used by modkit (this is documented). It's possible that the records you're seeing in IGV are not primary and don't have base modification information, dorado
will omit the MM/ML/MN flags on non-primary alignments.
I've attached user_basecalls_m6A_primary_mapped.bam
to this thread, everything else is too big. You should be able to reproduce them with the inputs in the pod5/
folder you hosted on BOX, and you should be able to reproduce the results I have here with the current release of modkit
. Let me know if you have any problems.
@vetmohit89 any luck?
Dear @ArtRand ,
My apologies for late response. But it did not work for me. I followed exact same commands you mentioned except
dorado basecaller ${model} ${pod5} \
you have used dorado basecaller ${basecaller} ${pod5} \
I have saved my bam file in the same box folder.
I am not sure what am I missing in bam file. Can you please try to run modkit from bam file which I generated.
When I used bam file you have generated modkit works perfectly fine.
@vetmohit89, the file test_m6a_2.bam
certainly does not have any MM/ML/MN tags. Could you make sure that you are specifying the rna004_130bps_sup@v3.0.1
basecalling model? Unfortunately, your problem seems to be upstream of modkit
(i.e. generating the modbase tags), maybe open an issue with dorado with the exact command you're using and the hardware you have?
Hello @ArtRand
I have created issue on dorado page:https://github.com/nanoporetech/dorado/issues/671
Thank you Mohit
Look like this was taken care of by the Dorado team.
Dear,
I have collected RNA004 data in fast5 format. Thereafter, I converted individual fast5 to pod5 with --strict argument.
Here is my dorado command:
dorado basecaller \ --min-qscore 10 \ --verbose \ --emit-moves \ -k 14 \ --secondary=no \ --reference ./GRCh38.p13.genome.fa \ sup \ ./sample_pod5/ --modified-bases m6A_DRACH > /sample_pod5/sample_emit_moves_m6a.bam
Followed by samtools command to sort and index it. But when I am running modkit command: modkit pileup ./sorted_sample_emit_moves_m6a.bam pileup.bed --cpg --ref ./GRCh38.p13.genome.fa. It is giving error: failed to get modbase info