nanoporetech / modkit

A bioinformatics tool for working with modified bases
https://nanoporetech.com/
Other
139 stars 8 forks source link

failed to get modbase info ... skipped: AUX data not found #134

Closed vetmohit89 closed 2 months ago

vetmohit89 commented 8 months ago

Dear,

I have collected RNA004 data in fast5 format. Thereafter, I converted individual fast5 to pod5 with --strict argument.

Here is my dorado command:

dorado basecaller \ --min-qscore 10 \ --verbose \ --emit-moves \ -k 14 \ --secondary=no \ --reference ./GRCh38.p13.genome.fa \ sup \ ./sample_pod5/ --modified-bases m6A_DRACH > /sample_pod5/sample_emit_moves_m6a.bam

Followed by samtools command to sort and index it. But when I am running modkit command: modkit pileup ./sorted_sample_emit_moves_m6a.bam pileup.bed --cpg --ref ./GRCh38.p13.genome.fa. It is giving error: failed to get modbase info

ArtRand commented 8 months ago

@vetmohit89 Could you tell me the version of modkit and dorado you're using? It would be good to check that the modified base tags are present in the BAM (just as a sanity check). Maybe take a quick look at the BAM on a genome viewer? Also for your modkit command, you shouldn't use --cpg since that will narrow to CpG motifs. You've run modification calling at DRACH motifs, so I would use --motif DRACH 2 instead. Lastly, could you attach a few lines of the debug log file with the whole error?

vetmohit89 commented 8 months ago

modkit --version mod_kit 0.2.5

dorado --version 0.5.0+0d932c0

I am not sure if this is right way to check MD tag using IGV. Here is snapshot of one reads from bam file generated using image

ArtRand commented 8 months ago

Hello @vetmohit89, It looks to me like your reads are missing modified base information. There should be MM, ML, and optionally MN tags present. For example: image

I'd recommend re-basecalling a small subset of reads with the latest dorado build and re-checking with modkit summary and inspecting the reads on IGV. Coloring by base modification also helps. Let me know how it goes.

vetmohit89 commented 8 months ago

Hello @ArtRand,

I updated dorado version:

dorado -v 0.5.3+d9af343

modkit --version mod_kit 0.2.5

but still getting similar error upon checking with modkit summary.

failed to get modbase info for record cc949777-e06c-491a-b8de-d355d4dc5380, Skipped: AUX data not found Here is IGV snapshot image..

Is it something to do with converting fast5 to pod5? I have originally collected RNA004 data in fast5 format and then converted to pod5.

ArtRand commented 8 months ago

Hello @vetmohit89,

I tried to reproduce the problem with some internal data.

Using dorado-0.5.3+d9af343.

${dorado} basecaller rna004_130bps_sup@v3.0.1 ${pod5} \
    --modified-bases m6A_DRACH \
    --max-reads 10000 \
    --reference ${ref} > ${outdir}/basecalls_m6A.bam

When plotting on IGV I can see modified bases and the read information has the tags: image

Try the command I have above? Could you also tell me the compute infrastructure you're on? It might help diagnose the problem.

vetmohit89 commented 8 months ago

I am keep getting similar error. I have generated a small test dataset and saved in the box. Please help me fixing this issue.

https://uab.box.com/s/g3g3nlg53jko2xqtwv89rzrnkkh2v6x2

Thank you Mohit

vetmohit89 commented 8 months ago

I am using samtools version 1.12. Could this be affecting the MD tags? Thank you Mohit

ArtRand commented 8 months ago

Hello @vetmohit89,

I basecalled the pod5s you put on BOX with the following command:

$ dorado basecaller ${basecaller} ${pod5} \
    --modified-bases m6A_DRACH \
    --reference ${ref} > user_basecalls_m6A.bam

Then I filtered to only primary and mapped alignments:

$ samtools view -bh -F 260 user_basecalls_m6A.bam > user_basecalls_m6A_primary_mapped.bam

Then I used modkit extract to get the base modification probabilities for each read:

$ modkit extract user_basecalls_m6A_primary_mapped.bam extract.tsv --force --log-filepath extract.log --read-calls calls.tsv

The final log line from modkit was:

> processed 16365 reads, 302008 rows, skipped ~13236 reads, failed ~0 reads

Checking the extract:

# discard the header line
$ awk 'NR>1 {print $1}' extract.tsv| uniq | wc -l
# returns:  16365 

Inspecting extact.log there are 37 lines like this:

[src/mod_bam.rs::84][2024-02-23 14:34:53][DEBUG] record 7e614a1f-2563-4da9-8813-062e769aebfd has no base modification information, skipping

So I took a look at those manually:

import pysam

missing = {
    "19cd773a-b046-4968-8d9b-a7b1df273f6f",
    "02932648-c35a-4e08-b647-a5ecd49311bd",
    "0f0366b0-eb45-4a09-9f05-b2dc54951afa",
    "48042651-5d3c-4ec2-bd26-ce077341f555",
    "a887f8a6-651e-41f9-893b-b8fded7a21ef",
    "1047e753-f590-4dee-879b-33bf58fb513d",
    "b7c781e2-d7c3-41df-a8b9-28412fd52fb1",
    "2cf78adc-07c7-42f6-bd0b-f7755a28647a",
    "3e85deda-6f48-437e-ba9b-4518930555f6",
    "4ee05dba-14c2-4688-abaa-f26fd25b5a81",
    "64b240c3-4928-4c31-acee-9ecf3f75bed3",
    "9904a03e-affc-4758-925c-d194db3d366d",
    "2775d283-5106-4487-829f-a840362ebb96",
    "af8e1339-8c9e-4bc2-94a2-d31ff9b0d6cb",
    "43599f1c-7353-4df1-b3dd-4468428b4c6d",
    "cca0a6d9-fc23-4cc2-bfc0-ef7ee6fb08ca",
    "d6620532-4a36-4cc6-8eae-eb88b70b391f",
    "175f735e-253c-44d2-bc13-d60b5ce10c43",
    "991050c3-d49a-479f-9a5c-60e146100a5f",
    "a2e551b3-ae80-4cbe-bf1e-6718091bdb0f",
    "b0e0b9f1-3511-41e7-9884-769bff34ea8b",
    "1c17ea5b-412d-4f28-92eb-e602f8de4b99",
    "326634cd-099a-4de7-a759-4d5f2bca46bc",
    "48cf565b-4df8-437b-8b00-db56fdd8eafa",
    "6d301cda-407d-488f-a778-f0613ec2eb7e",
    "7fe29c81-e22c-4947-bc5f-c9884bfe213b",
    "8e308008-efb8-4a5a-b142-483a66f42c6b",
    "c8cc93e2-8408-4054-82c9-d02fe7c5de2a",
    "ce5d4012-77ef-498e-a111-ec84b617e181",
    "d34e66c8-27cb-4201-aa92-3e145f70b3eb",
    "01ceb673-2c5e-4b60-a602-32ad0703865d",
    "0eef5f88-a67d-4fc1-80f1-79ed6d471c82",
    "25db84f0-d0dc-4024-b3e3-95df592e7f7c",
    "651edd40-4aed-42a3-9dbd-75a5f0577a8e",
    "7e614a1f-2563-4da9-8813-062e769aebfd",
    "bcd82df3-4003-4a90-a689-48766e925af8",
    "fc053432-d32f-4eb2-96f1-580e22356352",
}

with pysam.AlignmentFile("user_basecalls_m6A_primary_mapped.bam", "rb") as bam:
    records = [
        r for r in bam if r.query_name in missing
    ]

for record in records:
    try:
        MM = record.get_tag("MM")
        print(f"{record.query_name}: {MM}")
    except KeyError:
        print(f"{record.query_name}, no MM tag")

The output indeed shows that these records do not have any base modification probabilities:

19cd773a-b046-4968-8d9b-a7b1df273f6f: A+a?;
02932648-c35a-4e08-b647-a5ecd49311bd: A+a?;
0f0366b0-eb45-4a09-9f05-b2dc54951afa: A+a?;
48042651-5d3c-4ec2-bd26-ce077341f555: A+a?;
a887f8a6-651e-41f9-893b-b8fded7a21ef: A+a?;
1047e753-f590-4dee-879b-33bf58fb513d: A+a?;
1047e753-f590-4dee-879b-33bf58fb513d, no MM tag
b7c781e2-d7c3-41df-a8b9-28412fd52fb1: A+a?;
2cf78adc-07c7-42f6-bd0b-f7755a28647a: A+a?;
3e85deda-6f48-437e-ba9b-4518930555f6: A+a?;
4ee05dba-14c2-4688-abaa-f26fd25b5a81: A+a?;
64b240c3-4928-4c31-acee-9ecf3f75bed3: A+a?;
9904a03e-affc-4758-925c-d194db3d366d: A+a?;
2775d283-5106-4487-829f-a840362ebb96: A+a?;
af8e1339-8c9e-4bc2-94a2-d31ff9b0d6cb: A+a?;
43599f1c-7353-4df1-b3dd-4468428b4c6d: A+a?;
cca0a6d9-fc23-4cc2-bfc0-ef7ee6fb08ca: A+a?;
d6620532-4a36-4cc6-8eae-eb88b70b391f: A+a?;
d6620532-4a36-4cc6-8eae-eb88b70b391f, no MM tag
175f735e-253c-44d2-bc13-d60b5ce10c43: A+a?;
991050c3-d49a-479f-9a5c-60e146100a5f: A+a?;
a2e551b3-ae80-4cbe-bf1e-6718091bdb0f: A+a?;
b0e0b9f1-3511-41e7-9884-769bff34ea8b: A+a?;
1c17ea5b-412d-4f28-92eb-e602f8de4b99: A+a?;
326634cd-099a-4de7-a759-4d5f2bca46bc: A+a?;
48cf565b-4df8-437b-8b00-db56fdd8eafa: A+a?;
6d301cda-407d-488f-a778-f0613ec2eb7e: A+a?;
7fe29c81-e22c-4947-bc5f-c9884bfe213b: A+a?;
8e308008-efb8-4a5a-b142-483a66f42c6b: A+a?;
c8cc93e2-8408-4054-82c9-d02fe7c5de2a: A+a?;
ce5d4012-77ef-498e-a111-ec84b617e181: A+a?;
d34e66c8-27cb-4201-aa92-3e145f70b3eb: A+a?;
01ceb673-2c5e-4b60-a602-32ad0703865d: A+a?;
0eef5f88-a67d-4fc1-80f1-79ed6d471c82: A+a?;
25db84f0-d0dc-4024-b3e3-95df592e7f7c: A+a?;
651edd40-4aed-42a3-9dbd-75a5f0577a8e: A+a?;
7e614a1f-2563-4da9-8813-062e769aebfd: A+a?;
bcd82df3-4003-4a90-a689-48766e925af8: A+a?;
fc053432-d32f-4eb2-96f1-580e22356352: A+a?;

Then I used the -Y flag so that non-primary alignments will have valid MM/ML/MN tags:

$ dorado basecaller ${basecaller} ${pod5} \
    --modified-bases m6A_DRACH \
    -Y \
    --reference ${ref} > ${outdir}/user_basecalls_m6A_Y.bam

Followed by modkit extract:

$ modkit extract user_basecalls_m6A_Y.bam extract_Y.tsv --log-filepath extract_Y.log --force --read-calls calls_Y.tsv --allow-non-primary

The final log line shows:

[src/extract/subcommand.rs::907][2024-02-23 14:46:26][INFO] processed 42900 reads, 473826 rows, skipped ~300 reads, failed ~0 reads

Indeed if you count the log lines stating the record is skipped:

$ grep "skipping" extract_Y.log| wc -l
# returns 300

So basically, everything is making sense from my side. Keep in mind that without the -Y flag, non-primary alignments cannot be used by modkit (this is documented). It's possible that the records you're seeing in IGV are not primary and don't have base modification information, dorado will omit the MM/ML/MN flags on non-primary alignments.

I've attached user_basecalls_m6A_primary_mapped.bam to this thread, everything else is too big. You should be able to reproduce them with the inputs in the pod5/ folder you hosted on BOX, and you should be able to reproduce the results I have here with the current release of modkit. Let me know if you have any problems.

user_basecalls_m6A_primary_mapped.bam.zip

ArtRand commented 8 months ago

@vetmohit89 any luck?

vetmohit89 commented 8 months ago

Dear @ArtRand ,

My apologies for late response. But it did not work for me. I followed exact same commands you mentioned except

dorado basecaller ${model} ${pod5} \ you have used dorado basecaller ${basecaller} ${pod5} \

I have saved my bam file in the same box folder.

I am not sure what am I missing in bam file. Can you please try to run modkit from bam file which I generated.

When I used bam file you have generated modkit works perfectly fine.

ArtRand commented 8 months ago

@vetmohit89, the file test_m6a_2.bam certainly does not have any MM/ML/MN tags. Could you make sure that you are specifying the rna004_130bps_sup@v3.0.1 basecalling model? Unfortunately, your problem seems to be upstream of modkit (i.e. generating the modbase tags), maybe open an issue with dorado with the exact command you're using and the hardware you have?

vetmohit89 commented 8 months ago

Hello @ArtRand

I have created issue on dorado page:https://github.com/nanoporetech/dorado/issues/671

Thank you Mohit

ArtRand commented 2 months ago

Look like this was taken care of by the Dorado team.