nanoporetech / modkit

A bioinformatics tool for working with modified bases
https://nanoporetech.com/
Other
139 stars 7 forks source link

bam index problem #81

Closed hyunjokoo closed 11 months ago

hyunjokoo commented 11 months ago

Hello,

Question 1.

I prepared bam file from pod5 file.

This is how I prepared bam file. [hjk@PC24B090:/data_ngs3/hjk/data/nanopore_dorado_test]$ dorado basecaller /data_ngs3/hjk/utils/dorado/dorado-0.4.2-linux-x64/dna_r10.4.1_e8.2_400bps_sup@v4.2.0 nanopore_raw_2023-11-08_sup/ --modified-bases 5mCG_5hmCG > call_sup_5mCG_5hmCG.bam [2023-11-10 11:28:44.891] [info] - downloading dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mCG_5hmCG@v3.1 with httplib [2023-11-10 11:28:51.850] [info] > Creating basecall pipeline [2023-11-10 11:29:11.691] [warning] Maximum safe estimated batch size is only 64. [2023-11-10 11:29:11.693] [warning] Maximum safe estimated batch size is only 64. [2023-11-10 11:29:11.715] [info] - set batch size for cuda:0 to 64 [2023-11-10 11:29:11.725] [info] - set batch size for cuda:1 to 64 [2023-11-10 11:33:59.227] [info] > Simplex reads basecalled: 4003 [2023-11-10 11:33:59.227] [info] > Basecalled @ Samples/s: 1.007219e+06 [2023-11-10 11:38:21.760] [info] > Finished

Then, I tried to use modkit. [hjk@ngs7:/data_ngs3/hjk/data/nanopore_dorado_test]$ /data_ngs2/hjk/utils/modkit/modkit_v0.2.2_centos7_x86_64/modkit pileup call_sup_5mCG_5hmCG.bam call_sup_5mCG_5hmCG.bam_modkit-pileup.bed --log-filepath call_sup_5mCG_5hmCG.bam_modkit-pileup.log

Error! unable to open SAM/BAM/CRAM index for call_sup_5mCG_5hmCG.bam; please create an index

So I indexed bam file, and ran again. [hjk@ngs7:/data_ngs3/hjk/data/nanopore_dorado_test]$ samtools index call_sup_5mCG_5hmCG.bam

[hjk@ngs7:/data_ngs3/hjk/data/nanopore_dorado_test]$ /data_ngs2/hjk/utils/modkit/modkit_v0.2.2_centos7_x86_64/modkit pileup call_sup_5mCG_5hmCG.bam call_sup_5mCG_5hmCG.bam_modkit-pileup.bed --log-filepath call_sup_5mCG_5hmCG.bam_modkit-pileup.log

calculated chunk size: 6, interval size 100000, processing 600000 positions concurrently creating directory at "" attempting to sample 10042 reads Error! zero reads found in bam index

It said that my bam index is wrong. Could you please explain the detail how I should prepare the required input files for modkit?

Question 2

The downloaded modkit executable file name has centos7. Is it ok to run modkit in Ubuntu, too? For now both Centos and Ubuntu server gave me the same error message.

Thank you.

HJK

ArtRand commented 11 months ago

Hello @hyunjokoo,

Is it possible that all of your reads are unmapped? Could you try

samtools view -F 4 call_sup_5mCG_5hmCG.bam | wc -l 

The recommended way to map the reads is with dorado aligner docs. If you don't have a reference sequence to map to, you can use modkit extract docs to inspect the base modification information in your modBAM.

Hope this helps, happy to answer any additional questions.

The downloaded modkit executable file name has centos7. Is it ok to run modkit in Ubuntu, too? For now both Centos and Ubuntu server gave me the same error message.

Yes, the build is for Centos because it uses an older and more portable libc.

hyunjokoo commented 11 months ago

Thank you for your quick reply.

1. As you suggested, I checked it, and there was 0 mapped read.

I used the whole data (about 285 G of pod5 file size) for dorado basecaller to find methylated site, but I could not find any mapped read from the modBAM file.

Is it normal that I cannot find any methylated site? Two tested samples were from insect and plant.

2. I have another question. You mentioned dorado aligner, but I feel that dorado aligner does not help to find methylated site. Is it correct? I think dorado aligner is just like what we can do minimap2 using the fastq sequence converted from pod5 files.

3. Could you please tell me if there is any downloadable example pod5 file (hac or sup) that I can test to detect methylated sites using dorado?

Thank you.

HJK

ArtRand commented 11 months ago

Hello @hyunjokoo,

If you're interested in finding methylated bases in a reference genome, you need to map your reads onto the reference. I recommend using dorado aligner for this task because it will not drop or corrupt the MM and ML BAM tags that contain the modified base information. Once you have this aligned BAM, you can use it with modkit pileup and get a table of how many reads report methylation of each type at each reference position.

If, on the other hand, you don't have a reference genome, you can use modkit extract and modkit summary to see how much methylation is detected in your sample.

You cannot find methylated positions without mapping to a reference genome, however, you can only get read-level information (with extract) or summary statistics.

If this isn't clear, maybe you could you help me understand what you're trying to do.

hyunjokoo commented 11 months ago

Thank you for your explanation. Now I understood that I need to align to a reference, first to do modkit pileup. If you see my first command in my first thread, no reference was used at that time. That's why modkit pileup failed.

Is there any good site that explains MM and ML tags? Or could you please explain them?

Thank you.

ArtRand commented 11 months ago

Hello @hyunjokoo,

There is a specification document here that explains the MM and ML tags.

hyunjokoo commented 11 months ago

Thank you!!

nad302 commented 8 months ago

Hi I am having the exact same error come up. I have basecalled my pod5s using

dorado basecaller hac,5mC_5hmC,6mA ~/chem10/data --kit-name SQK-NBD114-24 >~/chem10/results/test_calls.bam

demultiplexed using

dorado demux --output-dir ~/chem10/results/ --no-classify ~/chem10/results/test_calls.bam

and aligned using:

dorado aligner -Y ~/chem10/references/pDEL152.fa ~/chem10/results/SQK-NBD114-24_barcode01.bam > ~/chem10/results/BC1_aligned.bam

But upon trying to pileup using:

modkit pileup ~/chem10/results/BC1_aligned.bam ~/chem10/results/BC1pileup.bed --log-filepath ~/chem10/results/pileupBC1.log

I get the error

Error! unable to open SAM/BAM/CRAM index for /home/strathsci/chem10/results/BC1_aligned.bam; please create an index

During alignent I can see 557 reads basecalled and only 5 of these are listed as unmapped so I believe there should be an index. The sequencing is of a short 5.7kb plasmid if that makes a difference.