Adding dorado as source in live mode

chinmaysharmacs10 commented 7 months ago

Oxford Nanopore Technologies (ONT) have integrated their new basecaller Dorado into MinKNOW (the controller software on their devices, including MinION). I noticed that we must provide a source (-s) when running the live mode of Sturgeon. However, the only options currently available are Guppy and Megalodon.

To keep the model up to date with recent developments at ONT, it would be great if Dorado were added as a source in the live mode.

Given that Dorado also outputs bam files and the underlying service architecture is very similar to Guppy, the code for adding Dorado as a source should be fairly straightforward.

marcpaga commented 7 months ago

Hi @chinmaysharmacs10,

We depend on modbam2bed to extract methylation calls from bam files. Since that project has been deprecated, we now recommend using modkit, which is also developed by ONT and should be up to date with Dorado.

chinmaysharmacs10 commented 7 months ago

Hi @marcpaga,

Thank you for your reply.

Modkit will convert the bam files we get from Dorado to bed files. But I think bed files don't work in the live mode directly.

I did try using bed files with live mode. However, I suppose that live mode only looks for bam files in the input folder and returns the message "Looking for new bam files, so far found 0" if there is no bam file.

I used the bed example files in the demo folder, which are used for the predict mode example, to run live mode. I issued the following command:

    sturgeon live \
        -i demo/bed \
        -o demo/bed/results_live/ \
        -s guppy \
        --model-files ./sturgeon/include/models/general.zip \
        --probes-file ./venv/lib/python3.9/site-packages/sturgeon/include/static/probes_chm13v2.bed \
        --plot-results

Let me know if you have any inputs on this, or on steps to use bed files in the live mode.

Thanks!

marcpaga commented 7 months ago

I understand your problem now. It's a bit complicated to keep the live feature going forward, since it would depend on modkit being called to process every new bam file.

My recommendation would be that you write yourself a script that checks a folder for bam files and then calls modkit extract (see readme for a bit more detail), then calls sturgeon inputtobed -s modkit, and finally calls sturgeon predict.

We are currently using this approach, and will likely leave the live feature as legacy only for megalodon and guppy.
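
For readers following this thread, the suggested script could be sketched roughly as below. This is a minimal sketch, not the maintainers' own script. The command shapes are copied from this thread (modkit extract, sturgeon inputtobed -s modkit, sturgeon predict with the hyphenated flag spelling used in the live command above), and newer modkit or sturgeon releases may expect different subcommands or flags, so check each tool's help output first. The function names, the folder layout under work_dir and the polling interval are hypothetical. Each bam file is processed on its own here, whereas live mode accumulates calls across files, so merging results is left out.

```python
import subprocess
import time
from pathlib import Path


def build_commands(bam, work_dir, model_file):
    """Build the three commands described in the thread for one bam file."""
    bam, work_dir = Path(bam), Path(work_dir)
    modkit_txt = work_dir / "modkit" / f"{bam.stem}.txt"  # hypothetical layout
    bed_dir = work_dir / "bed" / bam.stem
    out_dir = work_dir / "results" / bam.stem
    return [
        ["modkit", "extract", str(bam), str(modkit_txt)],
        ["sturgeon", "inputtobed", "-i", str(modkit_txt),
         "-o", str(bed_dir), "-s", "modkit"],
        ["sturgeon", "predict", "-i", str(bed_dir), "-o", str(out_dir),
         "--model-files", str(model_file), "--plot-results"],
    ]


def process_once(watch_dir, work_dir, model_file, seen, run=subprocess.run):
    """Process every bam file not yet in `seen`; return the ones handled."""
    new_bams = sorted(p for p in Path(watch_dir).glob("*.bam") if p not in seen)
    for bam in new_bams:
        # modkit needs the parent folder of its output file to exist.
        (Path(work_dir) / "modkit").mkdir(parents=True, exist_ok=True)
        for cmd in build_commands(bam, work_dir, model_file):
            run(cmd, check=True)  # stop on the first failing step
        seen.add(bam)
    return new_bams


def watch(watch_dir, work_dir, model_file, poll_seconds=30):
    """Poll the folder forever. Assumes bam files appear fully written."""
    seen = set()
    while True:
        process_once(watch_dir, work_dir, model_file, seen)
        time.sleep(poll_seconds)
```

Passing `run` as a parameter makes it possible to dry-run the loop with a function that only prints the commands. The bam files must already be aligned, otherwise the resulting bed file will be empty.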

chinmaysharmacs10 commented 7 months ago

Thank you for your suggestions @marcpaga :) This is exactly the approach I was thinking of.

chinmaysharmacs10 commented 4 months ago

Hi @marcpaga,

Hope you are doing well.

I created the script as you suggested and also managed to get some pod5 files generated by the MinION device. These files are from brain tumor samples. However, I am getting an empty bed file after running the inputtobed command.

These are the steps in my pipeline:

1. Generate a bam file from the pod5 files: "dorado basecaller hac,5mCG_5hmCG pod5_folder_path > bam_file_path"
2. Generate a modkit file from the bam file: "modkit extract bam_file_path modkit_txt_file_path"
3. Convert the modkit file to a bed file: "sturgeon inputtobed -i modkit_txt_file_path -o bed_file_directory -s modkit"
4. Finally, run sturgeon predict: "sturgeon predict -i bed_file_directory -o output_directory --model-files model_path --plot-results"

I am unable to understand why I am getting an empty bed file. The pod5 files have been tested for methylation presence by others and have methylated CpG sites.

Maybe I am issuing incorrect commands. I would really appreciate your help in resolving this.

Thanks, Chinmay

marcpaga commented 4 months ago

Hi @chinmaysharmacs10,

From your dorado command my guess is that the data is not mapped. The reads have to be mapped so that we know which CpG sites are which. Check the alignment section https://github.com/nanoporetech/dorado?tab=readme-ov-file#alignment for commands on how to align your data.

Also, very important: align the data to the T2T reference genome for best results: https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/CHM13/assemblies/analysis_set/chm13v2.0.fa.gz

If this does not solve the issue, could you please paste the top 20 rows of the modkit output? That may help me see what the problem is.
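
As an illustration of the alignment step described above, here is a minimal sketch of the two routes the Dorado README describes: basecalling with --reference so the output bam is already aligned, or running dorado aligner on an existing unaligned bam. The helper names and file paths are hypothetical, the model string is the one used earlier in this thread, and the exact options should be checked against the installed Dorado version.

```python
import subprocess


def basecall_with_alignment_cmd(pod5_dir, reference, model="hac,5mCG_5hmCG"):
    """Basecall and align in one step: dorado basecaller ... --reference REF."""
    return ["dorado", "basecaller", model, str(pod5_dir),
            "--reference", str(reference)]


def align_existing_bam_cmd(bam, reference):
    """Align an already basecalled bam: dorado aligner REF READS."""
    return ["dorado", "aligner", str(reference), str(bam)]


def run_to_file(cmd, out_path, run=subprocess.run):
    """Run a command and write its stdout (the bam stream) to out_path."""
    with open(out_path, "wb") as out:
        run(cmd, stdout=out, check=True)


if __name__ == "__main__":
    # Hypothetical paths; the reference is the T2T CHM13v2.0 assembly
    # linked above, downloaded beforehand.
    cmd = basecall_with_alignment_cmd("pod5_folder_path", "chm13v2.0.fa.gz")
    run_to_file(cmd, "aligned.bam")
```

The aligned bam can then be passed to modkit extract in place of the unaligned one.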

chinmaysharmacs10 commented 4 months ago

Thank you for pointing that out Marc. I missed aligning my bam files.

Yes, now with the data aligned to the T2T reference genome, I am able to get good CpG site coverage.

I appreciate your help and will reach out if I have more questions :) Your model has me really excited, and I hope to use it in this end-to-end pipeline.

marcpaga / sturgeon

Adding dorado as source in live mode #11