marcpaga / esox

MIT License

8-oxo-dG detection using nanopore sequencing

This repository contains the code for the 8-oxo-dG detection model using nanopore sequencing data. This entails two models: a basecalling model and a modification calling model, which are run one after the other.

For more information, please read our pre-print.

Limitations

It is important to understand the limitations of this model to avoid misinterpreting the results. Please do not ignore this section.

Installation

The 8-oxo-dG calling consists of two steps. Please install the dependencies as follows.

conda create -n esox_env python=3.7
conda activate esox_env
conda install -c bioconda ont-tombo  # this might take a while
pip install -r requirements.txt

For a full list of dependencies, see conda.txt. The dependencies in requirements.txt are installed via pip.

Usage

A small demo dataset is available to test the model. The raw data is in the demo/fast5 folder, and the same data already basecalled with Guppy/Dorado is in the demo/fastq folder. Example outputs are in demo/basecall_out and demo/modcall_out.

Download link

Basecalling

First, we have to basecall the raw data (.fast5 files) using our basecalling model. You will also need the reads already basecalled with Guppy/Dorado (.fastq files). This step generates a .fastq file with the basecalled sequences, as well as a .npz file that is used as input for our second model.

The script expects equally named .fast5 and .fastq files in the input folders; see the demo folder for examples.
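Before a long basecalling run, it can be worth checking that every .fast5 file has an equally named .fastq file. The following is a minimal sketch, not part of the repository: the helper name `find_unpaired` is hypothetical, and it assumes that "equally named" means the same file name apart from the extension.

```python
from pathlib import Path


def find_unpaired(fast5_dir, fastq_dir):
    """Return (stems with no .fastq partner, stems with no .fast5 partner)."""
    # Assumption: files are paired by name without the extension,
    # e.g. reads_0.fast5 <-> reads_0.fastq.
    fast5_stems = {p.stem for p in Path(fast5_dir).glob("*.fast5")}
    fastq_stems = {p.stem for p in Path(fastq_dir).glob("*.fastq")}
    return sorted(fast5_stems - fastq_stems), sorted(fastq_stems - fast5_stems)
```

For the demo data, `find_unpaired("demo/fast5", "demo/fastq")` should return two empty lists if the folders are laid out as described above.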

conda activate esox_env

python3 scripts/basecall.py \
--fast5-path demo/fast5 \
--fastq-path demo/fastq \
--output-path demo/basecall_out \
--model-file static/models/bonito.pt \
--progress-bar \
--device cuda:0 \
--demo

This is the slowest step, and it will be very slow without a GPU. If it is still too slow, consider dividing the input data into smaller chunks and running them in parallel using a pipeline (e.g. Snakemake).
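One possible way to divide the input is sketched below. It is not part of the repository: the function name `make_chunks` and the chunk folder layout are hypothetical, and it assumes that scripts/basecall.py follows symbolic links and accepts the same flags as in the command above (the --progress-bar and --demo flags are left out). The function only prepares the chunk folders and returns one command per chunk; running them in parallel is left to your scheduler or workflow manager.

```python
from pathlib import Path


def make_chunks(fast5_dir, fastq_dir, work_dir, chunk_size):
    """Split paired inputs into chunk folders; return one command per chunk."""
    fast5_dir, fastq_dir, work_dir = Path(fast5_dir), Path(fastq_dir), Path(work_dir)
    # Use only files present in both folders (the script expects equal names).
    stems = sorted(
        {p.stem for p in fast5_dir.glob("*.fast5")}
        & {p.stem for p in fastq_dir.glob("*.fastq")}
    )
    commands = []
    for n, start in enumerate(range(0, len(stems), chunk_size)):
        chunk = work_dir / f"chunk_{n:03d}"  # hypothetical layout
        for sub in ("fast5", "fastq"):
            (chunk / sub).mkdir(parents=True, exist_ok=True)
        for stem in stems[start:start + chunk_size]:
            # Symlink instead of copying the (large) raw files.
            for src_dir, ext in ((fast5_dir, "fast5"), (fastq_dir, "fastq")):
                link = chunk / ext / f"{stem}.{ext}"
                if not (link.is_symlink() or link.exists()):
                    link.symlink_to((src_dir / f"{stem}.{ext}").resolve())
        commands.append([
            "python3", "scripts/basecall.py",
            "--fast5-path", str(chunk / "fast5"),
            "--fastq-path", str(chunk / "fastq"),
            "--output-path", str(chunk / "basecall_out"),
            "--model-file", "static/models/bonito.pt",
            "--device", "cuda:0",
        ])
    return commands
```

Each returned command is a list that can be passed to `subprocess.run`, or turned into one rule invocation per chunk in a workflow manager.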

If you get the following error, see StackOverflow on how to solve it:

ImportError: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.26' not found

Modification calling

After basecalling, we can use the .fastq and .npz files generated in the previous step to evaluate the basecalled Gs and determine whether they are 8-oxo-dG.

conda activate esox_env # if not already activated

python3 scripts/modcall.py \
--input-path demo/basecall_out \
--output-path demo/modcall_out \
--model-file static/models/remora.pt \
--progress-bar \
--device cuda:0

Again, please check static/kmer_performance.txt to decide which threshold to use for each 5-mer, based on its false positive (FP) rate and your expected 8-oxo-dG abundance. In our work we used a threshold of 0.95 for most 5-mers. 5-mers that are not in the training dataset will have 0 FP and 0 FN.
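To see why the FP rate has to be weighed against the expected abundance, a rough estimate of the fraction of calls that are real can help. The sketch below is plain Bayes-style arithmetic, not the authors' procedure: the function name `expected_precision` and its parameters are hypothetical, and the rates have to be taken by hand from static/kmer_performance.txt for the 5-mer and threshold of interest, since the file layout is not described here.

```python
def expected_precision(abundance, fp_rate, sensitivity):
    """Expected fraction of positive calls that are true 8-oxo-dG.

    abundance:   expected fraction of Gs (in this 5-mer) that are 8-oxo-dG
    fp_rate:     fraction of unmodified Gs called as modified at the threshold
    sensitivity: fraction of true 8-oxo-dG called at the threshold (1 - FN rate)
    """
    for name, value in (("abundance", abundance), ("fp_rate", fp_rate),
                        ("sensitivity", sensitivity)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    true_calls = sensitivity * abundance       # true positives per G
    false_calls = fp_rate * (1.0 - abundance)  # false positives per G
    total = true_calls + false_calls
    if total == 0.0:
        return float("nan")  # no calls expected at all
    return true_calls / total
```

With illustrative numbers (not taken from the repository), `expected_precision(1e-4, 1e-3, 0.5)` is below 0.05: when the modification is rare, even a low FP rate means that most calls are false positives, which is the reason to pick a stricter threshold for such 5-mers.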

Why is it called esox?

Most nanopore tools have fish names, and Esox is the genus of the pike fish. The name also ends in "ox", as in oxidation.

Citation

If you use our model, please cite our pre-print:

Marc Pagès-Gallego, Daan M.K. van Soest, Nicolle J.M. Besselink, Roy Straver, Janneke P. Keijer, Carlo Vermeulen, Alessio Marcozzi, Markus J. van Roosmalen, Ruben van Boxtel, Boudewijn M.T. Burgering, Tobias B. Dansen, Jeroen de Ridder
bioRxiv 2024.05.17.594638; doi: https://doi.org/10.1101/2024.05.17.594638