Nanopore telomere basecalling

This package consists of two parts. The first is a pipeline for retraining a model to basecall telomeres and rectify basecalling errors there. The second is a pipeline that can be applied directly to basecall telomeric regions and fix basecalling errors in them.

Applying the tuned model to telomeric nanopore reads

Applying the tuned basecalling model to telomeric nanopore reads takes four steps. The steps and their corresponding scripts can be found in the following directories.

Dependencies

To apply the Bonito basecalling model, you will need the following software in your environment.

Python
Perl
fast5_subset (included as part of the ont_fast5_api package at https://github.com/nanoporetech/ont_fast5_api)
Bonito basecaller (the current pipeline was tested on Bonito v0.3.5, which can be obtained from https://github.com/nanoporetech/bonito/releases). Please make sure that Bonito itself is working before applying this pipeline, since Bonito depends on a number of other packages (e.g. CuPy, CUDA). Refer to the Bonito repository for detailed information on the required packages.
Samtools (http://www.htslib.org/)
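
Before running the pipeline, it can help to confirm that the command-line tools listed above are on your PATH. The following is a minimal sketch, assuming the executables are named perl, fast5_subset, bonito and samtools (the usual names, but your installation may differ). It only checks that each program can be found; it does not verify that Bonito's GPU dependencies work.

```python
import shutil

# Assumed executable names for the dependencies listed above.
REQUIRED_TOOLS = ["perl", "fast5_subset", "bonito", "samtools"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required tools were found.")
```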

Applying full pipeline

To apply the full pipeline in a single step, run the following command:

perl 1_apply_model/fullpipeline.pl <input_fasta> <fast5_directory_of_nanopore_signal_data> <output_label>

The arguments are as follows:

- <input_fasta>: Pre-called fasta files that you can generate using either Guppy or the Bonito basecaller. This pipeline only re-basecalls the telomeric reads, so you will still need to generate fasta files from your raw fast5 files using either Guppy or the default Bonito caller.
- <fast5_directory_of_nanopore_signal_data>: The folder where your fast5 files are. The fast5 files required for re-basecalling are extracted from this directory.
- <output_label>: Any name or output path of your choosing.
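
If you prefer to launch the full pipeline from Python, for example to loop over several samples, the command shown above can be wrapped as in the sketch below. The helper names build_command and run_pipeline, and the file names in the example, are hypothetical. The script path is the one given in the command and assumes you run from the repository root.

```python
import subprocess

# Script path taken from the command shown above; run from the repository root.
PIPELINE_SCRIPT = "1_apply_model/fullpipeline.pl"


def build_command(input_fasta, fast5_dir, output_label):
    """Assemble the argument list for the full pipeline."""
    return ["perl", PIPELINE_SCRIPT,
            str(input_fasta), str(fast5_dir), str(output_label)]


def run_pipeline(input_fasta, fast5_dir, output_label):
    """Run the pipeline; raises CalledProcessError if it exits with an error."""
    subprocess.run(build_command(input_fasta, fast5_dir, output_label),
                   check=True)


# Example with hypothetical file names:
# run_pipeline("reads.fasta", "fast5_pass", "telomere_recalled")
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.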

Otherwise, the pipeline can be applied one step at a time, as follows.

1. bonito_basecalling_model

This directory holds the tuned basecalling model for Bonito. Download the model from https://zenodo.org/record/5819148/files/chm13_nanopore_trained_run225.zip?download=1 and unzip it into this folder.
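
The download and unzip step can also be scripted. The following is a minimal sketch using only the Python standard library. The function names are hypothetical, and the destination path in the example is an assumption about where this directory sits in your checkout.

```python
import urllib.request
import zipfile
from pathlib import Path

# Archive location as given in the text above.
MODEL_URL = ("https://zenodo.org/record/5819148/files/"
             "chm13_nanopore_trained_run225.zip?download=1")


def extract_model(zip_path, dest_dir):
    """Unzip the model archive into dest_dir and return the archived names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return archive.namelist()


def download_model(dest_dir, url=MODEL_URL):
    """Download the archive into dest_dir, then unzip it there."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    zip_path = dest / "chm13_nanopore_trained_run225.zip"
    urllib.request.urlretrieve(url, str(zip_path))
    return extract_model(zip_path, dest)


# Example, assuming this is the directory's path in your checkout:
# download_model("bonito_basecalling_model")
```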

2. identify_problematic_reads

This directory contains a set of scripts used to identify the problematic telomeric reads for basecalling. Specifically, the scripts identify long reads with a high frequency of telomeric repeats and telomeric repeat artefacts, which are the reads whose base calls need to be redone. A list of read names corresponding to the candidate reads is then produced.
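
To illustrate the idea of selecting reads by repeat frequency, here is a simplified sketch. It is not the code in this directory. It assumes the canonical human telomeric repeat TTAGGG and its reverse complement CCCTAA as motifs, and it uses a hypothetical threshold of 0.5 on the fraction of the read covered by those motifs. The artefact motifs are not listed in this section, so they are left to be supplied through the motifs parameter.

```python
# Canonical human telomeric repeat and its reverse complement.
TELOMERE_MOTIFS = ("TTAGGG", "CCCTAA")


def read_fasta(lines):
    """Yield (read_name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            fields = line[1:].split()
            name, chunks = (fields[0] if fields else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def repeat_fraction(seq, motifs=TELOMERE_MOTIFS):
    """Approximate fraction of seq covered by the motifs, capped at 1.0."""
    seq = seq.upper()
    if not seq:
        return 0.0
    # str.count counts non-overlapping occurrences of each motif.
    covered = sum(seq.count(motif) * len(motif) for motif in motifs)
    return min(1.0, covered / len(seq))


def candidate_reads(lines, threshold=0.5, motifs=TELOMERE_MOTIFS):
    """Return names of reads whose repeat fraction meets the threshold."""
    return [name for name, seq in read_fasta(lines)
            if repeat_fraction(seq, motifs) >= threshold]
```

With a fasta file on disk, the function can be called as candidate_reads(open("reads.fasta")), where reads.fasta is a placeholder file name. The threshold would need tuning against real data.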