This package contains of two parts. The first part is the pipeline for retraining a model to basecall and rectify basecalling errors at telomeres. The second part consists of a pipeline that can be directly applied to basecall and fix basecallinge errors at telomeric regions.
There are a series of four steps to apply the tuned basecalling model to the telomeric nanopore reads. These steps and the corresponding scripts can be found in the following directories.
To apply the bonito basecalling model, you will need the following software in your environment.
To apply the full pipeline in a single step, one can use the command
perl 1_apply_model/fullpipeline.pl <input_fasta> <fast5_directory_of_nanopore_signal_data> <output_label>
A description of these arguments are as follows:
Otherwise, the pipeline can also be applied by following each of the following steps.
This directory contains the tuned basecalling model for bonito. The model can be downloaded from the following path (https://zenodo.org/record/5819148/files/chm13_nanopore_trained_run225.zip?download=1) and unzip into this folder.
This directory contains a set of scripts used to identify the problematic telomeric reads for basecalling. Specifically, the scripts will identify long-reads with a high freqeuncy of telomeric repeats and telomeric repeat artefacts to redo the base calls. A list of readnames corresponding to the candidate reads will then be produced
To apply this step, run the following command:
perl main.pl <input_fasta> <output_label>
This directory contains a set of scripts to extract the fast5 for the required reads. Subsequent to that, we will basecall these reads using the tuned bonito model.
Note that fast5_subset from https://github.com/nanoporetech/ont_fast5_api is required to run this step. Also note that the extraction of the desired reads with fast5_subset can take some time.
To apply this step, run the following command:
perl main.pl <directory_of_all_fast5_files> <file_of_readnames_for_required_reads> <directory_to_output_extracted_fast5> <output_fastagz_file>
This step allows one to replace the reads with the basecalling repeat artefacts with the reads that have been fixed.
To apply this step, run the following command:
perl find_and_replace_fasta.pl <original_full_fasta_file> <rebasecalled_fasta_file>
There are a series of three steps if one would like to retrain the bonito basecalling model for telomeric regions.
This step generates the unmodified data needed for generating new training data. Also, the ground truth sequence is also extracted from the reference genome in this step.
To apply this step, run the following command:
perl main.pl <fast5_of_reads_from_telomeres> <original_bonito_model> <training_data_output_label> <reference_genome_fasta>
This step modifies the training data (chunks.npy) using the ground truth data. This allows the modified training data to now be used by the bonito software to tune the original basecalling model.
To apply this step, run the following command:
perl main.pl <training_data_label>
This step retrains the bonito basecalling model with a low learning rate. A new basecalling model that is optimized for telomeric region will be generated after this step.
To apply this step, run the following command:
perl main.pl <original_basecalling_model> <training_data_directory> <tuned_model_name>
Note that the tuned bonito basecalling model that we have published is compatible with Bonito v0.3.5, but not with higher versions of Bonito (e.g. v0.5 and above). Please check that your are using the correct version of Bonito. I will also be updating the tuned model for later versions of Bonito in the future after I have properly evaluated them.
If you have further queries, please contact me at the following email address.
Kar-Tong Tan (ktan@broadinstitute.org)