This protocol is used for simultaneous N6-Methyladenosine (m6A) and pseudouridine (psU, Ψ) sites prediction of nanopore direct RNA sequencing data. There is no minimum input reads requirement but a raw dataset of >1M reads is recommended for the following processing for human transcriptome. For a larger transcriptome, more reads are recommended.
This protocol was tested on a cluster of linux system ("midway2", Scientific Linux 7.2).
The version of softwares and packages: Name | Version |
---|---|
Python3 | 3.10.10 |
guppy_basecaller | 3.2.2+9fe0a78 ((C) Oxford Nanopore Technologies, Limited) |
minimap2 | 2.18-r1015 |
numpy | 1.24.3 |
pandas | 1.5.2 |
samtools | 1.11 (Copyright (C) 2020 Genome Research Ltd.) |
scikit-learn | 1.2.0 |
tensorflow | 2.11.1 |
You could download the package to your cluster by the following command.
git clone https://github.com/sihaohuanguc/NanoSPA.git
Then go to the folder with the setup.py
file. And run
pip3 install .
Now you've installed the package. You could use it at any place of your account. If you are not clear about auy command, you could find help by the command
nanospa -h
Please follow the guidance and obey the rules of your cluster, when running the commands.
You are always welcome to use the package or edit it in your work. If you would like to apply this package in your work, please cite the following publication:
(place holder)
You could base call the reads during sequencing. If so, this step is not necessary and go to the next step directly. If the data is not base called, please use the following command to do the base call.
guppy_basecaller --input_path fast5 \
--recursive \
--save_path fastq \
--records_per_fastq 0 \
--flowcell FLO-MIN106 \
--kit SQK-RNA002 \
--qscore_filtering \
--min_qscore 7 \
--cpu_threads_per_caller 3 \
--num_callers 5
"Input_path" is the path of your raw data in fast5 format. "Save_path" is your output fastq folder. "Flowcell" is the type of nanopore flowcell you use. "Kit" is the version of nanopore direct RNA sequencing kit you use. For more kits and flow cells information, please go to the Oxford Nanopre Technology website. Customize "cpu_threads_per_caller" and "num_caller" according to the state of your own cluster. This step is computation intensive.
To align the reads to the reference and pile up the reads, run the following command
nanospa alignment -i path/of/fastq/ -r reference.fa -o your/output/path # output specified
or
nanospa alignment -i path/of/fastq/ -r reference.fa # output not specified
The first argument is the input fastq path. The fastq files must be directly in this folder. The second argument is the genome reference file. You could specific the output path by using -o
tag. If -o
tag is omitted, the output will be a folder named alignment
in your current path. The output is a folder with two subfolders plus_strand
and minus_strand
. The two subfolders contains data of reads aligned to forward and reverse strands respectively. This step is computation intensive, which means it's NOT recommended to run the step on the login node of a cluster or on a local computer.
If you would like to test the package on your device. Please copy the example
folder, which contains 200 reads and their corresponding reference sequence, to a folder you like and go to that folder. Then run the following command. (The process of the testing example is not computation intensive and could be run on any device.)
The output report contains the alignment summary generated by minimap2. If everything is correct, you'll get a folder named alignment
in you current folder and there is a plus_strand
folder in the alignment
folder. Check the folder and you'll see the following files.
$ ls alignment/plus_strand/
collect.bam collect_pile.txt collect.sorted.bam reference.fa.fai
collect.fastq collect.sam reference.fa
Of course, you could use the .bam
file or the _pile.txt
file for analysis by other softwares if you want. Here in this example there is no minus_strand
folder as all the RNA reads are aligned to the forward strand of the reference. If you align your transcriptome sample to the genome reference, then you'll likely to have both plus_strand
and minus_strand
folders.
Due to the design of samtools, in the mpileup files, the spliced reads will be filled a ">" or "<" in the jumped regions and the coverage and the quality score are affected. The data points with ">" or "<" are not real bases. Run one of the following commands to remove the gap sections in the mpileup file. This step is computation intensive.
nanospa remove_intron -i your/input/path
or
nanospa remove_intron
You could choose to specify the input path. If input path is not specified, then the script will view ./alignment
as the input. It'll generate a new file named collect_pile_no_intron.txt
in the plus_strand
and minus_strand
folders in your input path.
For the testing example, you'll find a file called collect_pile_no_intron.txt
in the plus_strand
folder.
Then extract features of all sites. In order to make the prediction more reliable, there is a threshold for a site which is 20 reads. Only sites with >20 reads will be processed for following analysis. This step is computation intensive. Run one of the following commands.
nanospa extract_features -i your/input/path -o your/output/file/name
nanospa extract_features -i your/input/path
nanospa extract_features -o your/output/file/name
nanospa extract_features
You could choose to specify the input path and/or output file name. If input path is not specified, then the script will view ./alignment
as the input. If the output file name is not specified, the output file will be features.csv
in your input or ./alignment
folder, according to your input. This file contains information from reads aligned to both forward and reverse strands.
For the testing example, the features.csv
is supposed to contain 822 lines.
For m6A, an additional step is needed to select the A sites within specific motifs and then extract their features. Run one of the following commands.
nanospa preprocess_m6A -i your/input/file/name -o your/output/path
nanospa preprocess_m6A -i your/input/file/name
nanospa preprocess_m6A -o your/output/path
nanospa preprocess_m6A
You could choose to specify the input file and/or output path name. If input file name is not specified, then the script will view ./alignment/features.csv
as the input. If the output path is not specified, the output folder will be features/
in the same folder as your input file or ./alignment
folder, according to your input. The output file contains information from reads aligned to both forward and reverse strands.
For the testing example, there will be only three files features_AGACT.csv
, features_GGACA.csv
and features_TGACT.csv
in the features
folder, as only these 3 of the 8 motifs appear in the example data. For a regular human transcriptome sample, you are supposed to get 8 files for the features of the 8 motifs.
To predict the psU probability of all the U sites, run one of the following commands.
nanospa prediction_psU -i your/input/file/name -o your/output/file/name
nanospa prediction_psU -i your/input/file/name
nanospa prediction_psU -o your/output/file/name
nanospa prediction_psU
You could choose to specify the input and/or output file name. If input file name is not specified, then the script will view ./alignment/features.csv
as the input. If the output file name is not specified, the output file will be prediction_psU.csv
in the same folder as your input file or ./alignment
folder, according to your input. Each row contains the reference strand, position on the reference strand (sites on (-) strand have its index on (-) strand), base type, coverage and psU probability.
For the testing example, the prediction.csv is supposed to contain 176 lines. The distribution of psU probability of the 176 sites will look like the histogram below.
To predict the m6A probability of all the A sites within GGACU motif, run one of the following commands.
nanospa prediction_m6A -i your/input/path -o your/output/file/name
nanospa prediction_m6A -i your/input/path
nanospa prediction_m6A -o your/output/file/name
nanospa prediction_m6A
You could choose to specify the input path and/or output file name. If input path is not specified, then the script will view ./alignment/features
as the input. If the output file name is not specified, the output file will be prediction_m6A.csv
in the same folder as your input file or ./alignment
folder, according to your input. The results from 8 motifs are combined into one output file. Each row contains the reference strand, position on the reference strand (sites on (-) strand have its index on (-) strand), base type, coverage and m6A probability.
For the testing set, there will be 7 A sites in the prediction_m6A.csv
file.
Chrom | Position | Base | Coverage | m6A prob |
---|---|---|---|---|
example_F | 1388 | A | 98 | 0.010548538 |
example_F | 1276 | A | 80 | 0 |
example_F | 1282 | A | 81 | 0 |
example_F | 1452 | A | 109 | 1.50E-26 |
example_F | 1246 | A | 60 | 2.37E-23 |
example_F | 1537 | A | 138 | 0.027282499 |
example_F | 1806 | A | 181 | 1 |