sihaohuanguc / NanoSPA

GNU General Public License v2.0
13 stars 2 forks source link

Description

This protocol is used for simultaneous N6-Methyladenosine (m6A) and pseudouridine (psU, Ψ) sites prediction of nanopore direct RNA sequencing data. There is no minimum input reads requirement but a raw dataset of >1M reads is recommended for the following processing for human transcriptome. For a larger transcriptome, more reads are recommended.

This protocol was tested on a cluster of linux system ("midway2", Scientific Linux 7.2).

Package versions

The version of softwares and packages: Name Version
Python3 3.10.10
guppy_basecaller 3.2.2+9fe0a78 ((C) Oxford Nanopore Technologies, Limited)
minimap2 2.18-r1015
numpy 1.24.3
pandas 1.5.2
samtools 1.11 (Copyright (C) 2020 Genome Research Ltd.)
scikit-learn 1.2.0
tensorflow 2.11.1

Download

You could download the package to your cluster by the following command.

git clone https://github.com/sihaohuanguc/NanoSPA.git

Then go to the folder with the setup.py file. And run

pip3 install .

Now you've installed the package. You could use it at any place of your account. If you are not clear about auy command, you could find help by the command

nanospa -h

Please follow the guidance and obey the rules of your cluster, when running the commands.

Reference and citation

You are always welcome to use the package or edit it in your work. If you would like to apply this package in your work, please cite the following publication:

(place holder)

Protocol

1. Base call

You could base call the reads during sequencing. If so, this step is not necessary and go to the next step directly. If the data is not base called, please use the following command to do the base call.

guppy_basecaller --input_path fast5 \
                 --recursive \
                 --save_path fastq \
                 --records_per_fastq 0 \
                 --flowcell FLO-MIN106 \
                 --kit SQK-RNA002 \
                 --qscore_filtering \
                 --min_qscore 7 \
                 --cpu_threads_per_caller 3 \
                 --num_callers 5

"Input_path" is the path of your raw data in fast5 format. "Save_path" is your output fastq folder. "Flowcell" is the type of nanopore flowcell you use. "Kit" is the version of nanopore direct RNA sequencing kit you use. For more kits and flow cells information, please go to the Oxford Nanopre Technology website. Customize "cpu_threads_per_caller" and "num_caller" according to the state of your own cluster. This step is computation intensive.

2. Alignment and pile up

To align the reads to the reference and pile up the reads, run the following command

nanospa alignment -i path/of/fastq/ -r reference.fa -o your/output/path  # output specified

or

nanospa alignment -i path/of/fastq/ -r reference.fa  # output not specified

The first argument is the input fastq path. The fastq files must be directly in this folder. The second argument is the genome reference file. You could specific the output path by using -o tag. If -o tag is omitted, the output will be a folder named alignment in your current path. The output is a folder with two subfolders plus_strand and minus_strand. The two subfolders contains data of reads aligned to forward and reverse strands respectively. This step is computation intensive, which means it's NOT recommended to run the step on the login node of a cluster or on a local computer.

If you would like to test the package on your device. Please copy the example folder, which contains 200 reads and their corresponding reference sequence, to a folder you like and go to that folder. Then run the following command. (The process of the testing example is not computation intensive and could be run on any device.)

The output report contains the alignment summary generated by minimap2. If everything is correct, you'll get a folder named alignment in you current folder and there is a plus_strand folder in the alignment folder. Check the folder and you'll see the following files.

$ ls alignment/plus_strand/
collect.bam    collect_pile.txt  collect.sorted.bam  reference.fa.fai
collect.fastq  collect.sam       reference.fa

Of course, you could use the .bam file or the _pile.txt file for analysis by other softwares if you want. Here in this example there is no minus_strand folder as all the RNA reads are aligned to the forward strand of the reference. If you align your transcriptome sample to the genome reference, then you'll likely to have both plus_strand and minus_strand folders.

3. Feature extraction

Due to the design of samtools, in the mpileup files, the spliced reads will be filled a ">" or "<" in the jumped regions and the coverage and the quality score are affected. The data points with ">" or "<" are not real bases. Run one of the following commands to remove the gap sections in the mpileup file. This step is computation intensive.

nanospa remove_intron -i your/input/path 

or

nanospa remove_intron

You could choose to specify the input path. If input path is not specified, then the script will view ./alignment as the input. It'll generate a new file named collect_pile_no_intron.txt in the plus_strand and minus_strand folders in your input path.

For the testing example, you'll find a file called collect_pile_no_intron.txt in the plus_strand folder.

Then extract features of all sites. In order to make the prediction more reliable, there is a threshold for a site which is 20 reads. Only sites with >20 reads will be processed for following analysis. This step is computation intensive. Run one of the following commands.

nanospa extract_features -i your/input/path -o your/output/file/name
nanospa extract_features -i your/input/path
nanospa extract_features -o your/output/file/name
nanospa extract_features

You could choose to specify the input path and/or output file name. If input path is not specified, then the script will view ./alignment as the input. If the output file name is not specified, the output file will be features.csv in your input or ./alignment folder, according to your input. This file contains information from reads aligned to both forward and reverse strands.

For the testing example, the features.csv is supposed to contain 822 lines.

For m6A, an additional step is needed to select the A sites within specific motifs and then extract their features. Run one of the following commands.

nanospa preprocess_m6A -i your/input/file/name -o your/output/path
nanospa preprocess_m6A -i your/input/file/name
nanospa preprocess_m6A -o your/output/path
nanospa preprocess_m6A

You could choose to specify the input file and/or output path name. If input file name is not specified, then the script will view ./alignment/features.csv as the input. If the output path is not specified, the output folder will be features/ in the same folder as your input file or ./alignment folder, according to your input. The output file contains information from reads aligned to both forward and reverse strands.

For the testing example, there will be only three files features_AGACT.csv, features_GGACA.csv and features_TGACT.csv in the features folder, as only these 3 of the 8 motifs appear in the example data. For a regular human transcriptome sample, you are supposed to get 8 files for the features of the 8 motifs.

4. psU prediction

To predict the psU probability of all the U sites, run one of the following commands.

nanospa prediction_psU -i your/input/file/name -o your/output/file/name
nanospa prediction_psU -i your/input/file/name
nanospa prediction_psU -o your/output/file/name
nanospa prediction_psU

You could choose to specify the input and/or output file name. If input file name is not specified, then the script will view ./alignment/features.csv as the input. If the output file name is not specified, the output file will be prediction_psU.csv in the same folder as your input file or ./alignment folder, according to your input. Each row contains the reference strand, position on the reference strand (sites on (-) strand have its index on (-) strand), base type, coverage and psU probability.

For the testing example, the prediction.csv is supposed to contain 176 lines. The distribution of psU probability of the 176 sites will look like the histogram below.

Picture1 - hist of test

5. m6A prediction

To predict the m6A probability of all the A sites within GGACU motif, run one of the following commands.

nanospa prediction_m6A -i your/input/path -o your/output/file/name
nanospa prediction_m6A -i your/input/path
nanospa prediction_m6A -o your/output/file/name
nanospa prediction_m6A

You could choose to specify the input path and/or output file name. If input path is not specified, then the script will view ./alignment/features as the input. If the output file name is not specified, the output file will be prediction_m6A.csv in the same folder as your input file or ./alignment folder, according to your input. The results from 8 motifs are combined into one output file. Each row contains the reference strand, position on the reference strand (sites on (-) strand have its index on (-) strand), base type, coverage and m6A probability.

For the testing set, there will be 7 A sites in the prediction_m6A.csv file.

Chrom Position Base Coverage m6A prob
example_F 1388 A 98 0.010548538
example_F 1276 A 80 0
example_F 1282 A 81 0
example_F 1452 A 109 1.50E-26
example_F 1246 A 60 2.37E-23
example_F 1537 A 138 0.027282499
example_F 1806 A 181 1