NGSpeciesID is a tool for clustering and consensus forming of long-read amplicon sequencing data (has been used with both PacBio and Oxford Nanopore data). The repository is a modified version of isONclust, where consensus, primer-removal, and polishing feautures have been added.
NGSpeciesID is distributed as a python package supported on Linux / OSX with python v3.6. .
NOTE: If you are experiencing issues (e.g. this one) with the third party tools spoa or medaka in the all-in-one installation instructions below, please install the tools manually with their respective installation instructions here and here.
Conda is the preferred way to install NGSpeciesID.
conda create -n NGSpeciesID python=3.6 pip
conda activate NGSpeciesID
conda install --yes -c conda-forge -c bioconda medaka==0.11.5 openblas==0.3.3 spoa racon minimap2
pip install NGSpeciesID
NGSpeciesID --help
Upon start/login to your server/computer you need to activate the conda environment "NGSpeciesID" to run NGSpeciesID as:
conda activate NGSpeciesID
Activate conda environment
conda activate NGSpeciesID
Make a new directory and navigate to it
mkdir test_ngspeciesID
cd test_ngspeciesID
Download the test fastq file called "sample_h1.fastq" (filesize 390kb)
curl -LO https://raw.githubusercontent.com/ksahlin/NGSpeciesID/master/test/sample_h1.fastq
NGSpeciesID --ont --fastq sample_h1.fastq --outfolder ./sample_h1 --consensus --medaka
NGSpeciesID needs a fastq file generated by an Oxford Nanopore basecaller.
NGSpeciesID --ont --consensus --medaka --fastq [reads.fastq] --outfolder [/path/to/output]
The argument --ont
simply means --k 13 --w 20
. These arguments can be set manually without the --ont
flag. Specify number of cores with --t
.
NGSpeciesID can also run with racon as polisher. For example
NGSpeciesID --ont --consensus --racon --racon_iter 3 --fastq [reads.fastq] --outfolder [/path/to/output]
will polish the consensus sequences with racon three times.
NGSpeciesID employs quality filtering of the reads based on read Phred scores. However, we recommend also removing reads much shorter or longer than the intended target, which often represent chimeras or contaminations. This can be done by specifying the --m (intended target length)
and --s (maximum deviation from target length)
. NGSpeciesID also has the feature of subsampling reads using parameter --sample_size
. Altogether, if we want to filter out reads outside the length interval [700,800] and using a subset of 300 reads (if the dataset consists of more reads) we could run
NGSpeciesID --ont --sample_size 300 --m 750 --s 50 --consensus --medaka --fastq [reads.fastq] --outfolder [/path/to/output]
By default, length filtering and subsampling are not invoked if parameters are not specified.
If customized primers are to be expected in the reads thay can be detected and removed. The primer file is expected to be in fasta format. Here is an example of a primer file:
>MCB869_ONT_R
CGATCAATCCCCTAACAAACTAGG
>MCB398_ONT_F
TACCATGAGGACAAATATCATTCTG
NGSpeciesID searches for primes in a window of Xbp (parameter, default 150bp) at the beginning and end of each consensus.
Trimming of primers is performed after consensus forming and can be invoked as
NGSpeciesID --ont --consensus --medaka --fastq [reads.fastq] --outfolder [/path/to/output] --primer_file [primers.fa]
NGSpeciesID
can also remove universal tails. Trimming of tails is performed after consensus forming and can be invoked as
NGSpeciesID --ont --consensus --medaka --fastq [reads.fastq] --outfolder [/path/to/output] --remove_universal_tails
The two options are mutually exclusive, i.e., only one of them can be run.
The output consists of the polished consensus sequences along with some information about clustering.
final_clusters.tsv
present in the specified output folder.In the cluster TSV-file, the first column is the cluster ID and the second column is the read accession. For example:
0 read_X_acc
0 read_Y_acc
...
n read_Z_acc
if there are n reads there will be n rows. Some reads might be singletons. The rows are ordered with respect to the size of the cluster (largest first).
The bioinformatics workflow below was developed as part of a step-by-step protocol for field-deployable DNA amplicon sequencing with the Oxford Nanopore Technologies MinION. The full protocol manuscript is in submission; a link will be posted here when available. The steps below correspond to step numbers in the protocol.
barcode_generator
. This software uses Python3.python3 barcode_generator_3.4.py none 24 40 8
Here, the parameters are set as:
After lab steps are complete:
These commands use the fast basecalling model from Guppy.
Basecalling for R9.4 flow cell:
guppy_basecaller --input_path minKNOW_input/ --save_path basecalled_fastqs/ -c dna_r9.4.1_450bps_fast.cfg --recursive --disable_pings
Basecalling and filter reads by quality score (here, set to 7):
guppy_basecaller --input_path minKNOW_input/ --save_path basecalled_fastqs/ -c dna_r9.4.1_450bps_fast.cfg --recursive --disable_pings --min_qscore 7
Basecalling for R10.3 flow cell:
guppy_basecaller --input_path minKNOW_input/ --save_path basecalled_fastqs/ -c dna_r10.3_450bps_fast.cfg --recursive --disable_pings
cat *.fastq > sequencing_reads.fastq
NanoPlot --fastq_rich sequencing_reads.fastq -o sequencing_run -p sequencing_run
Example files can be found in:
The example files Supplementary Data 1 can be used for sequencing_reads.fastq
and Supplementary Data 2 can be used for indexes.txt
.
python minibar.py indexes.txt sequencing_reads.fastq -T -F -e 3 -E 11
Here, the edit distance allowed between indexes (-e
) is set to 3 base pairs and the edit distance allowed between primer sequences (-E
) is set to 11 base pairs.
guppy_barcoder -i sequencing_reads.fastq -s demultiplex_folder --trim_barcodes --disable_pings
For a single sample (using example primer file):
NGSpeciesID --ont --consensus --sample_size 500 --m 800 --s 100 --medaka --primer_file primers.txt --fastq barcode0.fastq --outfolder barcode0_consensus
Here, the parameters are set as:
--ont
)--consensus
)--sample_size
) = 500 reads subsampled per sample to analyze --m
) = 800 base pairs--s
) = 100 base pairs--medaka
)--primer_file
is given, NGSpeciesID will check to remove any remaining primer sequence. The example primer file is available in Supplementary Data 3. The primers were developed in Mikkelsen, P.M., Bieler, R., Kappner, I., & Rawlings, T.A. (2006). Phylogeny of Veneroidea (Mollusca: Bivalvia) based on morphology and molecules. Zoological Journal of the Linnean Society, 148(3), 439-521.--fastq
(output from step B5)--outfolder
To run this step on more than one sample, use a bash script with a for loop:
for file in *.fastq; do
bn=`basename $file .fastq`
NGSpeciesID --ont --consensus --sample_size 500 --m 800 --s 100 --medaka --primer_file primers.txt --fastq $file --outfolder ${bn}
done
This loop uses the wildcard *
to indicate you want to analyze all files with the .fastq
extension and assumes the command is run from the directory that contains the read files (if not, be sure to change the file path: path/to/*.fastq
).
This loop code can be entered at a UNIX/Mac terminal (be sure the spacing/indentation is correct) or saved as a script (see consensus.sh
. The script should be run from the terminal and in the directory that contains the read files as:
./consensus.sh
makeblastdb -in database.fasta -dbtype nucl -out database
blastn -db database -query barcode0_consensus.fasta -outfmt 6 -out barcode0_consensus_blast.out
Check the results and refine the search or database as needed to better identify the sequence identity of your samples!
Please cite [1] when using NGSpeciesID.
GPL v3.0, see LICENSE.txt.