npScarf (jsa.np.npscarf) is a program that scaffolds and completes draft genome assemblies in real time using Oxford Nanopore sequencing. The pipeline can run on a computing cluster as well as on a laptop computer for microbial datasets. It also facilitates the real-time analysis of positional information such as gene ordering and the detection of genes from mobile elements (plasmids and genomic islands).
Note: npScarf is no longer maintained; npGraph is under development as its replacement.
Dependency: The pipeline requires the following software installed
Quick installation guide:
$ git clone https://github.com/mdcao/japsa
$ cd japsa
$ make install \
[INSTALL_DIR=~/.usr/local \]
[MXMEM=7000m \]
[SERVER=true \]
[JLP=/usr/lib/jni:/usr/lib/R/site-library/rJava/jri]
The npScarf module is bundled with the Japsa package. Details on installing (including on Windows) and using Japsa can be found in its documentation hosted on ReadTheDocs. To run npScarf in real time, npReader and, in particular, the HDF library need to be installed properly. Please refer to the installation instructions in the npReader repository.
This tutorial walks through how to use npScarf to complete a genome assembly of the K. pneumoniae ATCC BAA-2146 (Kpn2146) bacterial strain using Illumina and nanopore sequencing data.
Illumina sequencing data: It is essential that the reads are trimmed to remove all adaptors. Low-quality bases should also be removed. We make available the sequencing data for the Kpn2146 sample, sequenced with Illumina MiSeq and trimmed with Trimmomatic: file1 and file2.
Nanopore sequencing data: The raw data (before base-calling) of Kpn2146 can be obtained from ENA with run accession ERR868296.
Intermediate data are also made available as you walk through the tutorial.
$ spades.py --careful --pe1-1 Kp2146_paired_1.fastq.gz --pe1-2 Kp2146_paired_2.fastq.gz -o spades -t 16
The resulting contigs file of interest is spades/contigs.fasta. The contig list is then sorted with
$ jsa.seq.sort -r -n --input spades/contigs.fasta --output Kp2146_spades.fasta
The assembly of the Illumina data (using SPAdes 3.5) of the Kpn2146 is made available here
$ bwa index Kp2146_spades.fasta
$ bwa mem -t 10 -k11 -W20 -r10 -A1 -B1 -O1 -E1 -L0 -a -Y Kp2146_spades.fasta Kp2146_ONT.fastq | jsa.np.npscarf -input - -format sam -seq Kp2146_spades.fasta -prefix Kp2146-batch
The nanopore sequencing data for the Kpn2146 sample in fastq format is made available here.
$ jsa.np.npreader --realtime --folder Downloads --fail --stat --number --output - \
| bwa mem -t 10 -k11 -W20 -r10 -A1 -B1 -O1 -E1 -L0 -a -Y -K 3000 Kp2146_spades.fasta - \
| jsa.np.npscarf -realtime -input - -format sam -seq Kp2146_spades.fasta -prefix Kp2146-realtime > log.out 2>&1
The processing can be distributed over a network cluster by using the streaming utilities provided in the japsa package. Information can be found here and here, and examples are here.
A summary of npScarf usage can be obtained by invoking the --help option:
jsa.np.npscarf --help
Note: options with a single dash or a double dash (GNU style) are all accepted and equivalent, provided no ambiguity is introduced. For example, one can instead call
jsa.np.npscarf -help
or even
jsa.np.npscarf -h
since help is the only option of this command that starts with h.
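The abbreviation rule can be pictured with a small sketch. This is not Japsa's actual parser, only an illustration of unique-prefix matching under stated assumptions: the function name `resolve_option` is hypothetical, and the option list is limited to the option names mentioned in this document (the authoritative list is whatever `--help` prints for your version).

```shell
# Option names taken from this document only; the real list comes from --help.
OPTIONS="help input format seq prefix realtime read time resistGene oriRep genes spades bwaExe bwaThread"

# Hypothetical helper: resolve an abbreviation to the single option it prefixes.
# Fails (exit status 1) when no option or more than one option matches.
resolve_option() {
    abbrev=$1
    matches=""
    count=0
    for opt in $OPTIONS; do
        case "$opt" in
            "$abbrev"*) matches="$matches $opt"; count=$((count + 1)) ;;
        esac
    done
    if [ "$count" -eq 1 ]; then
        echo "${matches# }"
    else
        echo "ambiguous or unknown abbreviation: $abbrev" >&2
        return 1
    fi
}

resolve_option h            # only "help" starts with h
resolve_option re || true   # realtime, read and resistGene all start with "re"
```

Under this rule an abbreviation is safe only as long as no other option shares it, which is one more reason to check the help output after upgrading.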
WARNING: Always check the help output before running npScarf, since the structure and parameter list of the command can change significantly between versions.
npScarf takes two files as required input:
jsa.np.npscarf -seq <*draft*> -input <*input*> -format sam
<draft> is the FASTA file containing the pre-assemblies. Normally this is the output of running SPAdes on Illumina MiSeq paired-end reads.
<input> contains SAM/BAM-formatted alignments between the <draft> file and the <nanopore> FASTA/FASTQ file of long-read data. We recommend BWA-MEM as the aligner, with the following fixed parameter set:
bwa mem -k11 -W20 -r10 -A1 -B1 -O1 -E1 -L0 -a -Y <*draft*> <*nanopore*> > <*bam*>
Starting from the newest versions of npScarf, BWA-MEM is integrated into the command for convenience. The input file is therefore no longer limited to SAM/BAM: you can also provide long reads in FASTQ/FASTA format together with BWA-MEM arguments. For example, instead of taking SAM/BAM input from BWA-MEM explicitly, as in:
bwa mem -k11 -W20 -r10 -A1 -B1 -O1 -E1 -L0 -a -Y <*draft*> <*nanopore*> \
|jsa.np.npscarf -input - -format sam -seq <*draft*> > log.out 2>&1
you can do:
jsa.np.npscarf -bwaExe=</path/to/BWA> -bwaThread=<#threads> -input <*nanopore*> -format fastq -seq <*draft*> > log.out 2>&1
For that reason, it is important to specify the format of the input file if it is SAM/BAM (the default is FASTA/FASTQ). You do not have to specify the location of the BWA executable if it is already on your PATH environment variable.
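Since the right -format value depends on what the input file is, a wrapper script may want to derive it from the file name. The following is a minimal sketch only: `guess_format` is a hypothetical helper, not part of Japsa, and it emits just the two values that appear in this document (sam and fastq). The final line is a dry run that prints the command rather than executing it.

```shell
# Hypothetical helper: map a file name to a value for npScarf's -format option.
# Only the two values shown in this document (sam, fastq) are produced.
guess_format() {
    case "$1" in
        *.sam)          echo sam ;;
        *.fastq|*.fq)   echo fastq ;;
        *)              echo unknown; return 1 ;;
    esac
}

# Dry run: print the npScarf command instead of running it.
input=Kp2146_ONT.fastq
echo "jsa.np.npscarf -input $input -format $(guess_format "$input") -seq Kp2146_spades.fasta"
```

For other inputs (for example BAM or FASTA files), check the help output for the exact value expected by your version before extending the helper.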
npScarf output is specified by the -prefix option. The default prefix is 'out'. Normally the tool generates two files, prefix.fin.fasta and prefix.fin.japsa, which contain the resulting scaffolds in FASTA and JAPSA format.
In real-time mode, if any annotation analysis is enabled, a file named prefix.anno.japsa is generated instead. This file contains the features detected after scaffolding.
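A quick way to see what scaffolding achieved is to compare the number of sequences in the draft with the number in the final FASTA output. The sketch below is only an illustration: `count_seqs` is a hypothetical helper, and the two demo files are tiny stand-ins created on the spot so that it runs without real data. In practice you would point it at your draft file and at prefix.fin.fasta.

```shell
# Hypothetical helper: count FASTA records by counting header lines (those starting with '>').
count_seqs() {
    grep -c '^>' "$1"
}

# Tiny stand-in files with made-up content, so the sketch runs without real data.
printf '>c1\nACGT\n>c2\nGGCC\n>c3\nTTAA\n' > demo_draft.fasta
printf '>s1\nACGTGGCC\n>s2\nTTAA\n' > demo.fin.fasta

echo "draft contigs:   $(count_seqs demo_draft.fasta)"
echo "final scaffolds: $(count_seqs demo.fin.fasta)"
```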
To run npScarf in streaming mode:
jsa.np.npscarf -realtime [options]
In this mode, the <bam> file is processed block by block. The block size (number of BAM/SAM records) can be controlled through the -read and -time options.
Streaming mode is intended for cases where the input <nanopore> file is retrieved as a stream. npReader is the module that provides such data from fast5 files returned by the real-time base-calling cloud service Metrichor. One can run:
jsa.np.npreader -realtime -folder c:\Downloads\ -fail -output - | \
bwa mem -t 10 -k11 -W20 -r10 -A1 -B1 -O1 -E1 -L0 -a -Y -K 3000 <*draft*> - 2> /dev/null | \
jsa.np.npscarf -realtime -input - -format sam -seq <*draft*> > log.out 2>&1
or if you have the whole set of Nanopore long reads already and want to emulate the streaming mode:
jsa.np.timeEmulate -s 100 -i <*nanopore*> -output - | \
bwa mem -t 10 -k11 -W20 -r10 -A1 -B1 -O1 -E1 -L0 -a -Y -K 3000 <*draft*> - 2> /dev/null | \
jsa.np.npscarf -realtime -input - -format sam -seq <*draft*> > log.out 2>&1
Note that jsa.np.timeEmulate relies on the timestamp field in the read name line to decide the order of the streamed data. So if your input <nanopore> already contains this field, you have to sort it by timestamp first:
jsa.seq.sort -i <*nanopore*> -o <*nanopore-sorted*> -sortKey=timestamp
or, if your file does not have timestamp data yet, you can add it manually. For example:
cat <*nanopore*> |awk 'BEGIN{time=0.0}NR%4==1{printf "%s timestamp=%.2f\n", $0, time; time++}NR%4!=1{print}' \
> <*nanopore-with-time*>
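To see what the awk command above does, here is a minimal sketch on a two-read file. The read names and sequences are made up for illustration, and the sketch assumes plain four-line FASTQ records (no wrapped sequence lines), which is what the NR%4 test relies on.

```shell
# Build a two-read FASTQ file (hypothetical reads) for illustration.
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' > demo.fastq

# Append an incrementing timestamp field to every header line
# (line 1 of each four-line record); all other lines pass through unchanged.
awk 'BEGIN{time=0.0} NR%4==1{printf "%s timestamp=%.2f\n", $0, time; time++} NR%4!=1{print}' \
    demo.fastq > demo_with_time.fastq

# Show only the header lines to confirm the field was added.
awk 'NR%4==1' demo_with_time.fastq
```

The two header lines printed at the end are `@read1 timestamp=0.00` and `@read2 timestamp=1.00`, so the reads keep their original file order when streamed.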
The tool includes use cases for streaming annotation. One can provide a database of antibiotic resistance genes and/or origins of replication in FASTA format for the analysis of gene ordering and/or plasmid identification, respectively:
jsa.np.timeEmulate -s 100 -i <*nanopore*> -output - | \
bwa mem -t 10 -k11 -W20 -r10 -A1 -B1 -O1 -E1 -L0 -a -Y -K 3000 <*draft*> - 2> /dev/null | \
jsa.np.npscarf -realtime -input - -format sam -seq <*draft*> -resistGene <*resistDB.fasta*> -oriRep <*origDB.fasta*> > log.out 2>&1
Or one can input any annotation in GFF 3.0 format:
jsa.np.npscarf -realtime -input - -format sam -seq <*draft*> -genes <*genesList.GFF*> > log.out 2>&1
npScarf can read the assembly graph information from SPAdes during gap filling to make the results more precise. This function is still under development, and the results might deviate slightly from the stable version in terms of the number of final contigs:
jsa.np.npscarf -input <input> -format <format> -seq <*draft*> -spades <spades output folder> > log.out 2>&1
Please cite npScarf if you find it useful for your research
Cao, M.D., Nguyen, H.S., et al. Scaffolding and Completing Genome Assemblies in Real-time with Nanopore Sequencing. Nature Communications 8, Article number: 14515 (2017). doi:10.1038/ncomms14515.
Data and results from npScarf presented in the paper are made available following this link. The QUAST analyses of results from npScarf and competing methods are also presented for K. pneumoniae ATCC BAA-2146, K. pneumoniae ATCC 13883, [E. coli K12 MG1655](http://data.genomicsresearch.org/Projects/npScarf/results/QUAST/EcK12S/report.html), [S. Typhi H58](http://data.genomicsresearch.org/Projects/npScarf/results/QUAST/StH58/report.html) and [S. cerevisiae W303](http://data.genomicsresearch.org/Projects/npScarf/results/QUAST/W303/report.html).
See Japsa license