nanoporetech / pinfish

Tools to annotate genomes using long read transcriptomics data

We have a new bioinformatic resource that largely replaces the functionality of this project! See our new repository here: https://github.com/nanoporetech/pipeline-nanopore-ref-isoforms

The improved spliced_bam2gff tool is released at https://github.com/nanoporetech/spliced_bam2gff

This repository is now unsupported and we do not recommend its use. If you cannot upgrade to our new resources, or if they are missing key features, please contact Oxford Nanopore at support@nanoporetech.com for help with your application.

pinfish

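Pinfish is a collection of tools that help make sense of long-read transcriptomics data (long cDNA reads, direct RNA reads). The toolchain is composed of the following tools, each described under Usage: spliced_bam2gff, cluster_gff, polish_clusters and collapse_partials.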

Pinfish is largely inspired by the Mandalorion pipeline. It is meant to provide a quick way to generate annotations from long reads only; it is not meant to provide the same functionality as pipelines that use a broader annotation strategy (such as LoReAn).

The pinfish tools can be run via a Snakemake pipeline which handles the alignment tasks using minimap2.

Getting Started

Installation

Static Linux binaries for the x86_64 platform are included in the respective subdirectories of the source tree. To install them, copy them to a directory in your PATH.

The polish_clusters tool depends on external software, including minimap2 and racon (arguments can be passed to them via the -x and -y flags).

Dependencies and compiling from source

Compiling the tools from source requires a working Go compiler installation and the dependency packages installed via go get.

After installing the dependencies, run make in the respective subdirectory.

Usage

spliced_bam2gff

Usage of spliced_bam2gff:
  -M    Input is from minimap2.
  -V    Print out version.
  -g    Use strand tag as feature orientation then read strand if not available.
  -h    Print out help message.
  -s    Use read strand (from BAM flag) as feature orientation.
  -t int
        Number of cores to use. (default 4)

By default the tool looks for the XS tag to determine transcript orientation. If the -M flag is specified, the input is assumed to come from minimap2 and the ts tag is used instead (with different rules to determine the final orientation).

If no orientation tag is found, the orientation is set to "." unless the -g flag is provided, in which case the read orientation from the BAM flag is used.

If the -s flag is specified, all the rules above are ignored and the orientation is set to the read strand from the BAM flag (appropriate for stranded protocols).
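
The orientation rules above can be summarised in a few lines of Python. This is a minimal sketch, not the tool's actual implementation: the function name and its arguments are hypothetical, and the handling of the ts tag rests on the assumption that minimap2 reports ts relative to the read, so the value is flipped for reads aligned to the reverse strand.

```python
from typing import Mapping


def feature_orientation(
    tags: Mapping[str, str],
    is_reverse: bool,
    *,
    minimap2: bool = False,       # corresponds to -M
    read_fallback: bool = False,  # corresponds to -g
    stranded: bool = False,       # corresponds to -s
) -> str:
    """Return '+', '-' or '.' following the rules described above."""
    read_strand = "-" if is_reverse else "+"

    # -s overrides everything: use the read strand from the BAM flag.
    if stranded:
        return read_strand

    # XS by default, ts when the input is declared to be from minimap2.
    value = tags.get("ts" if minimap2 else "XS")
    if value in ("+", "-"):
        if minimap2 and is_reverse:
            # Assumption: ts is relative to the read, so flip it for
            # reverse-strand alignments to get the genomic orientation.
            return "+" if value == "-" else "-"
        return value

    # No usable tag: "." unless -g asks for the read strand instead.
    return read_strand if read_fallback else "."


print(feature_orientation({"XS": "-"}, is_reverse=False))
print(feature_orientation({}, is_reverse=True, read_fallback=True))
```

Keeping the three flags as separate keyword arguments mirrors the command line, which makes it easy to check which rule decided the orientation of a given read.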

Example run with minimap2 input:

spliced_bam2gff -M minimap_sorted.bam > raw_transcripts.gff

Example run with minimap2 input, stranded mode:

spliced_bam2gff -s minimap_sorted.bam > raw_transcripts.gff

Example run with GMAP input:

spliced_bam2gff gmap_sorted.bam > raw_transcripts.gff

cluster_gff

Usage of ./cluster_gff:
  -V    Print out version.
  -a string
        Write clusters in tabular format in this file.
  -c int
        Minimum cluster size. (default 10)
  -d int
        Exon boundary tolerance. (default 10)
  -e int
        Terminal exons boundary tolerance. (default 30)
  -h    Print out help message.
  -p float
        Minimum isoform percentage. (default 1)
  -prof string
        Write out CPU profiling information.
  -t int
        Number of cores to use. (default 4)

The -e parameter is the maximum distance tolerated at the start of the first exon and the end of last exon, while -d is the tolerance for all other exon boundaries.
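
To illustrate how the two tolerances interact, here is a minimal Python sketch of a boundary comparison between two transcripts. It is not the clustering algorithm used by cluster_gff; the function name is hypothetical, exons are assumed to be (start, end) pairs sorted by position, and requiring an equal number of exons is an assumption made for simplicity.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # (start, end), sorted by genomic position


def boundaries_match(a: List[Exon], b: List[Exon], d: int = 10, e: int = 30) -> bool:
    """Check two transcripts exon by exon using the -d and -e tolerances."""
    if len(a) != len(b):
        return False  # assumption: only transcripts with equal exon counts are compared
    last = len(a) - 1
    for i, ((a_start, a_end), (b_start, b_end)) in enumerate(zip(a, b)):
        # The start of the first exon and the end of the last exon use -e,
        # every other boundary uses -d.
        start_tol = e if i == 0 else d
        end_tol = e if i == last else d
        if abs(a_start - b_start) > start_tol or abs(a_end - b_end) > end_tol:
            return False
    return True


reference = [(100, 200), (300, 400), (500, 600)]
ragged_ends = [(125, 205), (295, 400), (500, 628)]   # differs mostly at the termini
shifted_splice = [(100, 215), (300, 400), (500, 600)]  # internal boundary off by 15

print(boundaries_match(reference, ragged_ends))
print(boundaries_match(reference, shifted_splice))
```

The first pair differs by up to 28 bases at the transcript ends but stays within the default -e of 30, while the second pair is rejected because an internal boundary exceeds the default -d of 10.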

Transcript clusters smaller than the value of the -c parameter are discarded. This parameter has the largest effect on the sensitivity and specificity of transcript reconstruction: larger values usually give higher specificity at the expense of sensitivity.
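
The effect of the minimum cluster size can be pictured with a small Python sketch. The data structure (a mapping from a cluster identifier to the reads it contains) and all names are hypothetical and do not reflect the tool's internal representation.

```python
from typing import Dict, List


def filter_clusters(clusters: Dict[str, List[str]], min_size: int = 10) -> Dict[str, List[str]]:
    """Discard clusters with fewer than min_size members (the -c rule)."""
    return {cid: reads for cid, reads in clusters.items() if len(reads) >= min_size}


# Hypothetical clusters of 3, 12 and 40 reads.
clusters = {
    "cluster_a": [f"read_{i}" for i in range(3)],
    "cluster_b": [f"read_{i}" for i in range(12)],
    "cluster_c": [f"read_{i}" for i in range(40)],
}

for min_size in (1, 10, 50):
    kept = sorted(filter_clusters(clusters, min_size))
    print(min_size, kept)
```

Raising the threshold removes the weakly supported clusters first, which is why larger values tend to trade sensitivity for specificity.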

Example run with default minimum cluster size and tolerance values:

cluster_gff -a clusters.tsv raw_transcripts.gff > clustered_transcripts.gff

Example run with custom parameters:

cluster_gff -c 5 -e 50 -d 5 -a clusters.tsv raw_transcripts.gff > clustered_transcripts.gff

polish_clusters

Usage of ./polish_clusters:
  -V    Print out version.
  -a string
        Read cluster memberships in tabular format.
  -c int
        Minimum cluster size. (default 1)
  -d string
        Location of temporary directory.
  -h    Print out help message.
  -m    Do not load all reads in memory (slower).
  -o string
        Output fasta file.
  -t int
        Number of cores to use. (default 4)
  -x string
        Arguments passed to minimap2.
  -y string
        Arguments passed to racon.

Example run:

polish_clusters -a clusters.tsv -c 50 -o consensus_transcripts.fas -t 40 sorted.bam

The resulting consensus transcripts can be mapped to the genome using minimap2.
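
As a sketch of that mapping step, the following Python helper builds a shell command that aligns the consensus transcripts in spliced mode and sorts the result. The helper name and the file names are hypothetical, and the use of samtools for sorting is an assumption; only the standard minimap2 options -t and -ax splice and the samtools sort options -@ and -o are used.

```python
import shlex


def map_consensus_command(genome_fasta: str, consensus_fasta: str, out_bam: str, threads: int = 4) -> str:
    """Build a 'minimap2 | samtools sort' command line for spliced alignment."""
    align = ["minimap2", "-t", str(threads), "-ax", "splice", genome_fasta, consensus_fasta]
    sort = ["samtools", "sort", "-@", str(threads), "-o", out_bam, "-"]
    # shlex.join quotes each argument, so paths containing spaces stay intact.
    return f"{shlex.join(align)} | {shlex.join(sort)}"


# Hypothetical file names; the command is only printed, not executed.
print(map_consensus_command("genome.fa", "consensus_transcripts.fas", "polished_sorted.bam"))
```

The sorted BAM file produced by such a command could then be converted to GFF with spliced_bam2gff -M, as in the minimap2 example above.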

collapse_partials

Usage of ./collapse_partials:
  -M    Discard monoexonic transcripts.
  -U    Discard transcripts which are not oriented.
  -V    Print out version.
  -d int
        Internal exon boundary tolerance. (default 5)
  -e int
        Three prime exons boundary tolerance. (default 30)
  -f int
        Five prime exons boundary tolerance. (default 5000)
  -h    Print out help message.
  -prof string
        Write out CPU profiling information.
  -t int
        Number of cores to use. (default 4)

The -d parameter is the exon boundary difference tolerated at internal splice sites, while -e and -f are the tolerance values at the 3' and 5' ends respectively. Transcripts which are not oriented are all assigned to distinct "loci" and left untouched by default (but see the -U flag).
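
The strand-aware use of the three tolerances can be sketched in Python as follows. This is an illustration only, not the algorithm implemented by collapse_partials: the function name is hypothetical, exons are assumed to be (start, end) pairs sorted by genomic position, and the comparison is restricted to transcripts with the same number of exons, which the real tool may not require.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # (start, end), sorted by genomic position


def within_collapse_tolerance(
    a: List[Exon], b: List[Exon], strand: str, d: int = 5, e: int = 30, f: int = 5000
) -> bool:
    """Compare two oriented transcripts using the -d, -e and -f tolerances."""
    if strand not in ("+", "-") or len(a) != len(b):
        return False  # unoriented transcripts are left untouched in this sketch
    # On '+' the 5' end is the leftmost boundary and the 3' end the rightmost;
    # on '-' the roles are swapped.
    left_tol, right_tol = (f, e) if strand == "+" else (e, f)
    last = len(a) - 1
    for i, ((a_start, a_end), (b_start, b_end)) in enumerate(zip(a, b)):
        start_tol = left_tol if i == 0 else d
        end_tol = right_tol if i == last else d
        if abs(a_start - b_start) > start_tol or abs(a_end - b_end) > end_tol:
            return False
    return True


full_length = [(1000, 2000), (3000, 3500)]
truncated = [(1800, 2000), (3000, 3520)]  # shorter on the left by 800 bases

print(within_collapse_tolerance(full_length, truncated, "+"))
print(within_collapse_tolerance(full_length, truncated, "-"))
```

On the forward strand the 800-base difference falls at the 5' end and is covered by the generous default -f of 5000, whereas on the reverse strand the same difference falls at the 3' end and exceeds the default -e of 30.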

Example run:

collapse_partials -d 10 -e 35 -f 1000 input.gff > collapsed_output.gff

Running tests

For running tests the following dependencies have to be installed:

Both are easy to install using bioconda. Look into the Makefiles for targets testing the tools on simulated and real data.

Help

Licence and Copyright

(c) 2018 Oxford Nanopore Technologies Ltd.

This Source Code Form is subject to the terms of the Mozilla Public License, v. 2.0. If a copy of the MPL was not distributed with this file, You can obtain one at http://mozilla.org/MPL/2.0/.

FAQs and tips

References and Supporting Information

See the post announcing the tool at the Oxford Nanopore Technologies community.