nghiavtr / FuSeq

GNU General Public License v3.0
32 stars 12 forks source link

#####################################

Documents for FuSeq version 1.1.4

#####################################

Update news

05 Apr 2020: release FuSeq_WES

Release FuSeq_WES for fusion detection from DNA sequencing data: https://github.com/nghiavtr/FuSeq_WES

19 Oct 2020: version 1.1.4

1) add "N.A" to column "symbol5" and "symbol3" if they are not found in HGNC 2) update excludeTxVersion.R to remove also the version information of gene from the cdna fasta file 3) generic scripts for a species are provided in ex_createAnno.R and ex_createAnno.sh files in folder createAnno

19 Aug 2020: version 1.1.3

1) speed up the processSplitRead() in the paralogs checking 2) export fragmentDist.txt for the data with a low mapped read rate 3) fix the error when installing from sources caused by the dependency fmtlib mentioned in https://github.com/nghiavtr/FuSeq/issues/1 4) fix the error when running the binary version in Ubuntu

09 Sep 2019: version 1.1.2

1) Fix small bugs in processSplitRead.R and postProcessSplitRead.R 2) Add excludeDiscordantTx.R (details in Section 7) to remove the discordant transcripts which are existing in fasta cdna file but not in gtf file. These discordant transcripts can crash down the split-read pipeline. This step is recommmended for any untested or new annotations such as Homo_sapiens.GRCh38.

13 June 2019: version 1.1.1

1) Speed up functions processFEQ() and detectJunctionBreaks() 2) Add a parameter (exonBoundary) to control junction breaks inside exons 3) Improve clean codes in FuSeq.cpp

14 Nov 2018: version 1.1.0

1) Add the description of how to generate the annotation files for new species 2) Minor changes in output data of FuSeq

01 Oct 2018: version 1.0.0

1) Correct the headers of split reads when exporting their sequences to file

23 Jul 2018: version 0.1.1

1) Error when running with Ubuntu 16 from librt.so.1: fixed by simply removing the librt.so.1 from the library folder of FuSeq

2) Allow inverted fusion genes by default: there are existing both fusion genes A-B and B-A maxInvertedFusionCount=0

3) Optimize memory and space processFEQ.R: remove myFusion as a redundant data FuSeq_functions.R: fix some bugs of memory usage in function computeGeneDistance()

4) Export more information in the output fusions.FuSeq

21 May 2017: version 0.1.0

1. Introduction

FuSeq is a novel method to discover fusion genes from paired-end RNA sequencing data. FuSeq discovers fusion genes based on quasi-mapping to quickly map the reads, extract initial candidates from split reads and fusion equivalence classes of mapped reads, and finally apply multiple filters and statistical tests to get the final candidates. Full details of the method is described in the method publication (1).

Software requirements:

FuSeq is implemented in R and C++. We acknowledge for materials from Sailfish, Rapmap and other tools used in this software.

Annotation reference

FuSeq requires

1) a fasta file of transcript sequences and a gtf file of transcript annotation: can be downloaded from public repositories such as Ensembl (ensembl.org) 2) a RData file of supporting annotation: A description of how to create the RData file for new annotation versions or species is available in Section 7.

Current FuSeq version was tested on the human transcriptome with ensembl annotation version GRCh37.75. Following files are required:

We also prepare annotation RData files for several annotations from Ensembl (ensembl.org) and example codes to build the annotation RData file in folder createAnno. Please see Section 7 for further information.

Versions

The latest version and information of FuSeq are updated at https://github.com/nghiavtr/FuSeq

The older versions can be found here:

2. Download and installation

If you use the binary verion of FuSeq:

Do not forget to replace "/path/to/" by your local path.

3. Index

The command for indexing transcript fasta is similar to the indexing of sailfish and rapmap. For example:

TxIndexer -t /path/to/transcripts.fa -o TxIndexer_idx

If the transcript names in the transcript fasta file contains version data but the gtf file does not, users should remove the version information from the transcript names as described in Section 7

K-mer length for a short-read RNA-seq

In order to be able to capture split reads, the k-mer length must be less than a half of read length. Thus, the default k-mer length (31) of rapmap/sailfish is not proper for a short-read RNA-seq dataset, e.g 50 bp long. For the short-read RNA-seq dataset, we use k-mer length of 21 instead as the follow:

TxIndexer -t /path/to/transcripts.fa -k 21 -o TxIndexer_idx

4. Extract fusion equivalence classes and split reads

This step requires a annotation file (gtf_file) in the gtf format to get the map between gene and transcript.

5. Discover fusion genes

FuSeq uses R scripts in FuSeq_home/R directory for this step. The transcript fasta (/path/to/transcripts.fa in the index step) and two additional files are required:

Now, we can use another R script to discover fusion genes.

Rscript FuSeq_home/R/FuSeq.R in=/path/to/feqDir txfasta=/path/to/transcripts.fa sqlite=/path/to/sqlite_file txanno=/path/to/txanno_file out=FuSeq_outDir params=/path/to/params_file

Herein, the settings for "params" and "out" are optional. If the params_file is not supplied, the default parameter setting will be used. Details about parameter setting are presented in next section. If FuSeq_outDir is not input, FuSeq will save the output into the same directory of the fusion equivalence classes (/path/to/feqDir). Since this step is usually fast, parallel is not implemented.

Output of FuSeq

Fusion genes discovered by FuSeq are stored in a file named fusions.fuseq containing information of the fusion genes:

A log file (FuSeq_logs.RData) contains information of parameter (FuSeq.params), input data path (inPath) and the results of final integration step (FuSeq.integration). FuSeq.integration contains the final results of MR pipeline, SR pipeline and combination.

- If exportFasta=TRUE is set in the params_file, FuSeq will export fusion reads supporting fusion genes in two fasta files named *_fusionReads_1.fa and *_fusionReads_2.fa for read1 and read2, respectively.
- If keepRData=TRUE is set in the params_file, FuSeq will store results of each steps of the pipeline in *.RData files.

6. Parameter setting

There are several parameters input for the pipeline. The default parameter setting is available at FuSeq_home/R/params.txt. Details of these parameters are refered to main publication, herein they are shortly described as the below:

7. Create a supporting annotation RData file

In this section, we describe details how to create a annotation RData for your own annotation. We use the annotations from Ensembl (ensembl.org) as example.

Requirements:

Concordances of transcript/gene names in the fasta file and the gft file of Ensembl annotation

In some recent annotations from Ensembl (e.g., GRCh38 of human), versions are also included in the transcript and gene names of the fasta file (cdna) of transcript sequences. However, it is not consistent with the corresponding gtf file where there are no version information in the names. To solve this issue, we exclude the version information in transcript names and gene names from the cdna fasta file (transcripts.fa) using excludeTxVersion.R:

Rscript FuSeq_home/R/excludeTxVersion.R /path/to/transcripts.fa /path/to/transcripts.cleanversion.fa

Furthermore, there might have also discordant transcripts that they are existing in the fasta cdna file but not the gtf file. We remove these discordant transcripts from the fasta cdna file using excludeDiscordantTx.R:

#creat the sqlite file
Rscript FuSeq_home/R/createSqlite.R transcript.gtf transcript.sqlite 
#remove discordant transcripts from the cdna fasta file
Rscript FuSeq_home/R/excludeDiscordantTx.R  cdna=/path/to/transcripts.cleanversion.fa sqlite=transcript.sqlite out=/path/to/transcripts.clean.fa

The fasta cdna file after removing the version information and discordant transcripts (transcripts.clean.fa) should be used for both indexing step and/or creating RData annotation file in downstream.

Generation of an annotation RData file

We use the transcripts.fa (or the transcripts.clean.fa from the previous step, if the transcript/gene names are not consistent) to generate the annotation RData file. This includes three main steps:

1) extract information of transcripts such as transcript name, gene name, chromosom etc.

2) using biomart to get gene paralog and other information The information of gene paralog is compulsory, all the other information is optional. The data are kept in the objects of:

3) simulate data and find relations between transcripts For simplicity, in this manual, we use the R-package "polyester" to simulate data of all transcripts, then run GenTC to generate data for discovering sequence similarities between two genes.

Some remarks in Step 2 before use:

Example codes for creating an annotation RData

We also prepare example codes in folder createAnno to create a new supporting annotation RData file for

1) the annotation from an animal (Homo sapiens version GRCh38.94) with inconsistent transcript/gene names between fasta file and gtf file. 2) the annotation from a plant (Arabidopsis thaliana version TAIR10.41) with consistent transcript/gene names. 3) generic scripts for a species are provided in ex_createAnno.R and ex_createAnno.sh files in folder createAnno. Please read carefully the instruction in this section before use.

Note that "polyester" R-package might require a huge of memory for generating the simulated data for big transcriptome. In our implementation, we assign a memory of 32GB for TAIR10.41 and 64GB for GRCh38.94.

8. Pratical examples of using FuSeq

In this section, we introduce how to use FuSeq by a step-by-step tutorial using several public real RNA-seq samples. This aims to test FuSeq by just doing a copy-and-paste of the example commands. For simplicity, in this practice, the FuSeq software, the annotation, RNA-seq data samples, and the results of fusion detection are put togetther in the same (current) folder.

8.1. Download and install

Download and configure FuSeq

wget https://github.com/nghiavtr/FuSeq/releases/download/v1.1.4/FuSeq_v1.1.4_linux_x86-64.tar.gz -O FuSeq_v1.1.4_linux_x86-64.tar.gz
tar -xzvf FuSeq_v1.1.4_linux_x86-64.tar.gz
cd FuSeq_v1.1.4_linux_x86-64
bash configure.sh
cd ..

Set paths to FuSeq

export LD_LIBRARY_PATH=$PWD/FuSeq_v1.1.4_linux_x86-64/linux/lib:$LD_LIBRARY_PATH
export PATH=$PWD/FuSeq_v1.1.4_linux_x86-64/linux/bin:$PATH

8.2. Download and prepare the reference files

Download the fasta and gtf of transcripts

wget http://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh37.75.cdna.all.fa.gz
gunzip Homo_sapiens.GRCh37.75.cdna.all.fa.gz
wget http://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz
gunzip Homo_sapiens.GRCh37.75.gtf.gz

Create sqlite

Rscript FuSeq_v1.1.4_linux_x86-64/R/createSqlite.R Homo_sapiens.GRCh37.75.gtf Homo_sapiens.GRCh37.75.sqlite 

Download the extra transcript information and annotation from FuSeq

wget https://github.com/nghiavtr/FuSeq/releases/download/v0.1.0/Homo_sapiens.GRCh37.75.txAnno.RData

8.3. Parameter setting

The default of parameter setting is located at FuSeq_v1.1.4_linux_x86-64/R/params.txt that we will use for the pratical examples. For more advanced-level users, we suggest running FuSeq with the setting of keepRData=TRUE to keep the processed data of FuSeq, then FuSeq will save all data into file FuSeq_process.RData. This file contains the results of both mapped read pipeline and split read pipeline, and extra relevant information of fusion gene candidates such as supporting exons, read mapping positions, sequence reads, etc.

8.4. An example for a short read sample

We select a breast cancer sample from the KPL-4 cell line. This sample contains short reads (50bp), a small library size (6.8 M read pairs) that is suitable for testing purpose. For this dataset, to generate split reads, the default k-mer length of 31 is not appropriate. We use k-mer length of 21 instead. For data downloading and fusion gene detection, it takes around 0.5 hour in total using a single cpu. The time can vary due to download speeds as well.

Download dataset

wget ftp.sra.ebi.ac.uk/vol1/fastq/SRR064/SRR064287/SRR064287_1.fastq.gz
wget ftp.sra.ebi.ac.uk/vol1/fastq/SRR064/SRR064287/SRR064287_2.fastq.gz

Indexing fasta sequences

Note: because of short read, a small k-mer length k=21 (k=31 by default) is used for this dataset. The index reference can be reused for other short read samples

TxIndexer -t Homo_sapiens.GRCh37.75.cdna.all.fa -o TxIndexer_idx_k21 -k 21

Extract fusion equivalence classes and split reads

FuSeq -i TxIndexer_idx_k21 -l IU -1 <(gunzip -c SRR064287_1.fastq.gz) -2 <(gunzip -c SRR064287_2.fastq.gz) -p 1 -g Homo_sapiens.GRCh37.75.gtf -o SRR064287_feqDir 

Discover fusion genes

Rscript FuSeq_v1.1.4_linux_x86-64/R/FuSeq.R in=SRR064287_feqDir txfasta=Homo_sapiens.GRCh37.75.cdna.all.fa sqlite=Homo_sapiens.GRCh37.75.sqlite txanno=Homo_sapiens.GRCh37.75.txAnno.RData out=SRR064287_FuseqOut params=FuSeq_v1.1.4_linux_x86-64/R/params.txt

The results is a list of gene-fusion candidates stored in file fusions.FuSeq in the output folder (SRR064287_FuseqOut).

8.5. An example for a long read sample

Now we apply FuSeq for a long read sample from a glioma dataset. The selected sample SRR934746 contains 24.4 M read pairs with 100bp read long. For this dataset, the default k-mer length of 31 is used. It takes around 01-02 hours for both downloading data and detecting fusion gene candidates using a single cpu. The time also varies depending on download speeds.

Download dataset

wget ftp.sra.ebi.ac.uk/vol1/fastq/SRR934/SRR934746/SRR934746_1.fastq.gz
wget ftp.sra.ebi.ac.uk/vol1/fastq/SRR934/SRR934746/SRR934746_2.fastq.gz

Indexing fasta sequences

The default k-mer length (k=31) is used for reference indexing. The index reference can be reused for other long read samples

TxIndexer -t Homo_sapiens.GRCh37.75.cdna.all.fa -o TxIndexer_idx_k31 -k 31

Extract fusion equivalence classes and split reads

FuSeq -i TxIndexer_idx_k31 -l IU -1 <(gunzip -c SRR934746_1.fastq.gz) -2 <(gunzip -c SRR934746_2.fastq.gz) -p 1 -g Homo_sapiens.GRCh37.75.gtf -o SRR934746_feqDir 

Discover fusion genes

Rscript FuSeq_v1.1.4_linux_x86-64/R/FuSeq.R in=SRR934746_feqDir txfasta=Homo_sapiens.GRCh37.75.cdna.all.fa sqlite=Homo_sapiens.GRCh37.75.sqlite txanno=Homo_sapiens.GRCh37.75.txAnno.RData out=SRR934746_FuseqOut params=FuSeq_v1.1.4_linux_x86-64/R/params.txt

The results is a list of fusion gene candidates stored in file fusions.FuSeq in the output folder (SRR934746_FuseqOut).

9. License

FuSeq uses GNU General Public License GPL-3

10. References

  1. Vu, Trung Nghia, Wenjiang Deng, Quang Thinh Trac, Stefano Calza, Woochang Hwang, and Yudi Pawitan. 2018. “A Fast Detection of Fusion Genes from Paired-End RNA-Seq Data.” BMC Genomics 19 (1): 786. https://doi.org/10.1186/s12864-018-5156-1.