Known and Novel IsoForm Explorer. Statistically based splicing detection for circular and linear isoforms from RNA-Seq data.
Note: as of release v1.3, circular and linear junctions are quantified based on both R1 and R2 junctional alignments. Prior versions considered only R1 junctional alignments. If both R1 and R2 aligned to the same junction as can be the case for small circles, that read is only counted once in the reported totals.
Note: It has been brought to our attention that there was a discrepancy in the SRA annotations in the NCBI short read archive for the HeLa samples we analyzed for Figure 3 in our manuscript and their true identity. Please see misAnnotationUpdate.zip available in Downloads under release v1.2 for updated analysis.
This statistical algorithm increases the sensitivity and specificy of circularRNA detection from RNA-Seq data by quantifying circular and linear RNA splicing events at both annotated and un-annotated exon boundaries. It can analyze single-end or paired-end reads stored in plain-text fastq or gzipped fastq files. A version of the code that runs on a single Linux machine is provided in circularRNApipeline_Standalone. circularRNApipeline_Cluster contains code that provides the same results, but includes wrapper scripts to run on a cluster using either a SLURM or Sun Grid Engine scheduler. Parameters and output files are described below, and README files within each of these code directories contain detailed instructions for exactly how to run the code.
As for all RNA-Seq analysis tools, our algorithm may perform poorly on low quality data. Therefore, we strongly recommend using our algorithm only on reads that have been processed by a tool such as cutAdapt to trim poor quality ends and adapters as predictions will be adversely affected by low-quality reads.
See INSTALL file for instructions on installing KNIFE.
Annotated junction indices are available for Human (hg19), Mouse (mm10), Rat (rn5) and Drosophila (dm3). We have also packaged up all of the the transcriptome, genome, and ribosomal index, fasta, and gtf files for each of these genomes named as required for use with our scripts. See INSTALL file for directions on installing these genomes.
Code and instructions for creating a new index are provided in createJunctionIndex.
Running the algorithm is slightly different in the Standalone or Cluster implementaions (see README files within those directories for exact usage information), but the parameters are the same. There are a total of 10 possible parameters, the first 5 are required and defaults for the rest are described here:
read_directory: absolute path to directory containing fastq files for alignment. Paired-end reads (PE) must have read1 and read2 in separate files. The file names for a given sample must start with an identical string that identifies the sample and then have either _1, _R1, _2, or _R2 identifying them as read1 or read2. Any part of the file name after this will be ignored. Reads may be in gzipped files or plain text fastq files, but the file extension must be .fq, .fq.gz, .fastq, or .fastq.gz. Read file names must not contain the string 'unaligned' as this will interfere with logic around identifying de novo junctions. For simplicity, single end read files must follow naming requirements for read1 shown below (must have _1 or _R1 in file name). Valid naming examples are:
SAMPLENAME_1.fastq, SAMPLENAME_2.fastq
SAMPLENAME_1.fq.gz, SAMPLENAME_2.fq.gz
SAMPLENAME_R1.fq, SAMPLENAME_R2.fq
read_id_style: complete|appended (use complete for single end).
Note that the read id is only what is up to the first space on the sequence identifier line.
Everything after that is an optional description and is ignored.
So for example, if the sequence identifier lines in your 2 files are:
@SRR445016.1 UNC15-SN850_0140:1:1101:1247:2164/1
@SRR445016.1 UNC15-SN850_0140:1:1101:1247:2164/2
then the read id is SRR445016.1 and the style "complete" should be used
complete: read ids in the fastq files for read 1 and read 2 are identical for mates
appended: the last character of the read id is different in the 2 read files for mates
(ex: @27847_787_327_212/1, @27847_787_327_212/2)
alignment_parent_directory: absolute path to directory where the dataset analysis output and log files will be stored. This directory must already exist, and a directory named dataset_name (see below) will be created under this directory for all output files.
dataset_name: string identifier for this dataset. A folder of this name will be created under alignment_parent_directory (see above) and all output for this run will be stored in this directory.
junction_overlap: minimum number of bases in the read which must be on each side of the junction to consider that the read is overlapping the junction. Values that have empirically worked well are 8 for paired-end (PE) reads of length < 70, 13 for longer PE, and 10 for single-end (SE) reads of length < 70, 15 for longer SE reads.
mode: the mode that the script will be run in. This affects which portions of the code are executed, output file types, etc. Portions of the mode string can be specified in any order. For example, mouse_phred64 is the same as phred64_mouse.
sam (default): output alignments as sam files. bam output is currently not supported
rat|mouse|fly: need to append this string to the mode for the respective organism, otherwise will default to human.
analysis: alignment sam or bam files have already been generated, so just go straight to analysis step
instead of running Bowtie2
phred64: need to append this string to the mode to indicate that quality scores are in phred64,
else phred33 is assumed
report_directory_name: Default circReads. This is the name of the directory created under alignment_parent_directory/dataset_name where the report files listing circular or linear RNA junction read counts and statistical scores will be output.
ntrim: Default 50. When creating the de novo index, this is the number of bases to trim from each end of an unaligned read to try to align independently to the genome. Suggested value is 2n/3 where n is the length of the raw (untrimmed) reads in the dataset.
denovoCircMode: 0|1, default 1. Should we report only circular junctions in the denovo analysis? Changing to 0 means linear denovo junctions are included too. NOTE: denovoCircMode=0 and 1 for the same dataset cannot be run at the same time as there are some dataset-specific files that will get overwritten. If you want to run denovoCircMode=0 and 1 on the same dataset, use a different output directory for the 2 unaligned runs so your files do not get overwritten. If you also want to save the sam files for both unaligned mode runs, you will need to copy orig/denovo and orig/still_unaligned before the 2nd run because these files will be overwritten.
junction_id_suffix: Default none. During the analysis step, the ids of all scrambled junction-overlapping reads are output to alignment_parent_directory/dataset_name/orig/ids/junction and the ids of all canonical-order junction-overlapping reads are output to alignment_parent_directory/dataset_name/orig/ids/reg. Subsequent steps then read these files to report on reads aligning to each junction. If you want to try requiring different amounts of junction overlap in order to compare results, you can specify a string be added to the id directory name so that the original results are not overwritten. If you use this option, also make sure that you use a different report_directory_name so that your reports for this run are output to a different directory instead of overwriting those in the default directory.
All output files can be found under [alignment_parent_directory]/[dataset_name] as specified when the script is called. Within this directory the following subdirectories will be created:
circReads (or the directory will be named as specified by the [report_directory_name] parameter). The primary output files you will be interested in looking at are in the following subdirectories.
circleJuncProbs.txt and linearJuncProbs.txt are the summaries of the GLM reports.
naiveunaligned.txt is the summary of alignments to the denovo index.
If you have generated the combinedReports directory above (this occurs automatically in the circularRNApipeline_Standalone code and requires an additional call in the circularRNApipeline_Cluster code - see circularRNApipeline_Cluster README for details), an additional output directory is created as [alignment_parent_directory]/[dataset_name]Swapped. The output under the "circReads" subdirectories here contains equivalent information as described above for read1, but generated from treating R2 as R1. The output detailed above for the "orig" subdirectories and "sampleStats" is not present because Bowtie2 sam outputs from the initial run are re-used for this step.
True circular and linear junctions are a subset of the reported junctions. All identified junctions are reported along with a posterior probability (GLM reports) or p-value (naive reports) which is used to flag false positives. Instead of selecting a threshold on read counts as you would for other circular RNA algorithms, you will select a threshold on these statistical scores and filter the reports. For the publication, we generally considered junctions with a score of 0.9 or higher to be true positive circular RNA (see publication for details). You can filter the results on read count as well if you want to limit your analysis to highly expressed RNA.
Pre-trimmed fastq files for 1 human sample are provided with the release of v1.1. See the README under testData for instructions to run KNIFE on this sample. Expected results are provided in testData/testOutput.
This code was developed and is maintained by Linda Szabo, lszabo at stanford.edu. Please understand that I am continuing to develop new algorithms, but I will do my best to respond in a timely manner.
Szabo L, Morey R, Palpant NJ, Wang PL, Afari N, Jiang C, Parast MM, Murry CE, Laurent LC, Salzman J. Statistically based splicing detection reveals neural enrichment and tissue-specific induction of circular RNA during human fetal development. Genome Biology. 2015, 16:126. http://www.ncbi.nlm.nih.gov/pubmed/26076956