TAGET user manual

TAGET is a computational toolkit that provides a wide spectrum of tools for analyzing full-length transcriptome data. Building on highly precise transcript alignment and junction prediction, TAGET enables accurate novel isoform detection, gene fusion detection, and expression quantification.

Environment dependencies

At least one of HISAT2, Minimap2, or GMAP
samtools
python3
R>=3.3
Linux (CentOS)
Python packages: rpy2, pandas, numpy
R packages: stringr, optparse, DEGseq

FAST RUN

python TransAnnot.py -f [fasta] -g [genome fasta] -o [output directory] -a [annot gtf] -p [process] --use_minimap2 [1] --use_hisat2 [hisat2 index]

or you can use

python TransAnnot.py -c TransAnnot.Config
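
When several samples are processed from a driver script, the command line shown above can be assembled programmatically. The following is a minimal sketch that only builds the argument list from the flags listed in this manual. The helper name `build_transannot_cmd` and the example paths are hypothetical, and nothing is executed until the list is handed to `subprocess.run`.

```python
def build_transannot_cmd(fasta, genome, outdir, gtf, processes,
                         use_minimap2=True, hisat2_index=None):
    """Return the argument list for a TransAnnot.py run (hypothetical helper).

    Only flags that appear in this manual are used.
    """
    cmd = ["python", "TransAnnot.py",
           "-f", fasta, "-g", genome, "-o", outdir,
           "-a", gtf, "-p", str(processes)]
    if use_minimap2:
        cmd += ["--use_minimap2", "1"]
    if hisat2_index is not None:
        cmd += ["--use_hisat2", hisat2_index]
    return cmd


# Example paths are placeholders, not files shipped with TAGET.
cmd = build_transannot_cmd("sample.fa", "hg38.fa", "out", "hg38.ensembl.gtf", 8,
                           hisat2_index="hg38_index")
print(" ".join(cmd))
# To launch the run:
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems when paths contain spaces.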

Running time

A run takes less than about 1 hour with 8 cores on a Linux server.

Running the software

The config file contains the path of each external program and the index file of the reference genome.

On first use, set the following parameters:
- the path of HISAT2/Minimap2
- the index file of the reference genome
- the reference genome (FASTA), the transcript annotation file (GTF, Ensembl by default), and the number of processes
After setting these base parameters, set the FASTA file of full-length transcripts and the output directory in the config file, or pass them on the command line together with the config: -c [config] -f [fasta] -o [output]
Read, transcript, and gene expression can be calculated with the --tpm parameter.

Output files

The output contains the following files:

{sample_id}.annot.bed: BED-format file with annotated genes
{sample_id}.annot.stat: the annotation of each transcript
{sample_id}.annot.db.pickle: the input file for visualization
{sample_id}.annot.cluster.gene: the gene clusters
{sample_id}.annot.cluster.transcript: the transcript clusters
{sample_id}.annot.cluster.reads: the read clusters
{sample_id}.annot.junction: the splice junction information
{sample_id}.annot.multiAnno: transcripts with multiple annotations

Columns of {sample_id}.annot.stat:

ID: read ID
Classification: classification of the read
Subtype: subtype of the read
Gene: gene annotation or region in the genome [chr1:100000-100500]
Transcript: transcript annotation
Chrom: chromosome
Strand: strand
Seq_length: read length
Seq_exon: number of exons in the read
Ref_length: length of the annotated transcript
Ref_exon_num: number of exons in the annotated transcript
diff_to_gene_start: difference at the 5' end between the read and the annotated gene on the reference genome
diff_to_gene_end: difference at the 3' end between the read and the annotated gene on the reference genome
diff_to_transcript_start: difference at the 5' end between the read and the annotated transcript on the reference genome
diff_to_transcript_end: difference at the 3' end between the read and the annotated transcript on the reference genome
exon_miss_to_transcript_start: number of exons missing at the 5' end of the read relative to the annotated transcript
exon_miss_to_transcript_end: number of exons missing at the 3' end of the read relative to the annotated transcript
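
As an illustration of how the columns described above might be used, the sketch below tallies reads per Classification value. It assumes that the .annot.stat file is tab-separated and that its first row holds the column names; neither is stated in this manual, so check a real file first. The function name and the tiny example table are made up for illustration.

```python
import csv
import io
from collections import Counter


def count_classifications(handle):
    """Count reads per Classification value in an annot.stat table.

    Assumes a tab-separated table whose first row holds the column names.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row["Classification"] for row in reader)


# Made-up rows standing in for {sample_id}.annot.stat (only four columns shown).
example = io.StringIO(
    "ID\tClassification\tSubtype\tGene\n"
    "read1\tFSM\t-\tGENE_A\n"
    "read2\tISM\t-\tGENE_A\n"
    "read3\tFSM\t-\tGENE_B\n"
)
counts = count_classifications(example)
for label, n in counts.most_common():
    print(label, n)
```

For a real file, replace the in-memory table with `open("sample.annot.stat", newline="")`.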

TransAnnot.Config

FASTA: [path], input file, full-length transcripts in FASTA format
OUTPUT_DIR: [path], the output directory
GENOME_FA: [path], the FASTA file of the reference genome (e.g. hg38.fa)
GTF_ANNOTATION: [path], the gene annotation file, GTF format by default
PROCESS: [int], the number of processes
SAMPLE_UNIQUE_NAME: [string], the output prefix of each file
PYTHON: [path], the path to python
TAGET_DIR: [path], the path to TAGET
SAMTOOLS: [path], the path to samtools
USE_HISAT2: [int], whether to use HISAT2: 1 means use, 0 means do not use
HISAT2: [path], the path to HISAT2
HISAT2_INDEX: [path], the path to the HISAT2 index generated by hisat2-build
USE_MINIMAP2: [int], whether to use Minimap2: 1 means use, 0 means do not use
MINIMAP2: [path], the path to Minimap2
USE_GMAP: [int], whether to use GMAP: 1 means use, 0 means do not use
GMAP: [path], the path to GMAP
GMAP_INDEX: [path], the path to the GMAP index generated by gmap_build
TPM_LIST: [path], the isoform expression file
READ_LENGTH: [int], the read length used by HISAT2, default 100
READ_OVERLAP: [int], the read overlap used by HISAT2, default 80
MIN_READ_LENGTH: [int], the minimum read length, default 30
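
Before a long run it can help to sanity-check these settings. The sketch below works on a plain dict of values, so it makes no claim about the on-disk syntax of TransAnnot.Config. It checks that at least one aligner is switched on, as the dependency list requires, and, as an assumption of this example, that each enabled aligner also has its path and index keys filled in. The function name is hypothetical.

```python
def check_aligner_settings(cfg):
    """Return a list of problems found in a dict of TransAnnot.Config values."""
    # Keys each aligner is assumed to need when its USE_* flag is 1.
    aligners = {
        "USE_HISAT2": ["HISAT2", "HISAT2_INDEX"],
        "USE_MINIMAP2": ["MINIMAP2"],
        "USE_GMAP": ["GMAP", "GMAP_INDEX"],
    }
    problems = []
    enabled = [flag for flag in aligners if int(cfg.get(flag, 0)) == 1]
    if not enabled:
        problems.append("enable at least one of HISAT2, Minimap2, or GMAP")
    for flag in enabled:
        for key in aligners[flag]:
            if not cfg.get(key):
                problems.append(f"{flag}=1 but {key} is not set")
    return problems


# Placeholder values for illustration only.
settings = {"USE_HISAT2": 1, "HISAT2": "/usr/bin/hisat2",
            "USE_MINIMAP2": 0, "USE_GMAP": 0}
problems = check_aligner_settings(settings)
print(problems)
```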

TransAnnotMerge

TransAnnotMerge generates an expression matrix across multiple samples.

FAST RUN

Extract isoform expression from the FASTA file: python fa2exp.py -f [fa] -i [prefix] -o [exp] -p [TAGET output directory]

-f: full-length transcript file in FASTA format
-o: output directory
-p: prefix of the {sample_unique_name}.anno files

python script.py input.config

This step requires the exon.gtf file, which can be extracted with: unzip exon.gtf.zip

Usage of TransAnnotMerge

python TranAnnotMerge -c MergeConfig -o outputdir -m [TPM/FLC/None]

-c: the merge config, consisting of four columns: sample ID, {sample_id}.annot.stat, {sample_id}.annot.bed, and {sample_id}.annot.db.pickle.

#sample	stat	bed	db
-------	----	---	--

-o: the output directory
-m: the method used to report gene and transcript expression: TPM, or FLC (full-length count); with None, no expression matrix is generated
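
The four-column merge config can be generated rather than typed by hand. The sketch below assumes the file is tab-separated, that it starts with the `#sample stat bed db` header row shown above, and that each sample's files sit in a TAGET output directory and are named with the sample ID as prefix. The function name is hypothetical; the sample IDs are the ones used in the demo.

```python
import os


def merge_config_lines(samples):
    """Build the rows of a MergeConfig file.

    `samples` maps a sample ID to the TAGET output directory of that sample.
    """
    lines = ["#sample\tstat\tbed\tdb"]
    for sample_id, outdir in samples.items():
        prefix = os.path.join(outdir, sample_id)
        lines.append("\t".join([
            sample_id,
            prefix + ".annot.stat",
            prefix + ".annot.bed",
            prefix + ".annot.db.pickle",
        ]))
    return lines


lines = merge_config_lines({"759133C": "759133C", "759133N": "759133N"})
print("\n".join(lines))
# To save it: open("MergeConfig", "w").write("\n".join(lines) + "\n")
```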

Running TransAnnotMerge

Extract isoform expression from the FASTA file: python fa2exp.py -f [fa] -o [exp]
Run TranAnnotMerge: python TranAnnotMerge -c MergeConfig -o outputdir -m TPM

TransAnnotMerge result files

{sample_id}.reads.exp: the read expression of each file
{sample_id}.transcript.exp: the transcript expression of each file
{sample_id}.gene.exp: the gene expression of each file
gene.exp: the gene expression matrix across samples
transcript.exp: the transcript expression matrix across samples
merge.db.pickle: used to view transcripts

DIU analysis

python expression_V1.py -t {sample_id}.transcript.exp -g {sample_id}.gene.exp -o {prefix}
-t: transcript expression of tumor and normal samples
-g: gene expression of tumor and normal samples
-o: prefix of the output file
-r: threshold for filtering low-expression transcripts, default 0.05
-p: threshold for filtering low-expression genes, default 50

Classification of isoforms annotated by TAGET

Classification of transcripts

FSM: full splice site match
ISM: incomplete splice site match
NIC: novel in catalog
NNC: novel not in catalog
GENIC: genic
INTERGENIC: intergenic
FUSION: fusion
UNKNOWN: unknown
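
For reports it is often handy to expand these abbreviations. The sketch below keeps the list above as a lookup table and prints a count per class with its full name. The function name and the example input are made up, and codes outside the table are reported as unrecognized instead of raising an error.

```python
from collections import Counter

# Abbreviations and meanings as listed in this manual.
TRANSCRIPT_CLASSES = {
    "FSM": "full splice site match",
    "ISM": "incomplete splice site match",
    "NIC": "novel in catalog",
    "NNC": "novel not in catalog",
    "GENIC": "genic",
    "INTERGENIC": "intergenic",
    "FUSION": "fusion",
    "UNKNOWN": "unknown",
}


def summarize(codes):
    """Return one 'CODE (meaning): count' line per class, most frequent first."""
    counts = Counter(codes)
    return [
        f"{code} ({TRANSCRIPT_CLASSES.get(code, 'unrecognized code')}): {n}"
        for code, n in counts.most_common()
    ]


report = summarize(["FSM", "NIC", "FSM", "FUSION"])
print("\n".join(report))
```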

Classification of exons

KE: known exon
LEKE: left end known exon
REKE: right end known exon
NEKSLE: novel exon with a known splice site in the left end exon and a unique region that overlaps at least two known exons
NEKSRE: novel exon with a known splice site in the right end exon and a unique region that overlaps at least two known exons
IE: intron retention: two known splice sites from sequential exons of the same transcript
NEDT: novel exon with two known splice sites from different transcripts
NELS: novel exon with novel left splice site
NERS: novel exon with novel right splice site
LEE: left exon extension: a novel splice site at the left end of an exon that is longer than any exon overlapping it
REE: right exon extension: a novel splice site at the right end of an exon that is longer than any exon overlapping it
NEDS: novel exon with double novel splice sites that overlaps at least one known exon
NEIG: novel exon inner-gene: novel exon inside a gene without any overlap with known exons
NEOG: novel exon inter-gene: novel exon outside any gene
NELE: novel exon with novel splice site in the far left exon
NERE: novel exon with novel splice site in the far right exon
MDNS: monoexon with double novel splice sites

TAGET gene fusion detection and filtering

Rscript TAGET_fusion_2-3_ajust.r -j Jin_fusion_select.py -e STAT_select.py -c chuli.py -l ${i}.fa.minimap2.bed -s ${i}.fa.hisat2.bed -a ${i}.fa.anno.tmp.stat -t hg38.gtf -f ${i}.fa -n ${i} -o ./output
-l ${i}.fa.minimap2.bed: the BED file of Minimap2 alignments generated by TAGET
-s ${i}.fa.hisat2.bed: the BED file of HISAT2 alignments generated by TAGET
-f ${i}.fa: the CCS reads from the PacBio platform
-n ${i}: the prefix of the generated file names

An example of TAGET

Download data.7z, demo.7z, and script.7z to run the demo; the run takes less than about 1 hour with 8 cores on a Linux server. Details are given in example.readme. The reference genome can be downloaded from https://disk.pku.edu.cn:443/link/1F62976F65C4EA81C4C06A05E245049D

Reference genome

We used hg38 to annotate transcripts; the human reference hg38.fa and an Ensembl GTF file are required. HISAT2 and Minimap2 each need an index of this reference. The pickle file of the GTF annotation can be generated with gtf_db_make.py:
python gtf_db_make.py hg38.ensembl.gtf hg38.ensembl.v20200306.1.pickle
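
The preparation steps for the reference can be collected in one place. The sketch below only assembles the commands: `hisat2-build <reference> <index base>` and `minimap2 -d <index.mmi> <reference>` are the standard index-building invocations of those tools, and the third command is the gtf_db_make.py call from this manual. Whether TAGET expects a prebuilt Minimap2 .mmi file is an assumption of this example, and the function name and index prefix are hypothetical.

```python
def reference_setup_commands(genome_fa, gtf, index_prefix, pickle_out):
    """Return the commands that prepare the reference, as argument lists."""
    return [
        # HISAT2 index: writes index_prefix.*.ht2 files
        ["hisat2-build", genome_fa, index_prefix],
        # Minimap2 index (assumed to be wanted as a prebuilt .mmi file)
        ["minimap2", "-d", index_prefix + ".mmi", genome_fa],
        # Pickle of the GTF annotation, as described in this manual
        ["python", "gtf_db_make.py", gtf, pickle_out],
    ]


commands = reference_setup_commands(
    "hg38.fa", "hg38.ensembl.gtf", "hg38", "hg38.ensembl.v20200306.1.pickle")
for command in commands:
    print(" ".join(command))
# To execute them in order:
#   import subprocess
#   for command in commands:
#       subprocess.run(command, check=True)
```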

Demos

dependency

HISAT2 v2.2.1
MINIMAP2 v2.24
samtools v1.19
python3.9
R3.5
Linux CentOS 7
Python packages:
rpy2 v3.3.3
pandas v1.2.3
numpy v1.22.0
R packages:
stringr v1.5.0
optparse v1.7.3
DEGseq v1.12

The reference genome can also be downloaded from https://disk.pku.edu.cn:443/link/5E0152D82C71B992690CFA9D7A3B5CF8 or https://zenodo.org/records/10091914

1. Fast run

python TransAnnot.py -c 759133C.Config
python TransAnnot.py -c 759133N.Config
running time: 72 minutes
outputs:
the directory of 759133C
759133C.minimap2.bed
759133C.hisat2.bed
759133C.annot.bed
759133C.annot.stat
759133C.annot.db.pickle
759133C.annot.cluster.gene
759133C.annot.cluster.transcript
759133C.annot.cluster.reads
759133C.annot.junction
759133C.annot.multiAnno
759133C.anno.tmp.stat

the directory of 759133N
759133N.minimap2.bed
759133N.hisat2.bed
759133N.annot.bed
759133N.annot.stat
759133N.annot.db.pickle
759133N.annot.cluster.gene
759133N.annot.cluster.transcript
759133N.annot.cluster.reads
759133N.annot.junction
759133N.annot.multiAnno
759133N.anno.tmp.stat

2. TransAnnotMerge

python fa2exp.py -f 759133C.fa -i 759133C -o 759133C -p ./expression
python fa2exp.py -f 759133N.fa -i 759133N -o 759133N -p ./expression
python script.py input.config
running time: 35 minutes
outputs:
759133.reads.exp
759133.transcript.exp

3. DIU analysis

python expression_V1.py -t 759133.transcript.exp -g 759133.gene.exp -o 759133
running time: 2 minutes
outputs:
759133_DIU.txt

4. Gene fusion

Rscript TAGET_fusion_2-3_ajust.r -j Jin_fusion_select.py -e STAT_select.py -c chuli.py -l 759133C.minimap2.bed -s 759133C.hisat2.bed -a 759133C.fa.anno.tmp.stat -t hg38.gtf -f 759133C.fa -n 759133C -o ./output
running time: 18 minutes
outputs:
759133C.fusion
