Trinity

Trinity — 无参转录组从头组装 by 生信菜鸟团

工欲善其事必先利其器

Trinity是由Broad 研究所和希伯来大学联合开发的一种利用RNA-seq数据进行高效、稳健地从头组装转录本的方法，包含三个独立的软件模块：

Inchworm （虫）（C++）

将 reads切为 k-mers (k bp长度的短片段)
利用Overlap关系对k-mers进行延伸 (贪婪算法)
输出所有的序列 (“contigs”)

Chrysalis （蛹）（C++）

聚类所有相似区域大于k-1bp的 contigs
构图 (区分不同的 “components”)
将reads比对回 components，进行验证

Butterfly （蝶）（Java）

拆分graph 为线性序列
使用reads以及 pairs关系消除错误序列 Trinity组装依据的算法是de Bruijn Graph,即从打断的文库中提取一定长度的K-mer，然后根据k-1错位相似的方法拼接组装的可能路径，最终确定完整的参考组装转录组。

主要编程语言：Perl
GItHub：https://github.com/trinityrnaseq/trinityrnaseq
文档：https://github.com/trinityrnaseq/trinityrnaseq/wiki

发表文章

题目：Full-length transcriptome assembly from RNA-Seq data without a reference genome
期刊: Nature Biotechnology
日期：2011年5月15日
作者&单位：Manfred G Grabherr, Brian J Haas and Moran Yassour & Broad Institute of Massachusetts Institute of Technology and Harvard
DOI：https://doi.org/10.1038/nbt.1883

如何安装

详见：https://github.com/trinityrnaseq/trinityrnaseq/wiki/Installing-Trinity

conda 安装

推荐使用conda安装，可以自动解决一系列依赖软件的安装.

conda create -n  noref_rnaseq
conda activate noref_rnaseq
#conda install trinity
mamba install trinity
conda install rsem

可以使用conda直接安装，但是需要注意的一点是，可能是所需依赖太多，我用conda安装命令尝试安装了多次都是失败，这时候可以换用mamba试试，同样的镜像，使用mamab 很快就安装好了。

singularity 调用

如果服务器有配置singularity，也可以使用singularity调用Trinity。如果不了解singularity的可以参考：Singularity — 生信流程搭建好帮手

下载singularity镜像文件（https://data.broadinstitute.org/Trinity/TRINITY_SINGULARITY/）

wget -c https://data.broadinstitute.org/Trinity/TRINITY_SINGULARITY/trinityrnaseq.v2.15.1.simg

#文件大小
2.6G 2月   6  2023 trinityrnaseq.v2.15.1.simg

硬件要求注意：基本建议是每 ~1M Illumina reads 对，有 ~1G RAM，还可能需要数百 GB 的可用磁盘空间，并且可以在运行期间生成数千个中间文件。最好有一个临时工作区，并有足够的磁盘空间在任务执行期间使用。

最小化使用

Trinity 组装

#基本
Trinity --seqType fq --max_memory 50G \
         --left reads_1.fq.gz  --right reads_2.fq.gz --CPU 6

# 使用 singularity
singularity exec -e Trinity.simg  Trinity \
          --seqType fq \
          --left `pwd`/reads_1.fq.gz  \
          --right `pwd`/reads_2.fq.gz \
          --max_memory 1G --CPU 4 \
          --output `pwd`/trinity_out_dir


--seqType #输入文件类型
--CPU #设置程序调用的CPU数,默认2
--max_memory #设置程序可使用的最大内存
--left #双端测序的左端，eg:reads_1.fq.gz ；如果有多个文件，可以使用逗号(,)分隔
--right #双端测序的右端，eg:reads_2.fq.gz ；如果有多个文件，可以使用逗号(,)分隔
--single #单端测序数据
--samples_file #制表符分隔的文本文件。如果文件过多，可以将信息写入samples文件中
--monitoring #统计资源使用情况
--output #设置输出文件目录。默认是在工作目录创建'trinity_out_dir'文件夹。注意自定义目录的时候，目录命名必须包含"trinity"
--SS_lib_type #链特异性数据需要设置的参数

注：如果你有特别大的 RNA-Seq 数据集，涉及数亿到数十亿的reads，就需要开启 silico normalizatio （计算机模拟归一化），能有效减少trinity运行时间。（现在新版本是默认开启状态）
目前版本，运行Trinity 除了生成 Trinity.fasta，还直接生成了salmon定量的结果

定量

不管是否采用基于比对的估计，首次运行的时候都需要采用 --prep_reference 构建索引。

基于比对的丰度估计——RSEM

align_and_estimate_abundance.pl \
--transcripts unigene1.fasta --seqType fq \
--left reads_1.fq.gz --right reads_2.fq.gz \
--est_method RSEM --aln_method bowtie2 --trinity_mode --prep_reference \
--output_dir trinity_RSEM_out_dir

--transcripts #转录本fasta文件
--seqType #指定输入数据类型
--est_method  #选择丰度估计方法；基于比对的方法，eg：RSEM ；非比对的方法，eg:kallisto 、salmon
--aln_method #用于指定比对的方法， bowtie|bowtie2 
--thread_count #设置调用的线程数
--trinity_mode #启用Trinity模式，工具会自动生成`gene_trans_map`（一个包含“基因(tab)转录本”标识符的文件）并使用它
--prep_reference #构建索引。进行基于比对的方法（如RSEM）时，需先准备参考序列的索引

无比对的丰度估计

使用kallisto 或者 salmon 。这两个软件前面我们也介绍过：

Kallisto — 基于伪比对的转录本定量

Salmon — 兼具高效、精准及偏差感知的RNA-seq定量工具

##kallisto
align_and_estimate_abundance.pl \
--transcripts Trinity.fasta --seqType fq \
--left reads_1.fq --right reads_2.fq \
--est_method kallisto --trinity_mode --prep_reference \
--output_dir kallisto_quant

##合并所有样本kallisto的结果，生成定量矩阵文件
abundance_estimates_to_matrix.pl --est_method kallisto \
    --gene_trans_map Trinity.fasta.gene_trans_map \
    --out_prefix kallisto \
    --name_sample_by_basedir \
     sampleA/abundance.tsv \
     sampleB/abundance.tsv 
 
##salmon
align_and_estimate_abundance.pl \
--transcripts Trinity.fasta --seqType fq \
--left reads_1.fq --right reads_2.fq  \
--est_method salmon --trinity_mode --prep_reference \
--output_dir salmon_quant

##合并所有样本salmon的结果，生成定量矩阵文件
find . -maxdepth 2 -name "quant.sf" | tee salmon.quant_files.txt

abundance_estimates_to_matrix.pl --est_method salmon \
      --gene_trans_map trinity_out_dir.Trinity.fasta.gene_trans_map \
      --quant_files salmon.quant_files.txt \
      --name_sample_by_basedir

--gene_trans_map #指定一个基因到转录本的映射文件。这个文件的每一行是一个基因和转录本的对应关系。如果你只对转录本层面的估计感兴趣，可以指定`'none'`，跳过基因汇总
--name_sample_by_basedir #这个选项会将样本列名设置为目录名称而不是文件名称。适合当文件名称相同时用目录名称来区分不同样本。
--quant_files #一个包含所有丰度估计结果文件路径的列表文件，每一行是一个丰度估计结果文件的路径。使用该参数避免在命令行中手动列出每个文件路径。

实例演示

运行Trinity

fq1=~/project/FL8/st1_data/C1_1.fq.gz
fq2=~/project/FL8/st1_data/C1_2.fq.gz
outdir=~/trinity_test/trinity_out_dir

/usr/bin/time -v Trinity --seqType fq --max_memory 50G \
         --left ${fq1}  --right ${fq2} \
         --CPU 6 --monitoring \
         --output ${outdir}

Trinity 完成后，它将创建一个 trinity_out_dir ，和输出 trinity_out_dir.Trinity.fasta 、trinity_out_dir.Trinity.fasta.gene_trans_map (或基于您指定的输出目录的前缀）。

序列编号规则。序列编号编码了 Trinity 的 “基因” 和 “转录本” 信息。因为一次 Trinity 运行涉及许多读段簇，每个读段簇都是单独组装的，并且由于 “基因” 编号在给定的已处理读段簇中是唯一的，所以 “基因” 标识符应被视为读段簇和相应基因标识符的组合，以”TRINITY_DN6201_c0_g1_i1“ 为例：

DN通常代表de novo组装编号。6201是该特定组装单元的唯一编号。它表示这是组装过程中发现的第10个独立的基因簇（或组装单元）
c0表示组件编号（component number）。每个组件代表一个可能属于同一个基因的序列集合。c0表示这是该基因簇中的第0号组件。组件是基于组装过程中读段的重叠关系定义的。
g1代表基因编号（gene number）。在组件内，Trinity会将可能代表同一基因的序列分组，g1表示这是组件c0中的第1个基因。
i1表示转录本编号（isoform number）。对于每个基因，Trinity可能会找到多个不同的转录本（即不同的mRNA异构体）。i1表示这是基因g1中的第1个转录本。

所以 ”TRINITY_DN6201_c0_g1_i1“ 表示 Trinity 读段簇为 “TRINITY_DN6201_c0”，基因是 “g1”，转录本是 “i1”。

trinity_out_dir.Trinity.fasta.gene_trans_map 文件是基因ID和转录本ID的对应关系

trinity_out_dir 目录包含以下内容：

$tree -L 1
.
├── both.fa
├── both.fa.ok
├── both.fa.read_count
├── chrysalis
├── inchworm.DS.fa
├── inchworm.DS.fa.finished
├── inchworm.kmer_count
├── insilico_read_normalization
├── jellyfish.kmers.25.asm.fa
├── jellyfish.kmers.25.asm.fa.histo
├── left.fa.ok
├── partitioned_reads.files.list
├── partitioned_reads.files.list.ok
├── pipeliner.932460.cmds
├── read_partitions
├── recursive_trinity.cmds
├── recursive_trinity.cmds.completed
├── recursive_trinity.cmds.ok
├── right.fa.ok
├── __salmon_filt.chkpts
├── salmon_outdir
├── scaffolding_entries.sam
├── Trinity.timing
├── Trinity.tmp.fasta
└── Trinity.tmp.fasta.salmon.idx

6 directories, 19 files

RSEM 定量

align_and_estimate_abundance.pl \
--transcripts Trinity.fasta --seqType fq \
--left reads_1.fq.gz --right reads_2.fq.gz \
--est_method RSEM --aln_method bowtie2 --trinity_mode --prep_reference \
--output_dir trinity_RSEM_out_dir

RSEM 计算生成两个包含丰度估计信息的主要输出文件：RSEM.isoforms.results 和 RSEM.genes.results 、每列内容如下：

transcript_id ：转录本的唯一标识符
gene_id ：基因的唯一标识符
length ：该基因或转录本的总长度，以碱基对（bp）为单位
effective_length：有效长度，即经过校正后可以用于定量分析的序列长度
expected_count：期望读段数，即RSEM估计的在该基因或转录本上比对到的读段数量。这个值并不等同于原始的比对读段数，而是经过算法校正后的结果。
TPM：标准化后的表达值，表示每百万个读段中有多少读段映射到该转录本。
FPKM：标准化的表达量单位，表示每千碱基转录本每百万个比对读段中的片段数。
IsoPct：表示该转录本在其所属基因的所有转录本中所占的百分比。它反映了该转录本在所有该基因转录本中的丰度比例。

kallisto 定量

fasta=~/trinity_test/trinity_out_dir.Trinity.fasta
fq1=~/project/FL8/st1_data/C1_1.fq.gz
fq2=~/project/FL8/st1_data/C1_2.fq.gz
outdir=~/trinity_test/kallisto_quant

align_and_estimate_abundance.pl \
--transcripts trinity_out_dir.Trinity.fasta --seqType fq \
--left ${fq1} --right ${fq2} \
--est_method kallisto --trinity_mode --prep_reference \
--output_dir ${outdir}

结果文件

904K 10月  1 19:11 abundance.h5
3.4M 10月  1 19:11 abundance.tsv
1.9M 10月  1 19:11 abundance.tsv.genes
 526 10月  1 19:11 run_info.json

转录本ID定量结果

salmon 定量

fasta=~/trinity_test/trinity_out_dir.Trinity.fasta
fq1=~/project/FL8/st1_data/C1_1.fq.gz
fq2=~/project/FL8/st1_data/C1_2.fq.gz
outdir=~/trinity_test/salmon_quant

align_and_estimate_abundance.pl \
--transcripts trinity_out_dir.Trinity.fasta --seqType fq \
--left ${fq1} --right ${fq2} \
--est_method salmon --trinity_mode --prep_reference \
--output_dir ${outdir}