gpertea / stringtie

Transcript assembly and quantification for RNA-Seq
MIT License

Transcriptome assembly with STAR or HISAT #158

Open luigra opened 6 years ago

luigra commented 6 years ago

Dear Geo,

I get different assembled transcriptomes with the latest version of StringTie depending on whether the alignments come from STAR or HISAT2. Below is a screenshot of a region I am interested in.

strtieigv_snapshot

Is there something I need to do, other than sorting the BAM by coordinate, to get a reliable transcriptome from the STAR alignment results?

Thanks in advance,
Luigi

Below are the commands I used:

/home/cbrcmod/scratch/modules/out/modulebin/stringtie/1.3.3/bin/stringtie /scratch/cbrc/analysis/BPepi-22/out/bam/STAR/ERR878367.sorted.bam -o /scratch/cbrc/analysis/BPepi-22/tmp/stringtie/ERR878367.gtf -p 6 --rf -G /scratch/cbrc/ref/ensembl/human/GRCh37.75/blueprint/Homo_sapiens.GRCh37.75.chr.gtf -v -l BPSTRG

/home/cbrcmod/scratch/modules/out/modulebin/stringtie/1.3.3/bin/stringtie /scratch/cbrc/analysis/BPepi-27/out/bam/hisat2/ERR878367.sorted.bam -o /scratch/cbrc/analysis/BPepi-27/tmp/stringtie/hisat2/ERR878367.gtf -p 8 --rf -G /scratch/cbrc/ref/ensembl/human/GRCh37.75/blueprint/Homo_sapiens.GRCh37.75.chr.gtf -v -l STRG

In case it helps, I noticed that StringTie 1.2.3 on the STAR alignment was able to detect the new 5' exons and the longer 3' UTR exon, along with other transcripts (I included the corresponding track in the screenshot). This is the command I used for it:

stringtie /opt/data3/Projects/BPepi/BPepi-8/out/bam/STAR/ERR878367.sorted.bam -o /opt/data3/Projects/BPepi/BPepi-8/stringtie/MK/ERR878367/first_transcript.gtf -p 4 -G /opt/data2/mak58/blueprint/annotation/Homo_sapiens.GRCh37.70.gtf -v
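The two STAR-based commands quoted above differ in more than the StringTie version. A minimal sketch for listing the option differences between them, using only the Python standard library; the helper names `parse_options` and `diff_options` are illustrative, and the value-detection rule is a simple heuristic rather than StringTie's own argument parser:

```python
import shlex

def parse_options(command):
    """Split a command line into (positional args, {option: value or True})."""
    tokens = shlex.split(command)[1:]  # drop the executable
    positional, options = [], {}
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok.startswith("-"):
            # Heuristic (assumption): an option takes a value when the next
            # token exists and is not itself an option.
            if i + 1 < len(tokens) and not tokens[i + 1].startswith("-"):
                options[tok] = tokens[i + 1]
                i += 2
            else:
                options[tok] = True
                i += 1
        else:
            positional.append(tok)
            i += 1
    return positional, options

def diff_options(opts_a, opts_b):
    """Return {option: (value_in_a, value_in_b)} for options that differ."""
    keys = sorted(set(opts_a) | set(opts_b))
    return {k: (opts_a.get(k), opts_b.get(k))
            for k in keys if opts_a.get(k) != opts_b.get(k)}

# Commands copied from the comment above (STAR alignments only).
cmd_133 = ("/home/cbrcmod/scratch/modules/out/modulebin/stringtie/1.3.3/bin/stringtie "
           "/scratch/cbrc/analysis/BPepi-22/out/bam/STAR/ERR878367.sorted.bam "
           "-o /scratch/cbrc/analysis/BPepi-22/tmp/stringtie/ERR878367.gtf -p 6 --rf "
           "-G /scratch/cbrc/ref/ensembl/human/GRCh37.75/blueprint/Homo_sapiens.GRCh37.75.chr.gtf "
           "-v -l BPSTRG")
cmd_123 = ("stringtie /opt/data3/Projects/BPepi/BPepi-8/out/bam/STAR/ERR878367.sorted.bam "
           "-o /opt/data3/Projects/BPepi/BPepi-8/stringtie/MK/ERR878367/first_transcript.gtf "
           "-p 4 -G /opt/data2/mak58/blueprint/annotation/Homo_sapiens.GRCh37.70.gtf -v")

_, opts_133 = parse_options(cmd_133)
_, opts_123 = parse_options(cmd_123)
diff = diff_options(opts_133, opts_123)
for option, (v133, v123) in diff.items():
    print(f"{option}: 1.3.3={v133!r} 1.2.3={v123!r}")
```

Besides the output path and thread count, the 1.2.3 run omits --rf and -l and points -G at a different annotation release (GRCh37.70 rather than GRCh37.75), so the StringTie version is not the only variable between the two STAR-based assemblies.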

gpertea commented 6 years ago

Is the alignment track shown there from STAR or HISAT2? (The BAM file seems to have the same name in both cases, ERR878367.sorted.bam.) In any case, we cannot provide support for STAR alignments; HISAT2 is the recommended and supported aligner for StringTie. You also did not show the command lines for the aligners, which seem crucial for the issue you raised here, though I wouldn't be able to comment on STAR parameters anyway. It does seem odd that StringTie v1.2.3 was able to find so many more isoforms from the STAR alignments, but I seem to recall that v1.2.3 had a bug that reported assemblies with 0 FPKM, so perhaps it was also generating a lot of seemingly low-expression isoforms, while v1.3.3 is probably better at filtering out spurious alignments (and thus does not show potentially low-expression, low-probability assemblies). It's hard to answer your question without comparing the accuracy and validity of the alignments produced by HISAT2 and STAR in this case, which is a very important part of the answer.

luigra commented 6 years ago

The alignment results, at least in this region, look very similar.

hisat_star_igv_snapshot

Below are the options I used for hisat2:

--known-splicesite-infile /scratch/cbrc/ref/ensembl/human/GRCh37.75/blueprint/Homo_sapiens.GRCh37.75.chr_ERCC92.ss --rna-strandness RF --downstream-transcriptome-assembly -p 16
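The STAR command line is not shown in this thread. One property that can be checked directly on the STAR output is whether spliced alignments carry the XS strand attribute: the StringTie documentation asks for this tag on spliced alignments, and STAR only adds it when run with --outSAMstrandField intronMotif. Whether this matters for the runs above, which pass --rf, is not established here. A minimal sketch, assuming SAM text records as input (for example, lines produced by samtools view); the function name and the three example records are made up for illustration:

```python
def count_spliced_without_xs(sam_lines):
    """Return (spliced, missing_xs) counts for an iterable of SAM text lines.

    A record counts as spliced when its CIGAR string contains an N operation.
    """
    spliced = 0
    missing_xs = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete SAM record
        cigar = fields[5]
        if "N" not in cigar:
            continue  # unspliced alignment
        spliced += 1
        # Optional tags start at column 12; XS is a single-character tag.
        if not any(tag.startswith("XS:A:") for tag in fields[11:]):
            missing_xs += 1
    return spliced, missing_xs

# Synthetic records (not taken from ERR878367), for illustration only.
example = [
    "@HD\tVN:1.4\tSO:coordinate",
    "r1\t0\t1\t100\t255\t50M1000N50M\t*\t0\t0\t*\t*\tNH:i:1\tXS:A:+",
    "r2\t0\t1\t200\t255\t40M500N60M\t*\t0\t0\t*\t*\tNH:i:1",
    "r3\t0\t1\t300\t255\t100M\t*\t0\t0\t*\t*\tNH:i:1",
]
spliced, missing_xs = count_spliced_without_xs(example)
print(f"spliced alignments: {spliced}, without XS tag: {missing_xs}")
```

To apply this to a real file, the same function could be fed the lines of a SAM stream read from standard input.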

luigra commented 6 years ago

To take this comparison further, I simulated FASTQ reads from the reference transcriptome with RSAT and ran STAR and hisat2 on them. Downstream, I assembled transcriptomes from both sets of alignments without the -G and --rf parameters, then compared each reconstructed transcriptome with the source one using gffcompare. The results are below; they look very similar to me.

| hisat2 | Sensitivity | Precision |
|---|---|---|
| Base level | 66.2 | 95.6 |
| Exon level | 47.3 | 94.0 |
| Intron level | 61.7 | 99.3 |
| Intron chain level | 15.6 | 66.2 |
| Transcript level | 15.8 | 63.4 |
| Locus level | 71.3 | 78.3 |

| STAR | Sensitivity | Precision |
|---|---|---|
| Base level | 66.1 | 96.4 |
| Exon level | 47.2 | 94.7 |
| Intron level | 61.3 | 99.6 |
| Intron chain level | 15.6 | 68.2 |
| Transcript level | 15.9 | 66.1 |
| Locus level | 73.2 | 82.3 |
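To put a number on "very similar", the two summaries can be compared level by level. The sketch below copies the values from the tables above and prints the STAR minus hisat2 differences; the F1 score (the harmonic mean of sensitivity and precision) is an added summary for illustration and is not part of the gffcompare output quoted here:

```python
# Sensitivity and precision copied from the gffcompare results above.
LEVELS = ["Base", "Exon", "Intron", "Intron chain", "Transcript", "Locus"]
HISAT2 = {"Base": (66.2, 95.6), "Exon": (47.3, 94.0), "Intron": (61.7, 99.3),
          "Intron chain": (15.6, 66.2), "Transcript": (15.8, 63.4),
          "Locus": (71.3, 78.3)}
STAR = {"Base": (66.1, 96.4), "Exon": (47.2, 94.7), "Intron": (61.3, 99.6),
        "Intron chain": (15.6, 68.2), "Transcript": (15.9, 66.1),
        "Locus": (73.2, 82.3)}

def f1(sensitivity, precision):
    """Harmonic mean of sensitivity and precision (added summary, see lead-in)."""
    return 2 * sensitivity * precision / (sensitivity + precision)

def compare(a, b):
    """Return rows of (level, d_sensitivity, d_precision, f1_a, f1_b), with d = b - a."""
    rows = []
    for level in LEVELS:
        (sens_a, prec_a), (sens_b, prec_b) = a[level], b[level]
        rows.append((level,
                     round(sens_b - sens_a, 1),
                     round(prec_b - prec_a, 1),
                     round(f1(sens_a, prec_a), 1),
                     round(f1(sens_b, prec_b), 1)))
    return rows

rows = compare(HISAT2, STAR)
print(f"{'Level':<14}{'dSens':>7}{'dPrec':>7}{'F1 hisat2':>11}{'F1 STAR':>9}")
for level, d_sens, d_prec, f1_hisat2, f1_star in rows:
    print(f"{level:<14}{d_sens:>7}{d_prec:>7}{f1_hisat2:>11}{f1_star:>9}")
```

On these numbers, sensitivity differs by at most 1.9 points (locus level), while STAR's precision is higher at every level, by 0.3 to 4.0 points.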