gpertea / stringtie

Transcript assembly and quantification for RNA-Seq
MIT License
365 stars 76 forks source link

How can Stringtie decide the strand of a transcript assembled by a standard RNA library? #332

Closed ZHIDIHUAYUAN closed 3 years ago

ZHIDIHUAYUAN commented 3 years ago

Hi, I am new to the RNA-seq. From my opinion, the strandness is lost after both strands of cDNA are synthesized. My question is : (1) For a typical RNAseq, is the strand info from Stringtie relaible? (2) Should we just focus on the interval of transcript created by Stringtie and ignore the strand? In that case, we should search for ORF on both strand based on genome and decide the coding potential.

Thanks

gpertea commented 3 years ago

The strand info for alignments of reads from unstranded RNA-Seq libraries can be recovered based on two sources of evidence - the primary one being the genomic splice sites in the case of spliced alignments (intron in the alignment), which is encoded by the aligner into the XS tag for such alignments (that's why the large majority of transcript assemblies lacking strand info are single exon) . A secondary source can be the overlap with reference annotation transcripts (gffcompare can use that). So the answers to your queries can be:

(1) generally yes, but only in so far as the aligner provided strand info (the XS tag) is reliable (2) yes, that's how it is usually done - you can use programs like TransDecoder or EviGene to help with assessing the ORFs.

Be aware that those single-exon transcripts lacking strand info, especially those with low estimated coverage, can be transcriptional noise or alignment artifacts (i.e. not "real" transcripts). Some of those could be alignments to processed pseudogenes.

ZHIDIHUAYUAN commented 3 years ago

OK. Thank you very much!

ZHIDIHUAYUAN commented 3 years ago

Hi, Thank you very much for apply. I still have another question. First, I used stringtie(v2.1.5) to separately assembly the transcripts for three samples. stringtie -l SAMPLE1 -p 30 -G /GRCh38.p13/GRCh38_latest_genomic.gff -A sample1_gene_abund.tab -o sample1_stringtie.gtf sample1.sortedByCoord.out.bam

Then, I used the gffcompare to merge the three gtfs. gffcompare -o gffcompare -p GFFCOMPARE sample1_stringtie.gtf sample3_stringtie.gtf sample3_stringtie.gtf

At last, I used the stringtie to calculate the TPM using the merged gtf. stringtie -e -B -p 10 -G gffcompare.combined.gtf -o sample1_abun.gtf sample1.sortedByCoord.out.bam

But I found that the TPM differes very much. In gffcompare.tracking: GFFCOMPARE_00000001 XLOC_000002 - u q1:SAMPLE1.29|SAMPLE1.29.1|2|0.715953|2.892788|41.071835|4412

But in the sample1_abun.gtf: gene_id "XLOC_000002"; transcript_id "GFFCOMPARE_00000001"; cov "4.525456"; FPKM "0.099625"; TPM "0.359771";

Which one should I use? If there is no such transcript in the first assembly, its expression value should be 0, then even when the combined gtf (contain more transcripts) is used for quantification, the TPM should not be changed so much.

Am I wrong in some places? Hope to hear your advice.

Best wishes

Mengyu Zhou