gpertea / stringtie

Transcript assembly and quantification for RNA-Seq
MIT License

Expected ballgown output when not using '-e' #349

Closed gabepen closed 2 years ago

gabepen commented 2 years ago

I am trying to identify novel transcripts in differential expression data, and my runs produce GTF files containing STRG-prefixed entries that are of interest. However, I can't find any of these transcripts in the ballgown tables. I would expect that behavior with the '-e' flag, but since I have been explicitly leaving it out of the stringtie command, I am confused.

gpertea commented 2 years ago

The -e flag restricts expression estimation to the transcripts in the file given with the -G option, and the -B/-b ballgown tables likewise count the reads mapped to those same transcripts. So if you put the novel STRG transcripts, or any other transcripts, in the -G file, you will get ballgown tables for those transcripts.

gabepen commented 2 years ago

So even when not using the -e flag, I still need to add the novel transcripts to a GTF and rerun the ballgown table generation with that GTF?

gpertea commented 2 years ago

Yes, the ballgown tables were really intended for use with -e, in a pipeline like the one described here: http://ccb.jhu.edu/software/stringtie/index.shtml?t=manual#de

Notice how, in the pipeline described there, the novel transcripts resulting from the assembly of each sample are collected and merged into a common, larger set of transcripts, which is then used just for expression estimation by re-running StringTie with -eB for each sample.

You can apply the same principle even if you only have one sample: merge the assembly output of that sample with a reference set of transcripts (using stringtie --merge or other methods), then assess the abundance of the resulting merged set (novel + known transcripts) by re-running stringtie with -eB on it, which should be quite fast.
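
For a single sample, the three steps described above might look roughly like the sketch below. The file names (`sample1.bam`, `reference.gtf`, the `ballgown/` directory) and the `run` dry-run helper are placeholders for illustration, not anything from this thread. Only the options already discussed (`-G`, `-o`, `--merge`, `-e`, `-B`) are used.

```shell
#!/usr/bin/env bash
# Sketch of an assemble -> merge -> re-estimate run for one sample.
# All file names are hypothetical placeholders.
set -euo pipefail

BAM="sample1.bam"      # hypothetical coordinate-sorted alignment file
REF="reference.gtf"    # hypothetical reference annotation
OUT="sample1"          # hypothetical output prefix

# Hypothetical helper: print each command by default;
# set DRY_RUN=0 to actually execute the commands.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# 1. Assemble without -e, so novel (STRG-prefixed) transcripts can be reported.
run stringtie "$BAM" -G "$REF" -o "$OUT.assembled.gtf"

# 2. Merge the assembly with the reference into one novel+known transcript set.
run stringtie --merge -G "$REF" -o "$OUT.merged.gtf" "$OUT.assembled.gtf"

# 3. Re-estimate abundances against the merged set and write the ballgown tables.
run stringtie "$BAM" -e -B -G "$OUT.merged.gtf" -o "ballgown/$OUT/$OUT.gtf"
```

With `-B`, the ballgown table files are written next to the GTF named by `-o`, so giving each sample its own directory keeps the tables separate. Note that `stringtie --merge` assigns its own IDs to merged transcripts (an MSTRG prefix by default), so the novel transcripts may not keep their original STRG names in the final tables.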