Stringtie outputs mainly STRG ids rather than gene ids

laurenzane commented 2 years ago

Previously, our team was not able to output a gene count matrix from StringTie. I was able to run StringTie on Andromeda and output a gene count matrix, but many of the row names are STRG ids rather than gene ids. This makes it difficult to run GOSeq for downstream functional analysis, because GOSeq can only assign function using gene ids from an annotation. This has been particularly problematic for the Pocillopora acuta dataset.
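To put a number on "many of the rownames", a minimal Python sketch along these lines could tally the row names in the count matrix. It assumes the matrix is a CSV with the gene id in the first column and one header row; the function name, the example ids, and the sample names are made up for illustration, and the prefix list (STRG for assembly runs, MSTRG for `stringtie --merge` output) may need adjusting for our files.

```python
import csv
import io

# Prefixes of IDs that StringTie assigns itself (assumption: "STRG." from
# assembly runs and "MSTRG." from stringtie --merge).
ASSIGNED_PREFIXES = ("STRG.", "MSTRG.")


def summarize_gene_ids(handle):
    """Count StringTie-assigned vs. annotated row names in a count matrix."""
    reader = csv.reader(handle)
    next(reader)  # skip the header row (gene_id, sample_1, ...)
    counts = {"assigned": 0, "annotated": 0}
    for row in reader:
        if not row:
            continue  # ignore blank lines
        key = "assigned" if row[0].startswith(ASSIGNED_PREFIXES) else "annotated"
        counts[key] += 1
    return counts


# Made-up example matrix; real input would be the CSV produced for the dataset.
example = io.StringIO(
    "gene_id,sample_1,sample_2\n"
    "STRG.1,10,12\n"
    "hypothetical_gene_0001,5,7\n"
    "STRG.2,0,3\n"
)
print(summarize_gene_ids(example))  # {'assigned': 2, 'annotated': 1}
```

For a real file, the same function can be called on `open("gene_count_matrix.csv", newline="")`, where the file name is whatever the count step actually wrote.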

We have determined that StringTie generates STRG ids for novel transcripts or transcripts not aligning with the genome. This should not be occurring because we are running StringTie with the -G and -e options, which the StringTie documentation describes as follows:

Reference annotation transcripts (-G): A reference annotation file in GTF or GFF3 format can be provided to StringTie using the -G option which can be used as 'guides' for the assembly process and help improve the transcript structure recovery for those transcripts.

NOTE: we highly recommend that you provide annotation if you are analyzing a genome that is well annotated, such as human, mouse, or other model organisms.

Note that when a reference transcript is fully covered by input read alignments, the original transcript ID from the reference annotation file will be shown in StringTie's output file in the reference_id GTF attribute for that assembled transcript. Output transcripts lacking the reference_id attribute can be considered "novel" transcript structures with respect to the provided reference annotation.
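The reference_id rule quoted above can be checked directly on a StringTie output GTF. The following is a rough Python sketch, not a tested part of our pipeline: it assumes standard nine-column, tab-separated GTF lines with `key "value";` attributes, and the function name and the example transcript and gene ids are hypothetical.

```python
import re

# GTF attributes look like: key "value"; key "value";
ATTRIBUTE_PATTERN = re.compile(r'(\S+) "([^"]*)"')


def count_reference_matches(gtf_lines):
    """Return (transcripts with reference_id, transcripts without it)."""
    matched = novel = 0
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue  # only transcript records are counted, not exons
        attributes = dict(ATTRIBUTE_PATTERN.findall(fields[8]))
        if "reference_id" in attributes:
            matched += 1
        else:
            novel += 1
    return matched, novel


# Made-up records for illustration only.
example_lines = [
    "# StringTie output\n",
    'scaffold_1\tStringTie\ttranscript\t100\t900\t1000\t+\t.\t'
    'gene_id "STRG.1"; transcript_id "STRG.1.1"; '
    'reference_id "hypothetical_tx_0001"; ref_gene_id "hypothetical_gene_0001";\n',
    'scaffold_1\tStringTie\texon\t100\t900\t1000\t+\t.\t'
    'gene_id "STRG.1"; transcript_id "STRG.1.1";\n',
    'scaffold_2\tStringTie\ttranscript\t50\t400\t1000\t-\t.\t'
    'gene_id "STRG.2"; transcript_id "STRG.2.1";\n',
]
print(count_reference_matches(example_lines))  # (1, 1)
```

A large second number on a real output file would suggest that most transcripts are being treated as novel relative to the annotation passed with -G.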

Expression estimation mode (-e): When the -e option is used, the reference annotation file -G is a required input and StringTie will not attempt to assemble the input read alignments but instead it will only estimate the expression levels of the "reference" transcripts provided in the -G file.

With this option, no "novel" transcript assemblies (isoforms) will be produced, and read alignments not overlapping any of the given reference transcripts will be ignored, which may provide a considerable speed boost when the given set of reference transcripts is limited to a set of target genes for example.
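For reference, the combination of options described above could be assembled as in the Python sketch below. It only builds and prints the command line and does not run StringTie. The helper name and all file names are placeholders rather than our actual paths; the flags used (-e, -G, -o, -A) are the ones discussed in this thread plus -o for the output GTF.

```python
import shlex


def build_stringtie_command(bam, annotation, out_gtf, gene_abundance=None):
    """Build the argument list for StringTie in expression estimation mode."""
    # -e requires -G: only transcripts from the annotation are quantified.
    command = ["stringtie", bam, "-e", "-G", annotation, "-o", out_gtf]
    if gene_abundance is not None:
        # -A writes a per-gene abundance table to the given file.
        command += ["-A", gene_abundance]
    return command


# Placeholder file names; substitute the real BAM and annotation paths.
argv = build_stringtie_command(
    "sample_1.bam", "reference.gff3", "sample_1.gtf", "sample_1_gene_abund.tab"
)
print(shlex.join(argv))
```

Passing the resulting list to `subprocess.run(argv, check=True)` would execute it on a machine where StringTie is installed; keeping the command as a list avoids quoting problems with unusual file names.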

@laurenzane will rerun Stringtie with the -A option to better understand how many gene ids are being recovered from the RNA-seq dataset.