Open rkendar opened 6 years ago
Not sure what's happening there, in the GTF file the transcripts have proper gene_id which is missing from the gene matrix ? I suppose this could be possible if those transcripts were somehow attributed to another gene_id which perhaps swallowed (merged) the original gene id you are looking up? Perhaps you can send me a GTF as an example where prepDE.py does this so I can take a closer look and figure out what's going on.
Hi Geo,
After I ran StringTie merge, I did gffcompare and it gave me .tmap and annotated.gft. So there is a disagreement in geneID between .tmap and .annotated .gtf (which will be used as gene_id in gene count matrix.
For example, there are 3 geneIDs belong to ESR2 based on .tmap. In ._tmap$qry_geneid is listed:
But, in gene count matrix I cannot find "ENSG00000140009" which is the EnsemblID of ESR2. After I take a look in .annotated .gtf, the geneID that named ESR2 in .tmap is called "MSTRG.20835" instead of its EnsembleID. And I can found MSTRG.20835 in the gene count matrix.
My understanding is StringTie will give "MSTRG" geneID if it is not perfectly match with ref_gene. Why StringTie give "MSTRG" instead of "ENSGxxx" in geneID even though it has perfect matched ("=" class code)? Please correct me if I am wrong.
I have attached the .annotated.gtf and .tmap so you can investigate more. tmap_&_annotated_gtf.txt
Thank you.
Hello,
I am using the rn6.v90 reference and having the same issue. The reference has the full ~30,000 Ensembl gene Ids, but the merged gtf that comes out of stringtie And therefore the RSEM output) only has 5,019 unique Ensembl gene IDs. I checked my dataset to see if i could replicate what @rkendar hypothesized: "My understanding is StringTie will give "MSTRG" geneID if it is not perfectly match with ref_gene.", but this number did not match 5,019. I also saw examples where: "StringTie give "MSTRG" instead of "ENSGxxx" in geneID even though it has perfect matched." Can you please clarify how stringtie is assigning gene names?
Thank you Harry
Hi Developer,
I am doing differential expression from whole RNA-seq data. My pipeline is HISAT-Stringtie-Gffcompare. I have assembled and merge the .gtf with Stringtie, also I have extracted the count matrix (genes and transcripts) using prepDE.py.
Now, for example, I am looking for ESR1 expression, I found 10 transcripts that belong to ESR1 in transcripts count matrix. However, when I check it in gene count matrix I cannot find the ESR1 ("ENSG00000091831"). Is it normal to find the transcript count with no gene count that belong to the same gene? Need your suggestion.