gpertea / stringtie

Transcript assembly and quantification for RNA-Seq
MIT License
380 stars 78 forks source link

Gene A has transcript counts but no gene counts #176

Open rkendar opened 6 years ago

rkendar commented 6 years ago

Hi Developer,

I am doing differential expression from whole RNA-seq data. My pipeline is HISAT-Stringtie-Gffcompare. I have assembled and merge the .gtf with Stringtie, also I have extracted the count matrix (genes and transcripts) using prepDE.py.

Now, for example, I am looking for ESR1 expression, I found 10 transcripts that belong to ESR1 in transcripts count matrix. However, when I check it in gene count matrix I cannot find the ESR1 ("ENSG00000091831"). Is it normal to find the transcript count with no gene count that belong to the same gene? Need your suggestion.

gpertea commented 6 years ago

Not sure what's happening there, in the GTF file the transcripts have proper gene_id which is missing from the gene matrix ? I suppose this could be possible if those transcripts were somehow attributed to another gene_id which perhaps swallowed (merged) the original gene id you are looking up? Perhaps you can send me a GTF as an example where prepDE.py does this so I can take a closer look and figure out what's going on.

rkendar commented 6 years ago

Hi Geo,

After I ran StringTie merge, I did gffcompare and it gave me .tmap and annotated.gft. So there is a disagreement in geneID between .tmap and .annotated .gtf (which will be used as gene_id in gene count matrix.

For example, there are 3 geneIDs belong to ESR2 based on .tmap. In ._tmap$qry_geneid is listed:

But, in gene count matrix I cannot find "ENSG00000140009" which is the EnsemblID of ESR2. After I take a look in .annotated .gtf, the geneID that named ESR2 in .tmap is called "MSTRG.20835" instead of its EnsembleID. And I can found MSTRG.20835 in the gene count matrix.

My understanding is StringTie will give "MSTRG" geneID if it is not perfectly match with ref_gene. Why StringTie give "MSTRG" instead of "ENSGxxx" in geneID even though it has perfect matched ("=" class code)? Please correct me if I am wrong.

I have attached the .annotated.gtf and .tmap so you can investigate more. tmap_&_annotated_gtf.txt

Thank you.

hsmith9002 commented 6 years ago

Hello,

I am using the rn6.v90 reference and having the same issue. The reference has the full ~30,000 Ensembl gene Ids, but the merged gtf that comes out of stringtie And therefore the RSEM output) only has 5,019 unique Ensembl gene IDs. I checked my dataset to see if i could replicate what @rkendar hypothesized: "My understanding is StringTie will give "MSTRG" geneID if it is not perfectly match with ref_gene.", but this number did not match 5,019. I also saw examples where: "StringTie give "MSTRG" instead of "ENSGxxx" in geneID even though it has perfect matched." Can you please clarify how stringtie is assigning gene names?

Thank you Harry