griffithlab / rnabio.org

website for the rnaseq course
http://rnabio.org/
MIT License

StringTie prepDE.py and stringtie_expression_matrix.pl #51

Closed surendrk closed 1 year ago

surendrk commented 3 years ago

Hi, StringTie provides a Python script (prepDE.py) that extracts raw counts for use with DESeq2 and edgeR at both the transcript and gene level. When I extract TPM / FPKM values using 'stringtie_expression_matrix.pl', the numbers match perfectly at the transcript level; at the gene level, however, they do not match, and many genes seem to be missing (~12K genes are missing compared with the gene count matrix obtained from prepDE.py). I am using StringTie version 2.1.5. Could this be because of paralogs? The Atlantic salmon genome is also polyploid, so the number of duplicated genes or paralogs is high.

Thanks, SK

malachig commented 3 years ago

hmmm, is there some pattern to the gene IDs that are being lost perhaps?

surendrk commented 3 years ago

Yes, prepDE.py produces gene labels that combine the contig/locus ID with the gene name (for example: MSTRG.9937|gmcl1 or LOC117742121|LOC117742121). I am looking into it in greater detail this week.

malachig commented 3 years ago

Hmmm, maybe something relating to how the gene names are being parsed with a regex then?
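
One way to test the parsing idea is to split the prepDE.py-style labels on `|` and check which ID parts are absent from the other matrix. The sketch below is only an illustration: the helper names `split_gene_label` and `missing_genes` and the example IDs in `matrix_ids` are hypothetical, and it assumes the gene-level matrix from stringtie_expression_matrix.pl is keyed by the part before the `|`, which this thread has not confirmed.

```python
def split_gene_label(label):
    """Split a 'gene_id|gene_name' label; the name is None when there is no '|'."""
    gene_id, sep, gene_name = label.partition("|")
    return gene_id, (gene_name if sep else None)


def missing_genes(prepde_labels, matrix_ids):
    """Return the prepDE-style labels whose ID part is absent from matrix_ids."""
    known = set(matrix_ids)
    return [lab for lab in prepde_labels if split_gene_label(lab)[0] not in known]


# Labels in the format reported above, plus one label without a gene name.
prepde_labels = ["MSTRG.9937|gmcl1", "LOC117742121|LOC117742121", "MSTRG.1"]
# Hypothetical gene IDs taken from the other expression matrix (assumption).
matrix_ids = ["MSTRG.9937", "MSTRG.1"]

print(missing_genes(prepde_labels, matrix_ids))
```

If the labels that come back share a pattern (for example, all `LOC` IDs or all labels with a gene name attached), that would point at the ID parsing rather than at paralogs.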