Closed surendrk closed 1 year ago
hmmm, is there some pattern to the gene IDs that are being lost perhaps?
Yes, prepDE.py produces gene labels that combine the contig/locus ID with the gene name (for example: MSTRG.9937|gmcl1 or LOC117742121|LOC117742121). I will look into it in greater detail this week.
Hmmm, maybe something relating to how the gene names are being parsed with a regex then?
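One way to check the parsing idea is to split the prepDE.py labels on `|` and see which genes still cannot be matched in the other matrix. This is only a minimal sketch: the `ID|name` format is taken from the examples above, while the function names and the assumption that the other matrix is keyed by a bare ID or a bare gene name are hypothetical and not taken from either script.

```python
def split_gene_label(label):
    """Split an 'ID|name' label into (gene_id, gene_name).

    gene_name is None when the label has no '|' separator.
    """
    gene_id, sep, gene_name = label.partition("|")
    return gene_id, (gene_name if sep else None)


def unmatched_labels(prepde_labels, matrix_labels):
    """Return prepDE-style labels with no counterpart in matrix_labels.

    A label counts as matched if the full label, its ID part, or its
    name part appears among matrix_labels (an assumption about how the
    second matrix is keyed).
    """
    matrix_keys = set(matrix_labels)
    unmatched = []
    for label in prepde_labels:
        gene_id, gene_name = split_gene_label(label)
        if label in matrix_keys or gene_id in matrix_keys or gene_name in matrix_keys:
            continue
        unmatched.append(label)
    return unmatched


# Toy labels: the first two are the examples quoted in this thread,
# the third is made up for illustration.
prepde = ["MSTRG.9937|gmcl1", "LOC117742121|LOC117742121", "MSTRG.1"]
matrix = ["gmcl1", "LOC117742121"]
print(unmatched_labels(prepde, matrix))
```

If most of the ~12K missing genes disappear from the unmatched list after this kind of normalization, the difference is likely in how the labels are parsed rather than in the expression values themselves.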
Hi, StringTie provides a Python script (prepDE.py) that produces raw counts for DESeq2 and edgeR at both the transcript and gene level. When TPM/FPKM values are extracted using 'stringtie_expression_matrix.pl', the numbers match perfectly at the transcript level; at the gene level, however, the numbers do not match and many genes seem to be missing (~12K genes are missing compared with the gene count matrix obtained from prepDE.py). I am using StringTie version 2.1.5. Is it because of paralogs? The Atlantic salmon genome is also polyploid, so the number of duplicated genes (paralogs) is also high.
Thanks, SK