results format based on "transcript_id"

mhammell-laboratory / TEtranscripts

A package for including transposable elements in differential enrichment analysis of sequencing datasets.

http://hammelllab.labsites.cshl.edu/software/#TEtranscripts

GNU General Public License v3.0

results format based on "transcript_id" #20

Closed vasilislenis closed 6 years ago

vasilislenis commented 6 years ago

Hi Oliver,

First of all, thank you very much for your very informative answer about the statistics results. I am sorry to bother you again, but I was wondering whether it is possible to get output based on the "transcript_id", and not only on the "gene_id" and "family_id". While examining the results, I found that TEtranscripts summarizes them based on the family information in the GTF files.

Is there an option to get the counts for each member of the family, or should I change the GTF file?

Thank you very much in advance, Vasilis.

olivertam commented 6 years ago

Hi Vasilis,

TEtranscripts typically summarizes the read count based on the "gene_id" (but also includes the "class_id" and "family_id" in the name so that you can subsequently group by class & family if required).
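As a rough illustration of the "group by class & family" step, here is a minimal Python sketch. It assumes the TE names in the count table are colon-separated as "gene_id:family_id:class_id" (for example "L1Md_A:L1:LINE"); that layout, the function name `group_te_counts`, and the example rows are assumptions made for illustration, so check them against your own output file.

```python
def group_te_counts(rows, level="family"):
    """Sum per-sample counts of TE rows by family or by class.

    rows: iterable of (name, [count_per_sample, ...]) pairs.
    ASSUMPTION: TE names look like "gene_id:family_id:class_id".
    Names without exactly three colon-separated fields (e.g. ordinary
    genes) are skipped.
    """
    index = {"family": 1, "class": 2}[level]
    totals = {}
    for name, counts in rows:
        fields = name.split(":")
        if len(fields) != 3:
            continue  # not a TE row under the assumed naming scheme
        key = fields[index]
        if key not in totals:
            totals[key] = [0] * len(counts)
        # Add this row's counts to the running total, sample by sample.
        totals[key] = [a + b for a, b in zip(totals[key], counts)]
    return totals


# Made-up rows, two samples each (illustrative values only).
example_rows = [
    ("L1Md_A:L1:LINE", [10, 20]),
    ("L1Md_T:L1:LINE", [5, 7]),
    ("B1_Mm:Alu:SINE", [3, 4]),
    ("ENSMUSG00000000001", [100, 120]),
]

by_family = group_te_counts(example_rows, "family")
by_class = group_te_counts(example_rows, "class")
```

Summing raw counts like this is only meant for grouping and plotting; it is not a substitute for the statistics computed on the original table.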

It is theoretically possible to summarize them based on the "transcript_id" (a.k.a. individual instances of the transposable element). However, we do not advise it, as we don't think that the current methods and the technology can give a reliable/accurate estimation. See #13 for a previous discussion on this topic (as well as a proposed approach to summarize based on "transcript_id"). Thanks.

vasilislenis commented 6 years ago

Hi Oliver,

Thank you for the reply. I understand that this would alter the statistical calculations, but I was wondering whether I could rerun TEtranscripts after replacing the gene names with the transcript names, only for the family that I already found significant in the first run, just to get the counts for each individual copy for plotting. For example, say I found a significant up-regulation of a family. I would like to make a heatmap to see whether this holds for all of its individual copies. By replacing the gene IDs with the transcript IDs for this family, I can keep the counts of the individual copies and plot them. I don't mind whether the p-values are reliable, since I already found the family to be significant in the first run.

Does it make sense? Vasilis.

olivertam commented 6 years ago

Yes, you can change the gene_id to the corresponding transcript_id in the TE GTF file, and you will then be able to count based on the individual TE instances in the genome. The caveat is that the results might not be reliable at that level due to technical limitations, and thus might not reflect the actual expression of the TE at that particular genomic position.
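
One way to make that GTF change could look like the Python sketch below. It is not part of TEtranscripts; the helper names (`get_attribute`, `use_transcript_as_gene`), the optional `family` filter, and the example lines are hypothetical. It assumes each TE GTF line carries `gene_id`, `transcript_id`, and `family_id` attributes in the usual `key "value";` form.

```python
import re


def get_attribute(line, key):
    """Return the value of a GTF attribute such as family_id, or None."""
    match = re.search(rf'\b{re.escape(key)} "([^"]*)"', line)
    return match.group(1) if match else None


def use_transcript_as_gene(line, family=None):
    """Return the GTF line with its gene_id value replaced by its transcript_id.

    If `family` is given, only lines whose family_id equals it are changed.
    Comment lines and lines missing either attribute are returned unchanged.
    """
    if line.startswith("#"):
        return line
    gene = re.search(r'\bgene_id "([^"]*)"', line)
    transcript = get_attribute(line, "transcript_id")
    if gene is None or transcript is None:
        return line
    if family is not None and get_attribute(line, "family_id") != family:
        return line
    # Splice the transcript_id in place of the old gene_id value only.
    return line[:gene.start(1)] + transcript + line[gene.end(1):]


# Illustrative lines; coordinates and names are made up for the example.
l1_line = ('chr1\tmm10_rmsk\texon\t3000001\t3000097\t.\t-\t.\t'
           'gene_id "L1_Mus3"; transcript_id "L1_Mus3_dup1"; '
           'family_id "L1"; class_id "LINE";')
alu_line = ('chr1\tmm10_rmsk\texon\t3001723\t3001832\t.\t+\t.\t'
            'gene_id "B1_Mm"; transcript_id "B1_Mm_dup1"; '
            'family_id "Alu"; class_id "SINE";')

new_l1 = use_transcript_as_gene(l1_line, family="L1")
new_alu = use_transcript_as_gene(alu_line, family="L1")
```

Applying the function to every line of a copy of the TE GTF, and writing the result to a new file, leaves the original annotation untouched for the family-level run. The counts from the modified file carry the caveat above and are best used for plotting only.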