mhammell-laboratory / TEtranscripts

A package for including transposable elements in differential enrichment analysis of sequencing datasets.
http://hammelllab.labsites.cshl.edu/software/#TEtranscripts
GNU General Public License v3.0

quantify in the family level #156

Closed olivertam closed 7 months ago

olivertam commented 8 months ago

Reposted from mhammell-laboratory/TElocal#33

Hi, TEtranscripts can quantify TE expression at the family level using case and control groups. However, my samples are divided into several embryonic stages, which does not suit the case-control design of TEtranscripts.

Can I modify my TE GTF file (changing the TE level of each locus to the class level) to make this work? For example, the original GTF looks like this:

ptg001153l  RepeatMasker    exon    9   1239    6123    -   .   gene_id "Denovo_TE00000001"; transcript_id "Denovo_TE00000001"; family_id "rnd-6_family-1051 4089 5314"; class_id "LTR/Gypsy
ptg001153l  RepeatMasker    exon    3165    3188    508 -   .   gene_id "Denovo_TE00000002"; transcript_id "Denovo_TE00000002"; family_id "rnd-1_family-17 869 890"; class_id "DNA
ptg001153l  RepeatMasker    exon    3189    3423    1285    +   .   gene_id "Denovo_TE00000003"; transcript_id "Denovo_TE00000003"; family_id "rnd-1_family-155 1 234"; class_id "SINE/tRNA-Deu
ptg001153l  RepeatMasker    exon    3425    3587    1006    +   .   gene_id "Denovo_TE00000004"; transcript_id "Denovo_TE00000004"; family_id "rnd-6_family-1292 2104 2267"; class_id "SINE/tRNA-Deu
ptg001153l  RepeatMasker    exon    3892    4386    1410    +   .   gene_id "Denovo_TE00000005"; transcript_id "Denovo_TE00000005"; family_id "rnd-1_family-234 658 1504"; class_id "LINE/L2

The modified version looks like this (the 3rd and 4th entries are merged under one gene_id, "Denovo_TE00000003"):

ptg001153l  RepeatMasker    exon    9   1239    6123    -   .   gene_id "Denovo_TE00000001"; transcript_id "Denovo_TE00000001"; family_id "rnd-6_family-1051 4089 5314"; class_id "LTR/Gypsy
ptg001153l  RepeatMasker    exon    3165    3188    508 -   .   gene_id "Denovo_TE00000002"; transcript_id "Denovo_TE00000002"; family_id "rnd-1_family-17 869 890"; class_id "DNA
ptg001153l  RepeatMasker    exon    3189    3423    1285    +   .   gene_id "Denovo_TE00000003"; transcript_id "Denovo_TE00000003"; family_id "rnd-1_family-155 1 234"; class_id "SINE/tRNA-Deu
ptg001153l  RepeatMasker    exon    3425    3587    1006    +   .   gene_id "Denovo_TE00000003"; transcript_id "Denovo_TE00000003"; family_id "rnd-6_family-1292 2104 2267"; class_id "SINE/tRNA-Deu
ptg001153l  RepeatMasker    exon    3892    4386    1410    +   .   gene_id "Denovo_TE00000004"; transcript_id "Denovo_TE00000004"; family_id "rnd-1_family-234 658 1504"; class_id "LINE/L2

olivertam commented 8 months ago

If you don't want to do case-control with TEtranscripts, you can just quantify each library independently with TEcount (part of TEtranscripts), and then combine/compare developmental stages as you wish.
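As a rough sketch of the "combine" step, the per-library count tables could be merged into one matrix for downstream comparison. This assumes each TEcount output is a tab-separated file with a header row, a feature name in the first column and a count in the second. The function names and the file names in the usage note are hypothetical, not part of TEtranscripts.

```python
import csv
from pathlib import Path


def read_count_table(path):
    """Read one two-column, tab-separated count table.

    Assumption: the first row is a header whose second field names the sample.
    Returns (sample_name, {feature: count}).
    """
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader)
        # Fall back to the file name if the header has no sample column.
        sample = header[1] if len(header) > 1 else Path(path).stem
        counts = {row[0]: float(row[1]) for row in reader if len(row) >= 2}
    return sample, counts


def merge_count_tables(paths):
    """Merge several count tables into {feature: [count per sample]}.

    Features missing from a library are filled with 0.
    Returns (sample_names, matrix), with columns in the order of `paths`.
    """
    samples, tables = [], []
    for path in paths:
        sample, counts = read_count_table(path)
        samples.append(sample)
        tables.append(counts)
    features = sorted(set().union(*tables))
    matrix = {f: [t.get(f, 0) for t in tables] for f in features}
    return samples, matrix
```

For example, `merge_count_tables(["stage1.cntTable", "stage2.cntTable"])` would return the sample names and a per-feature list of counts, which can then be written out or passed to whichever differential analysis fits the developmental-stage design.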

TEtranscripts/TEcount aggregates using the gene_id name, and largely ignores the transcript_id name (though it prefers it to be unique). Thus, if you want to merge different entries into the same gene_id, that is possible, but you would want to make sure the transcript_id is unique. We don't typically recommend aggregating at the class_id level (you lose a lot of information), but in less well-annotated genomes, perhaps aggregating at the family_id level is possible. In that case, we recommend assigning the family_id value to the gene_id, but keeping the rest (transcript_id, family_id & class_id) unchanged.
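The family-level relabeling described here could be scripted along these lines. This is a minimal sketch, assuming each TE line carries `gene_id "..."` and `family_id "..."` attributes in the usual GTF form; the function names and file paths are hypothetical. It copies the family_id value into gene_id verbatim and leaves transcript_id, family_id and class_id untouched, so transcript_id stays unique per locus as long as it was unique in the input.

```python
import re

GENE_ID = re.compile(r'\bgene_id "[^"]*"')
FAMILY_ID = re.compile(r'\bfamily_id "([^"]*)"')


def family_level_line(line):
    """Return a GTF line with gene_id set to its family_id value.

    Comment lines and lines without a family_id attribute are returned unchanged.
    """
    if line.startswith("#"):
        return line
    match = FAMILY_ID.search(line)
    if match is None:
        return line
    family = match.group(1)
    # A function replacement avoids backslash handling in the family name.
    return GENE_ID.sub(lambda _: f'gene_id "{family}"', line, count=1)


def rewrite_gtf(in_path, out_path):
    """Write a family-level copy of the GTF at in_path to out_path."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(family_level_line(line))
```

Calling `rewrite_gtf("TE_annotation.gtf", "TE_annotation.family.gtf")` would produce a copy in which all loci of a family share one gene_id, which is the name TEcount aggregates on.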

github-actions[bot] commented 7 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days