mhammell-laboratory / TElocal

A package for quantifying transposable elements at a locus level for RNAseq datasets.
GNU General Public License v3.0

quantify in the family level #33

Closed AlisaGU closed 8 months ago

AlisaGU commented 8 months ago

Hi, TEtranscripts can quantify TE expression at the family level using case and control groups. However, my samples are divided into several embryonic stages, so the case-control design of TEtranscripts does not suit them.

Can I make this work by modifying my TE GTF file (changing the TE level of each locus to the class level)? For example, the original GTF looks like this:

ptg001153l  RepeatMasker    exon    9   1239    6123    -   .   gene_id "Denovo_TE00000001"; transcript_id "Denovo_TE00000001"; family_id "rnd-6_family-1051 4089 5314"; class_id "LTR/Gypsy";
ptg001153l  RepeatMasker    exon    3165    3188    508 -   .   gene_id "Denovo_TE00000002"; transcript_id "Denovo_TE00000002"; family_id "rnd-1_family-17 869 890"; class_id "DNA";
ptg001153l  RepeatMasker    exon    3189    3423    1285    +   .   gene_id "Denovo_TE00000003"; transcript_id "Denovo_TE00000003"; family_id "rnd-1_family-155 1 234"; class_id "SINE/tRNA-Deu";
ptg001153l  RepeatMasker    exon    3425    3587    1006    +   .   gene_id "Denovo_TE00000004"; transcript_id "Denovo_TE00000004"; family_id "rnd-6_family-1292 2104 2267"; class_id "SINE/tRNA-Deu";
ptg001153l  RepeatMasker    exon    3892    4386    1410    +   .   gene_id "Denovo_TE00000005"; transcript_id "Denovo_TE00000005"; family_id "rnd-1_family-234 658 1504"; class_id "LINE/L2";

The modified version looks like this (the 3rd and 4th entries are merged into one gene_id, "Denovo_TE00000003"):

ptg001153l  RepeatMasker    exon    9   1239    6123    -   .   gene_id "Denovo_TE00000001"; transcript_id "Denovo_TE00000001"; family_id "rnd-6_family-1051 4089 5314"; class_id "LTR/Gypsy";
ptg001153l  RepeatMasker    exon    3165    3188    508 -   .   gene_id "Denovo_TE00000002"; transcript_id "Denovo_TE00000002"; family_id "rnd-1_family-17 869 890"; class_id "DNA";
ptg001153l  RepeatMasker    exon    3189    3423    1285    +   .   gene_id "Denovo_TE00000003"; transcript_id "Denovo_TE00000003"; family_id "rnd-1_family-155 1 234"; class_id "SINE/tRNA-Deu";
ptg001153l  RepeatMasker    exon    3425    3587    1006    +   .   gene_id "Denovo_TE00000003"; transcript_id "Denovo_TE00000003"; family_id "rnd-6_family-1292 2104 2267"; class_id "SINE/tRNA-Deu";
ptg001153l  RepeatMasker    exon    3892    4386    1410    +   .   gene_id "Denovo_TE00000004"; transcript_id "Denovo_TE00000004"; family_id "rnd-1_family-234 658 1504"; class_id "LINE/L2";

olivertam commented 8 months ago

Hi,

Thank you for your interest in the software.

I'm not completely sure what you're hoping to do. Are you hoping to quantify libraries at the locus rather than subfamily level? Or are you hoping to aggregate by class_id instead?

If you don't want a case-control comparison with TEtranscripts, you can quantify each library independently with TEcount (part of TEtranscripts) and then combine or compare developmental stages as you wish. This is also the default mode in TElocal: it only quantifies and does not perform differential analysis.
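
As a rough illustration of the "combine" step, here is a minimal Python sketch that merges per-library count tables into one matrix. It assumes each table is tab-separated with a header row, feature names in the first column and counts in the second; the function names and the sample data are hypothetical and not part of TEcount or TElocal.

```python
import csv

def read_count_table(lines):
    """Parse a two-column, tab-separated count table that has a header row.

    Assumed layout (not confirmed by the tools): feature name, then count.
    """
    reader = csv.reader(lines, delimiter="\t")
    next(reader, None)  # skip the header row
    return {row[0]: float(row[1]) for row in reader if len(row) >= 2}

def merge_count_tables(tables):
    """Combine {sample: {feature: count}} into (samples, {feature: [counts]}).

    Features missing from a sample are filled with 0.
    """
    samples = list(tables)
    features = sorted(set().union(*(tables[s].keys() for s in samples)))
    matrix = {f: [tables[s].get(f, 0.0) for s in samples] for f in features}
    return samples, matrix

# Hypothetical example: two developmental stages, one table each.
stage1 = read_count_table(["gene/TE\tstage1.bam\n", "TE_A\t10\n", "TE_B\t4\n"])
stage2 = read_count_table(["gene/TE\tstage2.bam\n", "TE_A\t3\n"])
samples, matrix = merge_count_tables({"stage1": stage1, "stage2": stage2})
```

The merged matrix can then be passed to whatever downstream comparison across stages is appropriate.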

TEtranscripts aggregates by gene_id, whereas TElocal does not aggregate and uses the transcript_id as the distinguishing annotation, so each transcript_id must be unique (at least in the current version). Merging different entries under the same gene_id is therefore possible, but it would have no effect in TElocal, and your proposed modification would break it because the transcript_id would no longer be unique.
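
To check the uniqueness requirement on a GTF before running TElocal, something like the following Python sketch could be used. It assumes attributes are written as `key "value";` with closing quotes; the function name is hypothetical and the check is not part of TElocal itself.

```python
import re
from collections import Counter

# Assumes well-formed GTF attributes of the form: transcript_id "value";
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')

def duplicated_transcript_ids(gtf_lines):
    """Return the transcript_id values that appear on more than one line."""
    seen = Counter()
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        match = TRANSCRIPT_ID.search(line)
        if match:
            seen[match.group(1)] += 1
    return sorted(tid for tid, n in seen.items() if n > 1)

# Two entries sharing one transcript_id, as in the proposed modification.
example = [
    'ptg001153l\tRepeatMasker\texon\t3189\t3423\t1285\t+\t.\t'
    'gene_id "Denovo_TE00000003"; transcript_id "Denovo_TE00000003";',
    'ptg001153l\tRepeatMasker\texon\t3425\t3587\t1006\t+\t.\t'
    'gene_id "Denovo_TE00000003"; transcript_id "Denovo_TE00000003";',
]
duplicates = duplicated_transcript_ids(example)
```

An empty result means every transcript_id in the file is unique.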

I'd be happy to discuss further to know exactly what you want, but it's not clear if your proposed GTF modification would generate your desired effect.

Thanks.

AlisaGU commented 8 months ago

Sorry for the confusing description. I want to quantify TE expression at both the family level and the individual locus level. The TElocal run (locus level) is in preparation, and following your suggestion, TEcount will be my next step.

AlisaGU commented 8 months ago

Is it OK to treat the sum of the expression counts of all TE loci belonging to one family as the total expression count of that family? If so, there is no need to run TEcount.

olivertam commented 8 months ago

Ah, I see what you mean now.

To address your latest question: yes, in theory you can sum the per-locus counts of a family (though in your GTF this information appears to be in the class_id field) and get the same result, perhaps with slight variation due to EM, as if you ran TEcount aggregating at the class_id level.
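
A minimal Python sketch of that summation follows. It assumes each locus name in the count table is colon-separated in the order transcript_id:gene_id:family_id:class_id; that layout, the function name, and the counts are assumptions for illustration, so check the actual names in your table before choosing the field.

```python
from collections import defaultdict

def sum_counts_by_field(locus_counts, field, sep=":"):
    """Sum per-locus counts grouped by one field of the locus name.

    `field` is the 0-based position in the separated name (assumed layout:
    transcript_id, gene_id, family_id, class_id). Names with too few fields,
    such as ordinary genes, are kept under their own name.
    """
    totals = defaultdict(float)
    for name, count in locus_counts.items():
        parts = name.split(sep)
        key = parts[field] if len(parts) > field else name
        totals[key] += count
    return dict(totals)

# Hypothetical per-locus counts.
counts = {
    "TE1:TE1:famA:SINE/tRNA-Deu": 5,
    "TE2:TE2:famB:SINE/tRNA-Deu": 7,
    "TE3:TE3:famC:LINE/L2": 2,
    "geneX": 11,
}
by_class = sum_counts_by_field(counts, field=3)
```

Passing a different `field` groups by another annotation level with the same code.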

If you actually want to modify your GTF, you would need to transfer the class_id value to the gene_id value, while keeping everything else the same. However, as you pointed out, you probably don't need to do so if you're already running TElocal.
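
If the GTF does need to be rewritten this way, a sketch along these lines could do it. It assumes well-formed attributes of the form `key "value";` and uses a hypothetical helper name; every other field, including the transcript_id, is left untouched.

```python
import re

# Assumes each attribute is written as: key "value";
GENE_ID = re.compile(r'gene_id "[^"]*"')
CLASS_ID = re.compile(r'class_id "([^"]*)"')

def class_id_to_gene_id(line):
    """Return the GTF line with gene_id replaced by the line's class_id value.

    Lines without a class_id attribute are returned unchanged.
    """
    match = CLASS_ID.search(line)
    if match is None:
        return line
    new_attr = f'gene_id "{match.group(1)}"'
    # A function replacement avoids backslash handling in the new value.
    return GENE_ID.sub(lambda _m: new_attr, line, count=1)

original = (
    'ptg001153l\tRepeatMasker\texon\t9\t1239\t6123\t-\t.\t'
    'gene_id "Denovo_TE00000001"; transcript_id "Denovo_TE00000001"; '
    'family_id "rnd-6_family-1051 4089 5314"; class_id "LTR/Gypsy";'
)
rewritten = class_id_to_gene_id(original)
```

Applying the function to each line of the file and writing the results to a new file keeps the original annotation intact for the TElocal run.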

Thanks.

AlisaGU commented 8 months ago

Thanks!