mhammell-laboratory / TElocal

A package for quantifying transposable elements at a locus level for RNAseq datasets.
GNU General Public License v3.0

TE Coordinates and TEsmall question #10

Closed juanb001 closed 3 years ago

juanb001 commented 3 years ago

Hi,

I was wondering if you could provide GTF or BED files with the coordinates of the TEs in the gtf.loc files currently available on your servers? Or do the names and coordinates of each insertion correspond to the names and coordinates used in the GTF file available for TEtranscripts?

Separate question regarding TEsmall (since I'm not sure if the Github for that is being actively maintained): would it be possible to incorporate native support for the newer human (hg38)/mouse (mm10) genomes? It seems to me that you can currently specify mouse/human and the tool will download the corresponding annotation files, but for hg19/mm9. It would be awesome if the tool natively supported the newer genome builds!

Thanks for the help and tools!

olivertam commented 3 years ago

Hi,

Please note that with the new release of TElocal (version 1.1 or later), the pre-built indices are now XXXX.locInd files. They can be obtained here. It is my understanding that the GTF files used to create these indices are the same ones used for TEtranscripts. If this is not the case, I will post another comment on this thread.

Thanks.

juanb001 commented 3 years ago

Great, thanks for the clarification!

A follow-up question: is it possible to replicate the results from TEtranscripts using TElocal? In other words, if I add up the counts from all the loci of a TE subfamily (for example, L1HS), are the results the same as those provided by TEtranscripts?
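The summation described above can be sketched in a few lines of Python. This is only an illustration, and it assumes that TElocal locus names follow the pattern `locus:subfamily:family:class` (for example `L1HS_dup1:L1HS:L1:LINE`). The helper name `collapse_to_subfamily` and the toy counts are hypothetical and are not part of TElocal.

```python
from collections import defaultdict


def collapse_to_subfamily(locus_counts):
    """Sum per-locus counts into per-subfamily totals.

    Assumption: each TE key looks like 'L1HS_dup1:L1HS:L1:LINE', i.e. the
    locus name followed by subfamily:family:class. Keys that do not have
    exactly four colon-separated fields (e.g. gene IDs) are skipped.
    """
    totals = defaultdict(float)
    for name, count in locus_counts.items():
        fields = name.split(":")
        if len(fields) != 4:
            continue  # not a TE locus under the assumed naming scheme
        subfamily_key = ":".join(fields[1:])  # e.g. 'L1HS:L1:LINE'
        totals[subfamily_key] += count
    return dict(totals)


# Toy, made-up counts purely for illustration.
example = {
    "L1HS_dup1:L1HS:L1:LINE": 12,
    "L1HS_dup2:L1HS:L1:LINE": 30,
    "AluY_dup7:AluY:Alu:SINE": 5,
    "GENE1": 100,
}
subfamily_totals = collapse_to_subfamily(example)
```

The resulting keys have the same `subfamily:family:class` form as the TEtranscripts rows, so the two tables can be compared directly.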

olivertam commented 3 years ago

Hi,

That's a great question. When we ran the same sample through both tools, we got quite similar (but not identical) results. Here are some examples of the comparison:

#TE    TEtranscripts    TElocal
MIRb:MIR:SINE   162834  160647
AluJb:Alu:SINE  156699  152751
L2a:L2:LINE     142254  140068
MIR:MIR:SINE    127087  125202
AluSx1:Alu:SINE 124528  119677
AluSx:Alu:SINE  118349  113513
AluSz:Alu:SINE  116267  112332
L2c:L2:LINE     112002  110648
AluJr:Alu:SINE  104241  101732
AluY:Alu:SINE   88927   83356
AluJo:Alu:SINE  88044   85755
L2b:L2:LINE     69778   68845
MIRc:MIR:SINE   68106   67251
AluSq2:Alu:SINE 65393   62699
MIR3:MIR:SINE   56357   55740
AluSz6:Alu:SINE 55794   54165
AluSp:Alu:SINE  55620   52991
L2:L2:LINE      53589   52829
L3:CR1:LINE     43646   43000
L1M5:L1:LINE    39980   38908
AluSg:Alu:SINE  39901   38099
...
L1HS:L1:LINE    2807    2680

We find that the numbers are not exactly the same, and that TElocal tends to "undercount" relative to TEtranscripts. This is probably due to our conservative approach with the EM, which might have a harder time redistributing a read when it can be assigned to many loci (vs. a single TE subfamily).
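To put a number on the gap, the relative shortfall can be computed per subfamily. The sketch below uses three rows copied from the table above; the function name `percent_lower` is just an illustrative choice.

```python
def percent_lower(tetranscripts, telocal):
    """Return how much lower the TElocal count is, as a percentage
    of the TEtranscripts count."""
    return 100.0 * (tetranscripts - telocal) / tetranscripts


# (TEtranscripts, TElocal) pairs taken from the comparison table.
rows = {
    "MIRb:MIR:SINE": (162834, 160647),
    "AluY:Alu:SINE": (88927, 83356),
    "L1HS:L1:LINE": (2807, 2680),
}

shortfall = {te: percent_lower(a, b) for te, (a, b) in rows.items()}
for te, pct in shortfall.items():
    print(f"{te}\t{pct:.2f}%")
```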

For most TEs, the differences are not large (on average 2-3% fewer reads in TElocal). However, we have noticed that LTR elements tend to differ much more between the two algorithms, and we suspect that this is caused by the separation of the LTR and the internal sequences in the RepeatMasker annotation, which might further exacerbate the redistribution problem. We are exploring how much impact this is having on our analyses, and considering various ideas on how to address this. Hope this answers your question.

Thanks.

juanb001 commented 3 years ago

Hi again,

I just wanted to double check something; can you confirm that the following GTF and locInd files are related?

human: hg38_rmsk_TE.gtf.locInd hg38_rmsk_TE_20200804.gtf.gz

mouse: mm10_rmsk_TE.gtf.locInd mm10_rmsk_TE.gtf.gz

Thanks!

olivertam commented 3 years ago

Hi,

Yes, the GTF files were used to build the locInd files. The nomenclature used in the locInd files (and thus TElocal output) should correspond to the transcript_id in the GTF. Please let us know if you see any discrepancies.
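For anyone who wants coordinates keyed by the TElocal name, a minimal sketch of pulling `transcript_id` and the genomic interval out of a GTF line, and converting it to BED, might look like the following. It assumes the standard nine-column, tab-separated GTF layout with `key "value";` attributes. The sample line is made up for illustration and is not taken from the real annotation, and both function names are hypothetical.

```python
import re

# Matches attribute pairs of the form: key "value"
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_te_gtf_line(line):
    """Extract transcript_id and coordinates from one GTF line.

    Returns None for comment lines or lines without nine columns.
    """
    cols = line.rstrip("\n").split("\t")
    if line.startswith("#") or len(cols) < 9:
        return None
    attrs = dict(ATTR_RE.findall(cols[8]))
    return {
        "transcript_id": attrs.get("transcript_id"),
        "chrom": cols[0],
        "start": int(cols[3]),  # GTF coordinates are 1-based, inclusive
        "end": int(cols[4]),
        "strand": cols[6],
    }


def to_bed(rec):
    """Convert a parsed record to a BED6-style tuple (0-based start)."""
    return (rec["chrom"], rec["start"] - 1, rec["end"],
            rec["transcript_id"], ".", rec["strand"])


# Made-up example line; real coordinates will differ.
sample = ('chr1\thg38_rmsk\texon\t100\t250\t.\t+\t.\t'
          'gene_id "L1HS"; transcript_id "L1HS_dup1"; '
          'family_id "L1"; class_id "LINE";')
rec = parse_te_gtf_line(sample)
bed = to_bed(rec)
```

Note that BED uses 0-based, half-open intervals, which is why only the start coordinate is shifted by one.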

Thanks.

juanb001 commented 3 years ago

Awesome, thanks!

olivertam commented 3 years ago

We are now providing annotation tables that give the genomic location of each TE, keyed by the name that TElocal outputs. Please make sure that you're using the table that corresponds to the TElocal index file (*.locInd). Thanks.