Closed by juanb001 4 years ago
Hi,
Please note that with the new release of TElocal (version 1.1 or later), the pre-built indices are now XXXX.locInd files. They can be obtained here. It is my understanding that the GTF files used to create these indices are the same ones used for TEtranscripts. If this is not the case, I will post another comment on this thread.
Thanks.
Great, thanks for the clarification!
A follow-up question: is it possible to replicate the results from TEtranscripts using TElocal? In other words, if I add up the counts from all the loci for a TE subfamily (for example, L1HS), are the results the same as those provided by TEtranscripts?
Hi,
That's a great question. When we ran the same sample through both programs, we got quite similar (but not identical) results. Here are some examples of the comparison:
#TE TEtranscripts TElocal
MIRb:MIR:SINE 162834 160647
AluJb:Alu:SINE 156699 152751
L2a:L2:LINE 142254 140068
MIR:MIR:SINE 127087 125202
AluSx1:Alu:SINE 124528 119677
AluSx:Alu:SINE 118349 113513
AluSz:Alu:SINE 116267 112332
L2c:L2:LINE 112002 110648
AluJr:Alu:SINE 104241 101732
AluY:Alu:SINE 88927 83356
AluJo:Alu:SINE 88044 85755
L2b:L2:LINE 69778 68845
MIRc:MIR:SINE 68106 67251
AluSq2:Alu:SINE 65393 62699
MIR3:MIR:SINE 56357 55740
AluSz6:Alu:SINE 55794 54165
AluSp:Alu:SINE 55620 52991
L2:L2:LINE 53589 52829
L3:CR1:LINE 43646 43000
L1M5:L1:LINE 39980 38908
AluSg:Alu:SINE 39901 38099
...
L1HS:L1:LINE 2807 2680
We find that the numbers are not exactly the same, and that TElocal tends to "undercount". This is probably due to our conservative approach with the EM, which might have a harder time redistributing a read when it can go to more loci (vs. a TE subfamily).
For most TEs, the differences are small (on average 2-3% fewer reads in TElocal). However, we have noticed that LTR elements tend to differ a lot more between the two algorithms, and we suspect that this is caused by the separation of the LTR and the internal sequences in the RepeatMasker annotation, which might further exacerbate the redistribution problem. We are exploring how much impact this has on our analyses, and are considering various ideas on how to address it. Hope this answers your question.
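For anyone who wants to repeat this comparison on their own data, here is a minimal sketch of collapsing per-locus counts to subfamily totals and computing how much lower the TElocal total is. It assumes that TElocal names each locus as `locus:subfamily:family:class` (for example `L1HS_dup1:L1HS:L1:LINE`), so that dropping the first field gives the TEtranscripts-style name; the per-locus counts and the function names below are made up for illustration. Only the L1HS and MIRb totals in the last lines come from the table above.

```python
from collections import defaultdict


def collapse_to_subfamily(locus_counts):
    """Sum per-locus counts into per-subfamily totals.

    Assumption: TE locus names have four ':'-separated fields
    (locus:subfamily:family:class). Anything else (e.g. gene IDs)
    is skipped.
    """
    totals = defaultdict(int)
    for name, count in locus_counts.items():
        parts = name.split(":")
        if len(parts) != 4:
            continue  # not a TE locus under the assumed naming scheme
        totals[":".join(parts[1:])] += count
    return dict(totals)


def percent_lower(tetranscripts_count, telocal_count):
    """Percentage by which the TElocal count falls below TEtranscripts."""
    return 100.0 * (tetranscripts_count - telocal_count) / tetranscripts_count


# Hypothetical per-locus counts, for illustration only.
toy_locus_counts = {
    "L1HS_dup1:L1HS:L1:LINE": 10,
    "L1HS_dup2:L1HS:L1:LINE": 5,
    "AluY_dup1:AluY:Alu:SINE": 7,
    "ENSG00000000003": 100,  # a gene entry, ignored by the collapse
}
subfamily_totals = collapse_to_subfamily(toy_locus_counts)
print(subfamily_totals)

# Subfamily totals taken from the comparison table above.
print(round(percent_lower(2807, 2680), 1))      # L1HS:L1:LINE
print(round(percent_lower(162834, 160647), 1))  # MIRb:MIR:SINE
```

With a real TElocal count table, the dictionary would be built from the two columns of the output file rather than typed in by hand.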
Thanks.
Hi again,
I just wanted to double check something; can you confirm that the following GTF and locInd files are related?
human: hg38_rmsk_TE.gtf.locInd hg38_rmsk_TE_20200804.gtf.gz
mouse: mm10_rmsk_TE.gtf.locInd mm10_rmsk_TE.gtf.gz
Thanks!
Hi,
Yes, the GTF files were used to build the locInd files.
The nomenclature used in the locInd files (and thus TElocal output) should correspond to the transcript_id in the GTF.
Please let us know if you see any discrepancies.
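One way to look for such discrepancies is to collect every transcript_id from the GTF and check the locus names in the TElocal output against that set. The sketch below assumes a standard nine-column, tab-separated GTF with `transcript_id "..."` in the attribute column, and assumes that the first ':'-separated field of a TElocal name is the transcript_id. The GTF line, the locus names, and the function names are invented for illustration.

```python
import re

# Matches the transcript_id attribute in a GTF attribute column.
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def transcript_ids(gtf_lines):
    """Collect all transcript_id values from an iterable of GTF lines."""
    ids = set()
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        match = TRANSCRIPT_ID.search(fields[8])
        if match:
            ids.add(match.group(1))
    return ids


def unmatched_loci(telocal_names, gtf_ids):
    """Return TElocal names whose first ':'-field is not a known transcript_id."""
    return sorted(n for n in telocal_names if n.split(":")[0] not in gtf_ids)


# Hypothetical GTF record and TElocal names, for illustration only.
toy_gtf = [
    'chr1\thg38_rmsk\texon\t100\t200\t.\t+\t.\t'
    'gene_id "L1HS"; transcript_id "L1HS_dup1"; family_id "L1"; class_id "LINE";\n'
]
toy_names = ["L1HS_dup1:L1HS:L1:LINE", "AluY_dup9:AluY:Alu:SINE"]

ids = transcript_ids(toy_gtf)
missing = unmatched_loci(toy_names, ids)
print(missing)
```

For a compressed file such as hg38_rmsk_TE_20200804.gtf.gz, the lines can be streamed with `gzip.open(path, "rt")` from the Python standard library and passed to `transcript_ids` directly.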
Thanks.
Awesome, thanks!
We are now providing annotation tables that give the genomic location of each TE, keyed by the name that TElocal outputs. Please make sure that you're using the table that corresponds to your TElocal index file (*.locInd). Thanks.
Hi,
I was wondering if you could provide GTF or BED files with the coordinates of the TEs in the gtf.loc files currently available on your servers? Or do the names and coordinates of each insertion correspond to the names and coordinates used in the GTF file available for TEtranscripts?
Separate question regarding TEsmall (since I'm not sure if the GitHub repository for it is being actively maintained): would it be possible to add native support for the newer human (hg38) and mouse (mm10) genomes? It seems that you can currently specify mouse or human and the tool will download the corresponding annotation files, but only for hg19/mm9. It would be awesome if the tool natively supported the newer genome builds!
Thanks for the help and tools!