mhammell-laboratory / TElocal

A package for quantifying transposable elements at a locus level for RNAseq datasets.
GNU General Public License v3.0

Mapping TE loci back to genome #6

Closed laijen000 closed 4 years ago

laijen000 commented 4 years ago

Thank you for TElocal. I was able to obtain the .cntTable containing TE expression on a locus level. However, I am wondering how to map a specific TE locus back to genomic coordinates (mm10)? Thank you for the help!

olivertam commented 4 years ago

Hi,

You can find the GTF that was used to build the TElocal index for mm10 here. The gene_id should match up with the name in the count table (.cntTable file). Please let me know if you encounter any issues.

Thanks.

ax-ekk commented 2 years ago

Hi! Yes, thanks a lot for this tool! We use it a lot!

Unfortunately, the link no longer seems to work. I also noticed that the gtf file here: https://labshare.cshl.edu/shares/mhammelllab/www-data/TEtranscripts/TE_GTF/mm10_rmsk_TE.gtf.gz is not identical with the gtf used for the TElocal index. Best wishes, Elin

olivertam commented 2 years ago

Hi,

Thank you for pointing this out. The link has been replaced with this, which is a text file containing the genomic location of each TE copy in the TElocal index. It is unclear what you mean by "[mm10 TEtranscripts GTF] is not identical with the gtf used for the TElocal index." I checked the first 20 annotations between the text file and the GTF, and they appear to match.
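For readers who want to script the lookup, here is a minimal sketch of mapping names from the .cntTable to coordinates using that text file. It assumes a whitespace-separated, two-column layout (locus name, then `chr:start-end`), as in the entries quoted further down in this thread; the function names `load_locations` and `parse_location` are hypothetical and not part of TElocal.

```python
import io

def load_locations(handle):
    """Return {TE locus name: location string} from a two-column text table."""
    locations = {}
    for line in handle:
        fields = line.split()
        if len(fields) >= 2:  # skip blank or malformed lines
            locations[fields[0]] = fields[1]
    return locations

def parse_location(location):
    """Split 'chr:start-end' into parts (any trailing ':strand' is ignored)."""
    chrom, span = location.split(":")[:2]
    start, end = span.split("-")
    return chrom, int(start), int(end)

# Example rows copied from entries quoted in this thread (layout is assumed).
sample = io.StringIO(
    "Ricksha:Ricksha:MULE-MuDR:DNA\tchr1:69851100-69851376\n"
    "MER96B_dup307:MER96B:hAT:DNA\tchr12:118623463-118623588\n"
)
locations = load_locations(sample)
print(parse_location(locations["MER96B_dup307:MER96B:hAT:DNA"]))
```

In practice the `io.StringIO` sample would be replaced by an open file handle for the downloaded location table, and the keys would be looked up using the first column of the .cntTable.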

Thanks.

ax-ekk commented 2 years ago

Hi,

Thanks for the new link!

Yes, most entries are identical, but there are 140 small differences in the naming, I think all related to the family. See below for a couple of examples:

Name in prebuilt TElocal index:

    Ricksha:Ricksha:MULE-MuDR:DNA
    MER96B_dup307:MER96B:hAT:DNA
    MamRep137_dup149:MamRep137:TcMar:DNA

Entry in TEtranscripts GTF (mm10_rmsk_TE.gtf):

    chr1 mm10_rmsk exon 69851100 69851376 2638 + . gene_id "Ricksha"; transcript_id "Ricksha"; family_id "MuDR"; class_id "DNA";
    chr12 mm10_rmsk exon 118623463 118623588 313 - . gene_id "MER96B"; transcript_id "MER96B_dup307"; family_id "hAT-Tip100"; class_id "DNA";
    chr7 mm10_rmsk exon 113772853 113772942 246 - . gene_id "MamRep137"; transcript_id "MamRep137_dup149"; family_id "TcMar-Tigger"; class_id "DNA";

Entry in mm10_rmsk_TE.gtf.locInd.gtf:

    Ricksha:Ricksha:MULE-MuDR:DNA chr1:69851100-69851376
    MER96B_dup307:MER96B:hAT:DNA chr12:118623463-118623588
    MamRep137_dup149:MamRep137:TcMar:DNA chr7:113772853-113772942

That is, in the GTF file used for the TElocal index the families were set to MULE-MuDR, hAT and TcMar, whereas in the TEtranscripts GTF the families are MuDR, hAT-Tip100 and TcMar-Tigger. All six family names are present in both GTF files, and for the vast majority of TEs the family assignment is the same.

From this I concluded that the mm10_rmsk_TE.gtf from TEtranscripts is not identical to the file used to build the TElocal index, although it is very similar.
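A comparison of this kind can be sketched roughly as follows. It assumes the index name is built as `transcript_id:gene_id:family_id:class_id`, which is inferred from the examples in this comment and not from the TElocal source; `telocal_name` and `name_mismatches` are hypothetical helper names.

```python
import re

# Matches GTF attribute pairs such as: family_id "hAT-Tip100"
ATTR = re.compile(r'(\w+) "([^"]*)"')

def telocal_name(gtf_line):
    """Build the assumed index-style name from one GTF line's attributes."""
    attrs = dict(ATTR.findall(gtf_line))
    keys = ("transcript_id", "gene_id", "family_id", "class_id")
    return ":".join(attrs[key] for key in keys)

def name_mismatches(gtf_lines, index_names):
    """Return (gtf_name, index_name) pairs that share a transcript_id but differ."""
    index_by_tid = {name.split(":")[0]: name for name in index_names}
    mismatches = []
    for line in gtf_lines:
        gtf_name = telocal_name(line)
        index_name = index_by_tid.get(gtf_name.split(":")[0])
        if index_name is not None and index_name != gtf_name:
            mismatches.append((gtf_name, index_name))
    return mismatches

# Example data taken from the entries quoted in this comment.
gtf_lines = [
    'chr1\tmm10_rmsk\texon\t69851100\t69851376\t2638\t+\t.\t'
    'gene_id "Ricksha"; transcript_id "Ricksha"; family_id "MuDR"; class_id "DNA";',
    'chr12\tmm10_rmsk\texon\t118623463\t118623588\t313\t-\t.\t'
    'gene_id "MER96B"; transcript_id "MER96B_dup307"; family_id "hAT-Tip100"; class_id "DNA";',
]
index_names = ["Ricksha:Ricksha:MULE-MuDR:DNA", "MER96B_dup307:MER96B:hAT:DNA"]

for gtf_name, index_name in name_mismatches(gtf_lines, index_names):
    print(gtf_name, "!=", index_name)
```

Run over the full GTF and the full list of index names, a loop like this lists every entry whose name differs while the transcript_id is shared.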

Best, Elin

Elin Axelsson-Ekker Bioinformatician

Gregor Mendel Institute of Molecular Plant Biology GmbH Dr. Bohr-Gasse 3, 1030 Vienna, Austria Phone: +43 1 79044 9814 @.*** http://www.gmi.oeaw.ac.at


olivertam commented 2 years ago

Hi Elin,

Thank you for looking into it. Yes, you are correct that the family_id seems to be the cause of the difference. This has been previously seen in multiple GTF files from UCSC (see this TEtranscripts issue), and I suspect that the UCSC RepeatMasker track was changed or updated slightly at some point. However, I have checked that each transcript_id (the first portion of the final annotation name, as delimited by :) has identical genomic locations in the TEtranscripts GTF and the TElocal pre-built index, so while the files are not identical, the differences will not alter the quantification.
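A check along these lines could look like the sketch below. It is not the script that was actually run; it assumes a standard tab-separated GTF and an index table with the name in the first column and `chr:start-end` in the second, as in the examples quoted above. All function names are hypothetical.

```python
import re

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]*)"')

def gtf_coords(gtf_lines):
    """Map transcript_id -> 'chr:start-end' from tab-separated GTF lines."""
    coords = {}
    for line in gtf_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # skip comments and malformed lines
        match = TRANSCRIPT_ID.search(cols[8])
        if match:
            coords[match.group(1)] = f"{cols[0]}:{cols[3]}-{cols[4]}"
    return coords

def index_coords(index_lines):
    """Map the first ':'-delimited part of each index name to its location."""
    coords = {}
    for line in index_lines:
        fields = line.split()
        if len(fields) >= 2:
            coords[fields[0].split(":")[0]] = fields[1]
    return coords

def coordinate_conflicts(gtf_lines, index_lines):
    """Return transcript_ids present in both inputs whose locations differ."""
    gtf = gtf_coords(gtf_lines)
    index = index_coords(index_lines)
    return sorted(tid for tid in gtf.keys() & index.keys() if gtf[tid] != index[tid])

# Example data taken from the entries quoted earlier in this thread.
gtf_lines = [
    'chr7\tmm10_rmsk\texon\t113772853\t113772942\t246\t-\t.\t'
    'gene_id "MamRep137"; transcript_id "MamRep137_dup149"; '
    'family_id "TcMar-Tigger"; class_id "DNA";',
]
index_lines = ["MamRep137_dup149:MamRep137:TcMar:DNA\tchr7:113772853-113772942"]

print(coordinate_conflicts(gtf_lines, index_lines))
```

If the location column also carries a strand suffix, the string comparison would need to be adjusted accordingly.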

Thanks.