Closed laijen000 closed 4 years ago
Hi,
You can find the GTF that was used to build the TElocal index for mm10 here. The gene_id should match up with the name in the count table (.cntTable file).
Please let me know if you encounter any issues.
Thanks.
Hi! Yes, thanks a lot for this tool! We use it a lot!
Unfortunately, the link no longer seems to exist. I also noticed that the GTF file at https://labshare.cshl.edu/shares/mhammelllab/www-data/TEtranscripts/TE_GTF/mm10_rmsk_TE.gtf.gz is not identical to the GTF used for the TElocal index.
Best wishes, Elin
Hi,
Thank you for pointing this out. The link has been replaced with this, which is a text file containing the genomic location of each TE copy in the TElocal index. It is unclear what you mean by "[mm10 TEtranscripts GTF] is not identical with the gtf used for the TElocal index." I checked the first 20 annotations between the text file and the GTF, and they appear to match.
Thanks.
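For anyone who wants to attach coordinates to a count table, here is a minimal Python sketch. It assumes the location text file has one TE per line in the form `<TE name><whitespace><chrom>:<start>-<end>` (as in the examples quoted in this thread) and that the .cntTable is tab-separated with the TE name in the first column and the count in the second. The function names `load_te_locations` and `annotate_counts` are made up for illustration and are not part of TElocal.

```python
import re

# Assumed line layout: "<TE name><whitespace><chrom>:<start>-<end>"
LOC_RE = re.compile(r"^(\S+)\s+(\S+):(\d+)-(\d+)\s*$")


def load_te_locations(lines):
    """Return {TE name: (chrom, start, end)} from an iterable of lines."""
    locations = {}
    for line in lines:
        match = LOC_RE.match(line)
        if match is None:
            # Header or otherwise unparseable line: skip it.
            continue
        name, chrom, start, end = match.groups()
        locations[name] = (chrom, int(start), int(end))
    return locations


def annotate_counts(count_lines, locations):
    """Yield (name, count, chrom, start, end) for rows whose name is a TE locus.

    Rows that are not in the location table (genes, header) are skipped.
    The count is passed through as the original string.
    """
    for line in count_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2 or fields[0] not in locations:
            continue
        chrom, start, end = locations[fields[0]]
        yield fields[0], fields[1], chrom, start, end
```

Both functions take any iterable of lines, so open file handles can be passed in directly. The file names you use will depend on your own run.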
Hi,
Thanks for the new link!
Yes, most entries are identical, but there are 140 small differences in the naming, all of which I think relate to the family. See below for a couple of examples:
Name in prebuilt TElocal index:
Ricksha:Ricksha:MULE-MuDR:DNA
MER96B_dup307:MER96B:hAT:DNA
MamRep137_dup149:MamRep137:TcMar:DNA
Entry in TEtranscripts GTF (mm10_rmsk_TE.gtf):
chr1 mm10_rmsk exon 69851100 69851376 2638 + . gene_id "Ricksha"; transcript_id "Ricksha"; family_id "MuDR"; class_id "DNA";
chr12 mm10_rmsk exon 118623463 118623588 313 - . gene_id "MER96B"; transcript_id "MER96B_dup307"; family_id "hAT-Tip100"; class_id "DNA";
chr7 mm10_rmsk exon 113772853 113772942 246 - . gene_id "MamRep137"; transcript_id "MamRep137_dup149"; family_id "TcMar-Tigger"; class_id "DNA";
Entry in mm10_rmsk_TE.gtf.locInd.gtf:
Ricksha:Ricksha:MULE-MuDR:DNA chr1:69851100-69851376
MER96B_dup307:MER96B:hAT:DNA chr12:118623463-118623588
MamRep137_dup149:MamRep137:TcMar:DNA chr7:113772853-113772942
That is, in the GTF file used for the TElocal index the families were set to MULE-MuDR, hAT and TcMar, whereas in the TEtranscripts GTF the families are MuDR, hAT-Tip100 and TcMar-Tigger. All six family names are present in both GTF files, and for the vast majority of TEs the family assignment is the same.
From this I concluded that the mm10_rmsk_TE.gtf from TEtranscripts is very similar, but not identical, to the file used to build the TElocal index.
Best, Elin
Elin Axelsson-Ekker Bioinformatician
Gregor Mendel Institute of Molecular Plant Biology GmbH Dr. Bohr-Gasse 3, 1030 Vienna, Austria Phone: +43 1 79044 9814 @.*** http://www.gmi.oeaw.ac.at
Hi Elin,
Thank you for looking into it.
Yes, you are correct that the family_id seems to be the cause of the difference. This has been seen previously in multiple GTFs from UCSC (see this TEtranscripts issue), and I suspect that there was a slight change/update to the UCSC RepeatMasker track at some point.
However, I have checked that the transcript id (the first portion of the final annotation as delimited by :) has identical genomic locations between the TEtranscripts GTF and the TElocal pre-built index, so while they are not identical, the differences will not alter the quantification.
Thanks.
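If you would like to repeat this check yourself, the following Python sketch compares the two annotations by transcript id. It assumes a tab-separated GTF with `transcript_id` and `family_id` attributes, and index entries of the form `transcript_id:gene_id:family_id:class_id` followed by `chrom:start-end`, as in the examples quoted in this thread. The helper names are hypothetical and not part of TElocal or TEtranscripts.

```python
import re

# Matches GTF attributes such as: family_id "hAT-Tip100"
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_gtf_line(line):
    """Return (transcript_id, family_id, (chrom, start, end)) for one GTF line."""
    fields = line.rstrip("\n").split("\t")
    attrs = dict(ATTR_RE.findall(fields[8]))
    location = (fields[0], int(fields[3]), int(fields[4]))
    return attrs["transcript_id"], attrs["family_id"], location


def parse_index_entry(line):
    """Return the same triple for one index entry, or None if it does not parse."""
    parts = line.split()
    if len(parts) != 2:
        return None
    name_fields = parts[0].split(":")
    if len(name_fields) != 4:  # e.g. a header line
        return None
    transcript_id, _gene_id, family_id, _class_id = name_fields
    chrom, span = parts[1].rsplit(":", 1)
    start, end = span.split("-")
    return transcript_id, family_id, (chrom, int(start), int(end))


def compare(gtf_lines, index_lines):
    """Return (location_mismatches, family_mismatches) for shared transcript ids.

    family_mismatches maps transcript_id -> (GTF family, index family).
    """
    gtf = {}
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        tid, family, location = parse_gtf_line(line)
        gtf[tid] = (family, location)

    location_mismatches, family_mismatches = [], {}
    for line in index_lines:
        entry = parse_index_entry(line)
        if entry is None or entry[0] not in gtf:
            continue
        tid, family, location = entry
        gtf_family, gtf_location = gtf[tid]
        if location != gtf_location:
            location_mismatches.append(tid)
        if family != gtf_family:
            family_mismatches[tid] = (gtf_family, family)
    return location_mismatches, family_mismatches
```

An empty `location_mismatches` list together with a non-empty `family_mismatches` dictionary would correspond to the situation described above: the same loci, with differing family labels.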
Thank you for TElocal. I was able to obtain the .cntTable containing TE expression on a locus level. However, I am wondering how to map a specific TE locus back to genomic coordinates (mm10)? Thank you for the help!