Open tud03125 opened 19 hours ago
Hi,
Thank you for your interest in the software. We would recommend trying TElocal, which quantifies TE expression at a particular locus. You will have to run each library separately, then join the outputs into a single table. You can then run DESeq2 (or any other differential analysis algorithm) to look for differentially expressed loci. You will need to download an indexed database for the TE annotations (available here), with the corresponding genomic locations available here.
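For the "join the output into a single table" step, here is a minimal sketch in plain Python. It assumes each per-library count table is tab-separated with one header line, a feature name in the first column, and an integer count in the second; that layout, the function names `read_counts` and `merge_counts`, and the example rows are assumptions for illustration, not something confirmed in this thread.

```python
import csv
import io

def read_counts(handle):
    """Read a two-column, tab-separated count table into {feature: count}.

    Assumes one header line and integer counts (an assumption about the
    per-library output format).
    """
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the header line
    return {row[0]: int(row[1]) for row in reader if len(row) >= 2}

def merge_counts(tables):
    """Merge {sample: {feature: count}} into (header, rows).

    Features missing from a sample are filled with 0.
    """
    samples = list(tables)
    features = sorted(set().union(*(t.keys() for t in tables.values())))
    header = ["feature"] + samples
    rows = [[f] + [tables[s].get(f, 0) for s in samples] for f in features]
    return header, rows

# Hypothetical in-memory tables standing in for two per-library outputs.
sample_a = io.StringIO("gene/TE\ta.bam\nGeneX\t10\nL1Md_A_dup1:L1Md_A:L1:LINE\t3\n")
sample_b = io.StringIO("gene/TE\tb.bam\nGeneX\t7\nB2_Mm2_dup5:B2_Mm2:B2:SINE\t2\n")

header, rows = merge_counts({"a": read_counts(sample_a), "b": read_counts(sample_b)})
```

With real files, pass `open(path)` handles instead of the `io.StringIO` objects, then write `header` and `rows` out with `csv.writer` before loading the matrix into DESeq2.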
One thing that I would confirm is that you are using the correct annotations, as it appears that your gene GTF (Mus_musculus.GRCm39.112.gtf) might be from Ensembl, while the TE GTF is designed for UCSC. Depending on your alignment, you might have issues with annotations if the chromosome names do not match.
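A quick way to check for the chromosome-name mismatch described above is to compare the first column of the two GTF files. This is only a sketch: the helper names `chromosomes` and `compare_chromosomes` and the example GTF lines are made up for illustration.

```python
def chromosomes(gtf_lines):
    """Collect the distinct sequence names (column 1) from GTF lines."""
    names = set()
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comment/header lines and blank lines
        names.add(line.split("\t", 1)[0])
    return names

def compare_chromosomes(gene_lines, te_lines):
    """Report which sequence names are shared or unique to each GTF."""
    gene, te = chromosomes(gene_lines), chromosomes(te_lines)
    return {"shared": gene & te, "gene_only": gene - te, "te_only": te - gene}

# Hypothetical lines: an Ensembl-style gene GTF ("1") and a UCSC-style TE GTF ("chr1").
gene_gtf = [
    "#!genome-build GRCm39",
    '1\tensembl\tgene\t100\t200\t.\t+\t.\tgene_id "G1";',
]
te_gtf = [
    'chr1\tmm39_rmsk\texon\t300\t400\t.\t-\t.\tgene_id "L1Md_A"; transcript_id "L1Md_A_dup1";',
]

report = compare_chromosomes(gene_gtf, te_gtf)
```

An empty `shared` set, as in this example, suggests the two annotations use different naming schemes. With real files, pass open file handles for the two GTFs in place of the lists.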
Thanks
@olivertam About this one:
One thing that I would confirm is that you are using the correct annotations, as it appears that your gene GTF (Mus_musculus.GRCm39.112.gtf) might be from Ensembl, while the TE GTF is designed for UCSC. Depending on your alignment, you might have issues with annotations if the chromosome names do not match.
Yeah, I've noticed. But, I couldn't find the Ensembl version of TE GTF. If you could find it for me, that'd be great since these BAM files were made using that Mus_musculus.GRCm39.112.gtf.
@olivertam Thanks! Very helpful!
Two questions:
You will need to download an indexed database for the TE annotations (available here), with the corresponding genomic location available here
For the annotation's table: https://www.dropbox.com/scl/fo/jdpgn6fl8ngd3th3zebap/ACJd90BZ2kcgQQl_Yo70xcM/TElocal/annotation_tables?rlkey=41oz6ppggy82uha5i3yo1rnlx&e=2&subfolder_nav_tracking=1&dl=0, for GRCm39, there's only GENCODE. The only Ensembl I found is in GRCm38. Do you know when I'll get one from Ensembl for GRCm39?
Hi,
The GRCm39 Ensembl genomic locations are available here.
The genomic location is generated from the GTF used in TEtranscripts, so when you get the output from TElocal, the transcript_id should match up with the start of the row names. Unfortunately, TEtranscripts aggregates the information when quantifying, so you won't be able to assess expression per locus with it.
Thanks
@olivertam So, I'm starting to run TElocal. For the TE annotation, that software only accepts files ending in .locInd (per the message "TE annotation file needs to be a TElocal index, which will end in .locInd"). From your links, the closest I could get for Ensembl and the GRCm39 genome, and the one I've used, is GRCm39_Ensembl_rmsk_TE.gtf.locInd. That said, I'm having a hard time viewing the contents of this file (head, tail, or cat just shows gibberish, unreadable characters). Is there a way to view the GRCm39_Ensembl_rmsk_TE.gtf.locInd file, or is there no need since it's the same as GRCm39_Ensembl_rmsk_TE.gtf?
Hi,
The locInd file is binary, so you can't read it as a text file. It is generated from the TE GTF, but stored as a pickle index that allows easy loading (generating this index takes quite a long time, so it would not be great if it had to be created every time).
Thanks.
@olivertam One other question: the GRCm39_Ensembl_rmsk_TE.gtf.locInd.locations file you gave me, which includes this information: TE chromosome:start-stop:strand, is very helpful for mapping purposes, especially since TEtranscripts and TElocal just quantify how many genes and TEs there are. But how do you use the GRCm39_Ensembl_rmsk_TE.gtf.locInd.locations file for this mapping, given that TElocal states "TE annotation file needs to be a TElocal index, which will end in .locInd", so I can only use GRCm39_Ensembl_rmsk_TE.gtf.locInd and not GRCm39_Ensembl_rmsk_TE.gtf.locInd.locations in TElocal? I am also not sure how to use it with TEtranscripts.
Hi,
When you quantify using TElocal, you will get a count table with one row for each TE locus, named according to the name used in the locInd.locations file. After differential analysis, you will be able to determine the genomic location of any locus that is differentially expressed and see if there are genomic features of interest (e.g. in retained introns).
Thanks.
Hi Everyone,
I was able to run TEtranscripts successfully using these commands on the command line:
However, the boss I am working with is more interested in knowing where these TEs are located so that they can design primers for experimental validation. With TEtranscripts, TEcount, or any of your programs, is there a way to map where these TEs are located, especially in retained introns (RI)? Those are the most interesting to us since we are looking for dsRNA strands.