locating TEs, esp. in Retained Introns (RI)

mhammell-laboratory / TEtranscripts

A package for including transposable elements in differential enrichment analysis of sequencing datasets.

http://hammelllab.labsites.cshl.edu/software/#TEtranscripts

GNU General Public License v3.0


locating TEs, esp. in Retained Introns (RI) #204

Open tud03125 opened 19 hours ago

tud03125 commented 19 hours ago

Hi Everyone,

I was able to run TEtranscripts successfully with this command:

singularity exec tetranscripts.sif TEtranscripts \
    -t /home/tud03125/pipeline/mouse_liver_rnasplice/star_salmon/I3_sorted.bam \
       /home/tud03125/pipeline/mouse_liver_rnasplice/star_salmon/I4_sorted.bam \
       /home/tud03125/pipeline/mouse_liver_rnasplice/star_salmon/I5_sorted.bam \
       /home/tud03125/pipeline/mouse_liver_rnasplice/star_salmon/I6_sorted.bam \
       /home/tud03125/pipeline/mouse_liver_rnasplice/star_salmon/I7_sorted.bam \
       /home/tud03125/pipeline/mouse_liver_rnasplice/star_salmon/I8_sorted.bam \
       /home/tud03125/pipeline/mouse_liver_rnasplice/star_salmon/I9_sorted.bam \
       /home/tud03125/pipeline/mouse_liver_rnasplice/star_salmon/I10_sorted.bam \
       /home/tud03125/pipeline/mouse_liver_rnasplice/star_salmon/I11_sorted.bam \
    -c /home/tud03125/pipeline/mouse_liver_rnasplice/star_salmon/I1_sorted.bam \
       /home/tud03125/pipeline/mouse_liver_rnasplice/star_salmon/I2_sorted.bam \
    --GTF /home/tud03125/pipeline/Mus_musculus.GRCm39.112.gtf \
    --TE /home/tud03125/pipeline/mm39_rmsk_TE.gtf \
    --sortByPos

However, my boss is more interested in knowing where these TEs are located, so that primers can be designed for experimental validation. Is there a way, with TEtranscripts, TEcount, or any of your other programs, to map where these TEs are located, especially in retained introns (RI)? Those are the most interesting to us, since we are looking for dsRNA.

olivertam commented 19 hours ago

Hi,

Thank you for your interest in the software. We would recommend trying TElocal, which quantifies TEs at individual loci. You will have to run each library separately, then join the outputs into a single table. You can then run DESeq2 (or any other differential analysis algorithm) to look for differentially expressed loci. You will need to download an indexed database for the TE annotations (available here), with the corresponding genomic locations available here.
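The "run each library separately, then join" step could be sketched roughly as follows. This is only an illustration under assumptions: it assumes each per-library output is a tab-separated file with one header row, the feature name in the first column and an integer count in the second. The function names are hypothetical and not part of TElocal.

```python
import csv
from pathlib import Path


def read_counts(path):
    """Read one two-column, tab-separated count table into a dict."""
    counts = {}
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        next(reader, None)  # skip the header row (assumed to be present)
        for row in reader:
            if len(row) >= 2:
                counts[row[0]] = int(row[1])
    return counts


def join_count_tables(paths):
    """Merge per-library tables into one feature-by-sample table.

    Features missing from a library are filled with 0.
    Sample names are taken from the file names (an assumption).
    """
    samples = [Path(p).stem for p in paths]
    per_sample = [read_counts(p) for p in paths]
    features = sorted(set().union(*per_sample))
    table = [["feature"] + samples]
    for feature in features:
        table.append([feature] + [c.get(feature, 0) for c in per_sample])
    return table


def write_table(table, out_path):
    """Write the joined table as tab-separated text."""
    with open(out_path, "w", newline="") as handle:
        csv.writer(handle, delimiter="\t").writerows(table)
```

The joined table can then be loaded as the count matrix for DESeq2 or a similar tool.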

One thing that I would confirm is that you are using the correct annotations, as it appears that your gene GTF (Mus_musculus.GRCm39.112.gtf) might be from Ensembl, while the TE GTF is designed for UCSC. Depending on your alignment, you might have issues with annotations if the chromosome names do not match.
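A quick way to check for this kind of mismatch is to compare the sequence names in the first column of the two GTF files. The sketch below assumes plain-text (uncompressed) GTFs; the function names are made up for illustration.

```python
def gtf_seqnames(path):
    """Collect the distinct sequence names (column 1) of a GTF file."""
    names = set()
    with open(path) as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue  # skip comment and blank lines
            names.add(line.split("\t", 1)[0])
    return names


def compare_seqnames(gene_gtf, te_gtf):
    """Report which sequence names are shared or unique to each file."""
    genes = gtf_seqnames(gene_gtf)
    tes = gtf_seqnames(te_gtf)
    return {
        "shared": genes & tes,
        "gene_only": genes - tes,
        "te_only": tes - genes,
    }
```

If `shared` comes back empty (for example, `1` in one file and `chr1` in the other), the two annotations use different naming schemes.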

Thanks

tud03125 commented 19 hours ago

@olivertam About this one:

One thing that I would confirm is that you are using the correct annotations, as it appears that your gene GTF (Mus_musculus.GRCm39.112.gtf) might be from Ensembl, while the TE GTF is designed for UCSC. Depending on your alignment, you might have issues with annotations if the chromosome names do not match.

Yeah, I've noticed. But I couldn't find an Ensembl version of the TE GTF. If you could point me to it, that'd be great, since these BAM files were made using Mus_musculus.GRCm39.112.gtf.

olivertam commented 19 hours ago

Hi,

It is available here.

Thanks.

tud03125 commented 18 hours ago

@olivertam Thanks! Very helpful!

Two questions:

For this one:

You will need to download an indexed database for the TE annotations (available here), with the corresponding genomic location available here

In the annotation tables folder (https://www.dropbox.com/scl/fo/jdpgn6fl8ngd3th3zebap/ACJd90BZ2kcgQQl_Yo70xcM/TElocal/annotation_tables?rlkey=41oz6ppggy82uha5i3yo1rnlx&e=2&subfolder_nav_tracking=1&dl=0), there is only a GENCODE version for GRCm39. The only Ensembl version I found is for GRCm38. Do you know when an Ensembl version for GRCm39 will be available?

Also, for the "corresponding genomic location," where or how do I use it in TElocal? Can it be used in TEtranscripts too (the TElocal format looks very similar to that of TEtranscripts)?

olivertam commented 17 hours ago

Hi,

The GRCm39 Ensembl genomic locations are available here.

The genomic locations are generated from the GTF used in TEtranscripts, so when you get the output from TElocal, the transcript_id should match up with the start of the row names. Unfortunately, because TEtranscripts aggregates the information when quantifying, you won't be able to assess expression per locus with it.
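A lookup from TElocal row names to genomic locations could be sketched as below. It rests on assumptions: that the locations file is tab-separated with the TE identifier in the first column and `chromosome:start-stop:strand` in the second, and that the identifier is the part of the row name before the first colon. The function names are hypothetical.

```python
import re

# Matches strings such as "1:3000001-3002128:-"
LOCATION_RE = re.compile(
    r"^(?P<chrom>[^:]+):(?P<start>\d+)-(?P<end>\d+):(?P<strand>[-+.])$"
)


def load_locations(path):
    """Map each TE identifier to (chrom, start, end, strand)."""
    locations = {}
    with open(path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2:
                continue
            match = LOCATION_RE.match(fields[1])
            if match:  # header lines and malformed rows are skipped
                locations[fields[0]] = (
                    match["chrom"],
                    int(match["start"]),
                    int(match["end"]),
                    match["strand"],
                )
    return locations


def locate(row_name, locations):
    """Look up a count-table row name; returns None if it is not found."""
    te_id = row_name.split(":", 1)[0]  # assumed: identifier precedes first colon
    return locations.get(te_id)
```

Gene rows in the count table would simply return `None` here, since they have no entry in the locations file.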

Thanks

tud03125 commented 12 hours ago

@olivertam So, I'm starting to run TElocal. For the TE annotation, it only accepts files ending in .locInd (per the message "TE annotation file needs to be a TElocal index, which will end in .locInd(base)"). From your links, the closest match I could find for Ensembl and the GRCm39 genome is GRCm39_Ensembl_rmsk_TE.gtf.locInd, which is the one I've used. That said, I'm having a hard time viewing the contents of this file (head, tail, or cat just show unreadable characters). Is there a way to view the GRCm39_Ensembl_rmsk_TE.gtf.locInd file, or is there no need, since it's the same as GRCm39_Ensembl_rmsk_TE.gtf?

olivertam commented 12 hours ago

Hi,

The locInd file is in binary, so you can't read it as a text file. It is generated from the TE GTF but stored as a pickled index so that it loads quickly (generating the index takes quite a long time, so it would not be practical to rebuild it every time).
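As a toy illustration of why a pickled index is unreadable with head or cat, the sketch below pickles a small made-up dictionary. The structure and coordinates are invented for the example and do not reflect the actual layout of a .locInd file.

```python
import pickle

# Made-up miniature "index": chromosome -> list of (start, end, TE name)
index = {"1": [(100, 250, "TEa_dup1")], "X": [(5000, 5600, "TEb_dup7")]}

blob = pickle.dumps(index)      # serialized form is raw bytes, not text
restored = pickle.loads(blob)   # loading restores the original object

print(type(blob).__name__)      # the serialized data is a bytes object
print(restored == index)        # round trip preserves the content
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.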

Thanks.

tud03125 commented 11 hours ago

@olivertam One other question: the GRCm39_Ensembl_rmsk_TE.gtf.locInd.locations file you gave me, which lists each TE with its chromosome:start-stop:strand, is very helpful for mapping purposes, especially since TEtranscripts and TElocal only quantify how many reads map to genes and TEs. But how do I use the GRCm39_Ensembl_rmsk_TE.gtf.locInd.locations file for this mapping? TElocal states that the "TE annotation file needs to be a TElocal index, which will end in .locInd(base)", so I can only pass GRCm39_Ensembl_rmsk_TE.gtf.locInd to TElocal, not GRCm39_Ensembl_rmsk_TE.gtf.locInd.locations, and I am not sure how to use it with TEtranscripts either.

olivertam commented 10 hours ago

Hi,

When you quantify using TElocal, you will get a count table with a row for each TE locus, named according to the names used in the locInd.locations file. After differential analysis, you can look up the genomic location of any locus that is differentially expressed and check whether it overlaps a genomic feature of interest (e.g. a retained intron).
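Checking whether a located TE overlaps a feature of interest comes down to an interval comparison. The sketch below assumes 1-based inclusive coordinates (the GTF convention) and a hypothetical list of feature intervals, for example retained-intron coordinates taken from a separate splicing analysis. None of the names come from TElocal.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two inclusive intervals share at least one base."""
    return a_start <= b_end and b_start <= a_end


def features_overlapping(te_location, features):
    """Return the names of features that overlap a TE locus.

    te_location: (chrom, start, end, strand)
    features: iterable of (name, chrom, start, end); hypothetical input,
              e.g. retained introns from another tool
    """
    chrom, start, end, _strand = te_location
    return [
        name
        for name, f_chrom, f_start, f_end in features
        if f_chrom == chrom and overlaps(start, end, f_start, f_end)
    ]
```

This ignores strand; whether strand should matter depends on the question being asked (for dsRNA, TEs on either strand may be relevant).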

Thanks.