output TE_analysis with ENSMUS codes

Hello,

Thank you for the tool, it's well documented and easy to use. I come to you with a question because I am not sure of understanding the output files correctly.

I mapped paired-end RNA-seq data to mm10 (3 biological replicates) (ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M20/gencode.vM20.annotation.gtf.gz) and sorted the bam files.

I then used the following command line for TE_transcript:

TEtranscripts -t results/mapped/KO1/KO1.sorted.bam \
results/mapped/KO2/KO2.sorted.bam \
results/mapped/KO3/KO3.sorted.bam \
-c results/mapped/WT1/WT1.sorted.bam \
results/mapped/WT2/WT2.sorted.bam \
results/mapped/WT3/WT3.sorted.bam \
--TE GRCm38_Ensembl_rmsk_TE.gtf \
--stranded no \
--format BAM --mode multi \
--GTF gencode.vM20.annotation.gtf  \
--project test_RNA

The count table look like this :

ene/TE results/mapped/KO1/KO1.sorted.bam.T     results/mapped/KO2/KO2.sorted.bam.T     results/mapped/KO3/KO3.sorted.bam.T     results/mapped/WT1/WT1.sorted.bam.C     results/mapped/WT2/WT2.sorted.bam.C     results/mapped/WT3/WT3.sorted.bam.C
"ENSMUSG00000000001.4"  1372    1416    1624    1423    1292    1527
"ENSMUSG00000000003.15" 0       0       0       0       0       0
"ENSMUSG00000000028.15" 50      66      76      79      59      57
"ENSMUSG00000000031.16" 709     321     262     79      515     25
"ENSMUSG00000000037.16" 26      17      40      131     16      171
"ENSMUSG00000000049.11" 1       0       0       0       0       0

And the gene_TE_analysis.txt file look like this :


baseMean        log2FoldChange  lfcSE   stat    pvalue  padj
ENSMUSG00000000001.4    1435.11018415518        0.0698492267212 0.103874931481332       0.672435839187562       0.501306273292323       0.984879485494981
ENSMUSG00000000028.15   64.1543262163422        -0.0104096039033164     0.285085282912991       -0.0365139995897069     0.970872516095246       0.999987395681906
ENSMUSG00000000031.16   320.976708429315        1.0565817741307 1.18923896899128        0.88845202829747        0.374297649410238       0.984879485494981
ENSMUSG00000000037.16   65.6419206021165        -1.92632402661671       0.902886033330908       -2.13351846800659       0.0328822183176463      0.447240830680681
ENSMUSG00000000056.7    68.241596732665 0.184525215640253       0.360781996073131       0.511459046317957       0.609029658222439       0.984879485494981

I was hoping to obtain coordinates and name of the TE differentially expressed but it seems that I only got genes. It's not clear to me why I get a table with this information since my --TE GRCm38_Ensembl_rmsk_TE.gtf contain information about chr, start, end and id of the TE.

Does anyone have an idea of what I am doing wrong?

Thanks in advance for your help!

Jihed

Hi Jihed,

Thank you for your interest in the software. Looking at your command line, it appears that you're working with a GENCODE gene GTF, but an "Ensembl" TE GTF, and you mentioned that you're using mm10 (which I typically associate with UCSC). The annoying thing is that the three systems have slightly different ways of naming their chromosomes, and there aren't many tools that would be able to automatically inter-convert between them. My suspicion is that there is a mismatch between your alignment and GTF files, and thus you're not actually quantifying TE.

Thus, I would recommend double-checking that your genome alignments matches either your gene GTF or TE GTF, and alter the other one accordingly. If you need the GENCODE version of the TE GTF (for GRCm38), it has just been added.

Brief notes: UCSC and GENCODE uses the "chr" prefix for all of the "canonical" chromosomes (chr 1-19, X Y), and uses chrM for mitochondria. Ensembl drops the "chr" prefix, and uses MT for mitochondria. Ensembl and GENCODE uses the "original" accession ID for scaffolds and alternate haplotypes, while UCSC adds prefixes (e.g. chrU_), and suffixes (e.g. _alt or _random) to them.

To address your second comment:

I was hoping to obtain coordinates and name of the TE differentially expressed ...

TEtranscripts is designed to aggregate TE expression across sub-families (e.g. IAP-Ez-int) from multiple copies of the genome. Thus, you would not get the precise copy of the TE that is differentially expressed, just the sub-familiy, but we have found that this is sufficient for many analyses. If you are need a TE locus quantification, we would recommend trying TElocal, which is in beta. You will need to download a pre-built index from this location, though we currently only have UCSC mm10 available.

Hope this answers some of your questions. Please feel free to respond if there are further questions or issues. Thanks

P.S. In case you are not aware, you can check the chromosome names of your BAM file by running the following: samtools view -H [Your BAM file]

@HD     VN:1.4  SO:coordinate
@SQ     SN:chr1 LN:248956422
@SQ     SN:chr10        LN:133797422
@SQ     SN:chr11        LN:135086622
@SQ     SN:chr12        LN:133275309
@SQ     SN:chr13        LN:114364328
@SQ     SN:chr14        LN:107043718
@SQ     SN:chr15        LN:101991189
@SQ     SN:chr16        LN:90338345
@SQ     SN:chr17        LN:83257441
@SQ     SN:chr18        LN:80373285
@SQ     SN:chr19        LN:58617616
@SQ     SN:chr2 LN:242193529
...

Hi @olivertam, Thanks for you detailled answer. It was indeed a problem of compatibility between the bam file and the TE GTF. It works with the one you provided.

I am definitely interested in the TElocal tool, thanks for mentioning it :) It will help me to overlap TE expression with my ATAC-seq data!

Thanks again!

mhammell-laboratory / TEtranscripts

output TE_analysis with ENSMUS codes #85