Hi Jihed,
Thank you for your interest in the software. Looking at your command line, it appears that you're working with a GENCODE gene GTF but an "Ensembl" TE GTF, and you mentioned that you're using mm10 (which I typically associate with UCSC). The annoying thing is that the three systems name their chromosomes slightly differently, and there aren't many tools that can automatically convert between them. My suspicion is that the chromosome names in your alignment and GTF files do not match, and thus you're not actually quantifying any TEs.
Thus, I would recommend double-checking that your genome alignments match either your gene GTF or your TE GTF, and altering the other one accordingly. If you need the GENCODE version of the TE GTF (for GRCm38), it has just been added.
Brief notes:
UCSC and GENCODE use the "chr" prefix for all of the "canonical" chromosomes (chr1-19, chrX, chrY), and use chrM for mitochondria. Ensembl drops the "chr" prefix, and uses MT for mitochondria.
Ensembl and GENCODE use the "original" accession ID for scaffolds and alternate haplotypes, while UCSC adds prefixes (e.g. chrUn_) and suffixes (e.g. _alt or _random) to them.
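To illustrate the naming rules in these notes, here is a minimal Python sketch that renames the canonical chromosomes of an Ensembl-style GTF to the UCSC/GENCODE style. The function names are hypothetical and not part of TEtranscripts. It assumes that only numeric chromosomes, X, Y and MT can be renamed by rule, and it drops scaffolds and alternate haplotypes, since those would need an explicit lookup table.

```python
import re


def ensembl_to_ucsc(name):
    """Map an Ensembl-style canonical chromosome name to UCSC/GENCODE style.

    Returns None for anything else (scaffolds, alternate haplotypes),
    because those cannot be renamed by a simple rule.
    """
    if name == "MT":
        return "chrM"
    if re.fullmatch(r"[0-9]+|X|Y", name):
        return "chr" + name
    return None


def convert_gtf_lines(lines):
    """Yield GTF lines with the first column renamed.

    Comment lines are passed through unchanged; records on sequences
    that cannot be renamed by rule are dropped (an assumption of this sketch).
    """
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        fields = line.split("\t", 1)
        if len(fields) < 2:
            yield line  # not a tab-separated record; leave untouched
            continue
        new_name = ensembl_to_ucsc(fields[0])
        if new_name is None:
            continue
        yield new_name + "\t" + fields[1]
```

Applied to an open file handle, `convert_gtf_lines` can be written straight to a new file with `writelines`. Dropping the non-canonical sequences is a simplification; whether that is acceptable depends on the analysis.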
To address your second comment:
I was hoping to obtain coordinates and name of the TE differentially expressed ...
TEtranscripts is designed to aggregate TE expression at the sub-family level (e.g. IAP-Ez-int) across the multiple copies in the genome. Thus, you would not get the precise copy of the TE that is differentially expressed, just the sub-family, but we have found that this is sufficient for many analyses.
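If the coordinates of every annotated copy of a sub-family are useful as a starting point, they can be read from the TE GTF itself. The Python sketch below is only an illustration: the function name is hypothetical, and it assumes that the TE GTF stores the sub-family in the gene_id attribute and the individual copy in transcript_id. It lists all annotated copies of a sub-family and says nothing about which copy is differentially expressed.

```python
import re


def subfamily_loci(gtf_lines, subfamily):
    """Return (chrom, start, end, strand, copy_id) for each copy of a sub-family.

    Assumes gene_id holds the sub-family name and transcript_id the
    individual copy; check this against your own TE GTF.
    """
    loci = []
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        # Parse attributes of the form: key "value";
        attrs = dict(re.findall(r'(\S+) "([^"]*)"', fields[8]))
        if attrs.get("gene_id") == subfamily:
            loci.append(
                (fields[0], int(fields[3]), int(fields[4]),
                 fields[6], attrs.get("transcript_id"))
            )
    return loci
```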
If you need TE quantification at the locus level, we would recommend trying TElocal, which is in beta. You will need to download a pre-built index from this location, though we currently only have UCSC mm10 available.
Hope this answers some of your questions. Please feel free to respond if there are further questions or issues. Thanks
P.S. In case you are not aware, you can check the chromosome names of your BAM file by running the following:
samtools view -H [Your BAM file]
@HD VN:1.4 SO:coordinate
@SQ SN:chr1 LN:248956422
@SQ SN:chr10 LN:133797422
@SQ SN:chr11 LN:135086622
@SQ SN:chr12 LN:133275309
@SQ SN:chr13 LN:114364328
@SQ SN:chr14 LN:107043718
@SQ SN:chr15 LN:101991189
@SQ SN:chr16 LN:90338345
@SQ SN:chr17 LN:83257441
@SQ SN:chr18 LN:80373285
@SQ SN:chr19 LN:58617616
@SQ SN:chr2 LN:242193529
...
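As a rough way to automate this check, the Python sketch below parses the text printed by samtools view -H and compares the sequence names with the first column of a GTF. The function names are hypothetical, and it assumes the header text has already been captured (for example, redirected to a file).

```python
def bam_header_chroms(header_text):
    """Collect the SN: values from the @SQ lines of a SAM/BAM header."""
    names = set()
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        for field in line.split()[1:]:
            if field.startswith("SN:"):
                names.add(field[3:])
    return names


def gtf_chroms(gtf_lines):
    """Collect the sequence names found in the first column of a GTF."""
    names = set()
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        names.add(line.split("\t", 1)[0])
    return names


def shared_chroms(header_text, gtf_lines):
    """Return the sequence names present in both the BAM header and the GTF."""
    return bam_header_chroms(header_text) & gtf_chroms(gtf_lines)
```

An empty result from `shared_chroms` suggests that the alignment and the annotation use different naming schemes, which is the situation described above.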
Hi @olivertam, thanks for your detailed answer. It was indeed a compatibility problem between the BAM file and the TE GTF. It works with the one you provided.
I am definitely interested in the TElocal tool, thanks for mentioning it :) It will help me to overlap TE expression with my ATAC-seq data!
Thanks again!
Hello,
Thank you for the tool, it's well documented and easy to use. I come to you with a question because I am not sure I understand the output files correctly.
I mapped paired-end RNA-seq data (3 biological replicates) to mm10 (ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M20/gencode.vM20.annotation.gtf.gz) and sorted the BAM files. I then used the following command line for TEtranscripts:
The count table looks like this:
And the gene_TE_analysis.txt file looks like this:
I was hoping to obtain coordinates and name of the TE differentially expressed but it seems that I only got genes. It's not clear to me why I get a table with this information, since my --TE GRCm38_Ensembl_rmsk_TE.gtf contains information about the chr, start, end and id of the TE. Does anyone have an idea of what I am doing wrong?
Thanks in advance for your help!
Jihed