Closed biomesser closed 12 months ago
Hi,
Thank you for your interest in the software.
TEtranscripts require the TE GTF to have the following information and attributes: 1) Follow the GTF format. For example, columns 4 should be the start position, but it is taken by the TE name instead. I think moving column 4 should make it GTF compliant. 2) Column 3 should be "exon", as this is how the software treats each TE copy. 3) In the attributes section (column 9), it requires the following tags:
Here is the first line in your GTF:
1 TE . AT1TE00010 11896 11976 . + . sub_family "ATCOPIA24"; family "ATCOPIA24"; super_family "Copia";
Here is how I would fix it:
1 TE . 11896 11976 . + . gene_id "ATCOPIA24"; transcript_id "AT1TE00010"; family_id "Copia"; class_id "LTR";
Please let me know if you have further questions.
Thanks.
Hl, Olivertam!! Thanks for you ansver! Now my TE.gtf is valid.
Also, let me ask please: Previously I used TAIR10TE.gtf from https://labshare.cshl.edu/shares/mhammelllab/www-data/TEtranscripts/TE_GTF/TAIR10_TE.gtf.gz. My .bam files have high expressions of AT3G32415 (I checked it by IGV browser), but TE expression counting not found this TE. What could be the reason?
Hi,
If you look up AT3G32415 on Arabidopsis.org, you would find that it's associated with the TE, AT3TE54550. That might be the entry that you need to look at in the TE output file, as the TAIR10 TE annotations follow the ATxTExxxxx rather than the gene annotation (ATxGxxxxx).
Thanks.
Yes, my count output file contain ATxGxxxxx ID-s only. I think, TEtranscripts convert ATxTE ID-s to ATxG ID-s. But, i got only one counts-file (and DESeq2 file). I do`t have output file with ATxTE ID-s.
Thanks!!
Hi,
Sorry, now I understand.
TEtranscripts aggregate TE counts by element type (e.g. subfamily). Thus, AT3TE54550 is aggregated into ATCOPIA78. The ATxTE ID is for each copy, which is used for initial annotation, but the counts are then aggregated by the sub-family (gene_id
)
Thanks.
Thanks.
Thanks for you ansvers!!! That is my output file. It contein the ATCOPIA78. biomesser_TE_GTF_FROM_TEtran.cntTable.zip
Hi,
The results you saw for AT3G32415 is probably described on this line:
gene/TE Heat_16_h_161 Heat_16_h_162 Wt_1 Wt_2
ATCOPIA78:Copia:LTR 451253 395844 205 834
Thanks.
Yes, i understand it!! Thanks very match!
Also, let me ask please, do you have .gtf-version (for use with TEtranscripts) of this .bed file TE annotations from https://raw.githubusercontent.com/KaushikPanda1/AthalianaTETranscripts/master/Panda_AT-TEs_annotation_v1.0.bed ? This is .bed file, that i had tried convert at my first post.
Thanks!
Hi,
Here is the gzipped TE GTF of the BED annotation. The code to generate this is:
$ sed '1d' Panda_AT-TEs_annotation_v1.0.bed | \
awk -v OFS="{tab}" '{print $1, "user_provided", "exon", $2+1, $3, $5, $6, ".", "gene_id \"" $7 "\"; transcript_id \"" $4 "\"; family_id \"" $8 "\"; class_id \"" $9 "\";"}' \
sort -k1,1 -k4,5n \
> Panda_AT-TEs_annotation_v1.0.gtf
Thanks.
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days
Sorry, now I understand. TEtranscripts aggregate TE counts by element type (e.g. subfamily). Thus, AT3TE54550 is aggregated into ATCOPIA78. The >ATxTE ID is for each copy, which is used for initial annotation, but the counts are then aggregated by the sub-family >(gene_id)
Hi Oliver!!
May I please ask how TEtranscripts compares id AT3G32415 with id AT3TE54550 and with gene_id/family_id ? Where can i get the table file with this information?
I need to match id, like AT3G32415, AT1G01010 ... from my .cntTable with their families/subfamilys.
Thanks!!
Hi,
The results you saw for AT3G32415 is probably described on this line:
gene/TE Heat_16_h_161 Heat_16_h_162 Wt_1 Wt_2 ATCOPIA78:Copia:LTR 451253 395844 205 834
Thanks.
Hi! I would like to report a possible error in your answer. When using your file.gtf , the program works without error messages. But the resulting file has zeros for all mobile elements. I corrected the column with the chromosome in your file.gtf (leaving only the number, removing the "Chr"). After that, the program correctly annotates mobile elements. I decided to write about it just in case for other readers and users.
Thanks!
I would like to ask, how much does it matter which files are specified as control (with the -c option) and experience (with the -t option)? As I understand it, this is of fundamental importance only for data processing using DESeq2.
Hi, The results you saw for AT3G32415 is probably described on this line:
gene/TE Heat_16_h_161 Heat_16_h_162 Wt_1 Wt_2 ATCOPIA78:Copia:LTR 451253 395844 205 834
Thanks.
Hi! I would like to report a possible error in your answer. When using your file.gtf , the program works without error messages. But the resulting file has zeros for all mobile elements. I corrected the column with the chromosome in your file.gtf (leaving only the number, removing the "Chr"). After that, the program correctly annotates mobile elements. I decided to write about it just in case for other readers and users.
Thanks!
Hi,
Thanks for your feedback. Yes, mismatch between chromosome names is a major reason why there might be zero counts for all TE/genes. Unfortunately, we cannot control what nomenclature is used for the chromosome names, so we depend on either a given convention (e.g. UCSC vs GENCODE vs Ensembl) or the user to ensure that they match. Your fix of correcting the GTF is the right approach.
Thanks.
I would like to ask, how much does it matter which files are specified as control (with the -c option) and experience (with the -t option)? As I understand it, this is of fundamental importance only for data processing using DESeq2.
Hi,
You are correct that the importance of -c
and -t
applies only the differential analysis (e.g. DESeq2
). In that case, the libraries in -c
are used as the "reference" for the -t
to compare against. Thus, values such as log2 fold change are calculated as log2(treatment value / control value)
. Thus the libraries used for each one would change the sign (i.e. directionality) of the value.
Thanks
Hello! I created TE.gtf file for use with TEtranscripts. I get error "Error in building TE index "
Please, check my TE.gtf What is wrong in my annotations?
Thanks!! sltk_bed_mod.gtf.zip