mhammell-laboratory / TEtranscripts

A package for including transposable elements in differential enrichment analysis of sequencing datasets.
http://hammelllab.labsites.cshl.edu/software/#TEtranscripts
GNU General Public License v3.0
206 stars 29 forks source link

Error in building TE index with my TE.gtf file #140

Closed biomesser closed 12 months ago

biomesser commented 1 year ago

Hello! I created TE.gtf file for use with TEtranscripts. I get error "Error in building TE index "

Please, check my TE.gtf What is wrong in my annotations?

Thanks!! sltk_bed_mod.gtf.zip

olivertam commented 1 year ago

Hi,

Thank you for your interest in the software.

TEtranscripts require the TE GTF to have the following information and attributes: 1) Follow the GTF format. For example, columns 4 should be the start position, but it is taken by the TE name instead. I think moving column 4 should make it GTF compliant. 2) Column 3 should be "exon", as this is how the software treats each TE copy. 3) In the attributes section (column 9), it requires the following tags:

Here is the first line in your GTF:

1       TE      .       AT1TE00010      11896   11976   .       +       .      sub_family "ATCOPIA24"; family "ATCOPIA24"; super_family "Copia";

Here is how I would fix it:

1       TE      .       11896   11976   .       +       .       gene_id "ATCOPIA24"; transcript_id "AT1TE00010"; family_id "Copia"; class_id "LTR";

Please let me know if you have further questions.

Thanks.

biomesser commented 1 year ago

Hl, Olivertam!! Thanks for you ansver! Now my TE.gtf is valid.

Also, let me ask please: Previously I used TAIR10TE.gtf from https://labshare.cshl.edu/shares/mhammelllab/www-data/TEtranscripts/TE_GTF/TAIR10_TE.gtf.gz. My .bam files have high expressions of AT3G32415 (I checked it by IGV browser), but TE expression counting not found this TE. What could be the reason?

olivertam commented 1 year ago

Hi,

If you look up AT3G32415 on Arabidopsis.org, you would find that it's associated with the TE, AT3TE54550. That might be the entry that you need to look at in the TE output file, as the TAIR10 TE annotations follow the ATxTExxxxx rather than the gene annotation (ATxGxxxxx).

Thanks.

biomesser commented 1 year ago

Yes, my count output file contain ATxGxxxxx ID-s only. I think, TEtranscripts convert ATxTE ID-s to ATxG ID-s. But, i got only one counts-file (and DESeq2 file). I do`t have output file with ATxTE ID-s.

Thanks!!

olivertam commented 1 year ago

Hi,

Sorry, now I understand. TEtranscripts aggregate TE counts by element type (e.g. subfamily). Thus, AT3TE54550 is aggregated into ATCOPIA78. The ATxTE ID is for each copy, which is used for initial annotation, but the counts are then aggregated by the sub-family (gene_id)

Thanks.

Thanks.

biomesser commented 1 year ago

Thanks for you ansvers!!! That is my output file. It contein the ATCOPIA78. biomesser_TE_GTF_FROM_TEtran.cntTable.zip

olivertam commented 1 year ago

Hi,

The results you saw for AT3G32415 is probably described on this line:

gene/TE Heat_16_h_161   Heat_16_h_162   Wt_1    Wt_2
ATCOPIA78:Copia:LTR  451253  395844  205     834

Thanks.

biomesser commented 1 year ago

Yes, i understand it!! Thanks very match!

biomesser commented 1 year ago

Also, let me ask please, do you have .gtf-version (for use with TEtranscripts) of this .bed file TE annotations from https://raw.githubusercontent.com/KaushikPanda1/AthalianaTETranscripts/master/Panda_AT-TEs_annotation_v1.0.bed ? This is .bed file, that i had tried convert at my first post.

Thanks!

olivertam commented 1 year ago

Hi,

Here is the gzipped TE GTF of the BED annotation. The code to generate this is:

$ sed '1d' Panda_AT-TEs_annotation_v1.0.bed | \
   awk -v OFS="{tab}" '{print $1, "user_provided", "exon", $2+1, $3, $5, $6, ".", "gene_id \"" $7 "\"; transcript_id \"" $4 "\"; family_id \"" $8 "\"; class_id \"" $9 "\";"}' \
   sort -k1,1 -k4,5n \
   > Panda_AT-TEs_annotation_v1.0.gtf

Thanks.

github-actions[bot] commented 12 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days

biomesser commented 11 months ago

Sorry, now I understand. TEtranscripts aggregate TE counts by element type (e.g. subfamily). Thus, AT3TE54550 is aggregated into ATCOPIA78. The >ATxTE ID is for each copy, which is used for initial annotation, but the counts are then aggregated by the sub-family >(gene_id)

Hi Oliver!!

May I please ask how TEtranscripts compares id AT3G32415 with id AT3TE54550 and with gene_id/family_id ? Where can i get the table file with this information?

I need to match id, like AT3G32415, AT1G01010 ... from my .cntTable with their families/subfamilys.

Thanks!!

olivertam commented 11 months ago

Hi,

Actually, that information is from TAIR (see here for an example). If you need to match gene ID with TE, you can look at this file, and look for lines that has the following:

Note=transposable_element_gene

You can then look up the TE info in this file to see which TE subfamily it is.

Thanks.

biomesser commented 11 months ago

Hi,

The results you saw for AT3G32415 is probably described on this line:

gene/TE Heat_16_h_161   Heat_16_h_162   Wt_1    Wt_2
ATCOPIA78:Copia:LTR  451253  395844  205     834

Thanks.

Hi! I would like to report a possible error in your answer. When using your file.gtf , the program works without error messages. But the resulting file has zeros for all mobile elements. I corrected the column with the chromosome in your file.gtf (leaving only the number, removing the "Chr"). After that, the program correctly annotates mobile elements. I decided to write about it just in case for other readers and users.

Thanks!

biomesser commented 11 months ago

I would like to ask, how much does it matter which files are specified as control (with the -c option) and experience (with the -t option)? As I understand it, this is of fundamental importance only for data processing using DESeq2.

olivertam commented 11 months ago

Hi, The results you saw for AT3G32415 is probably described on this line:

gene/TE Heat_16_h_161   Heat_16_h_162   Wt_1    Wt_2
ATCOPIA78:Copia:LTR  451253  395844  205     834

Thanks.

Hi! I would like to report a possible error in your answer. When using your file.gtf , the program works without error messages. But the resulting file has zeros for all mobile elements. I corrected the column with the chromosome in your file.gtf (leaving only the number, removing the "Chr"). After that, the program correctly annotates mobile elements. I decided to write about it just in case for other readers and users.

Thanks!

Hi,

Thanks for your feedback. Yes, mismatch between chromosome names is a major reason why there might be zero counts for all TE/genes. Unfortunately, we cannot control what nomenclature is used for the chromosome names, so we depend on either a given convention (e.g. UCSC vs GENCODE vs Ensembl) or the user to ensure that they match. Your fix of correcting the GTF is the right approach.

Thanks.

olivertam commented 11 months ago

I would like to ask, how much does it matter which files are specified as control (with the -c option) and experience (with the -t option)? As I understand it, this is of fundamental importance only for data processing using DESeq2.

Hi,

You are correct that the importance of -c and -t applies only the differential analysis (e.g. DESeq2). In that case, the libraries in -c are used as the "reference" for the -t to compare against. Thus, values such as log2 fold change are calculated as log2(treatment value / control value). Thus the libraries used for each one would change the sign (i.e. directionality) of the value.

Thanks