mhammell-laboratory / TEtranscripts

A package for including transposable elements in differential enrichment analysis of sequencing datasets.
http://hammelllab.labsites.cshl.edu/software/#TEtranscripts
GNU General Public License v3.0
206 stars 29 forks source link

Error building TE index #167

Closed xxu092 closed 5 months ago

xxu092 commented 6 months ago

Hello!

Thank you for building this amazing tool. I am very excited to try it out. When I try to run TEcount I keep getting error saying Error in building TE index I think I formatted my TE gtf file correctly, this is the snippet

scaffold_1  RepeatMasker    match   3   92  .   -   .   gene_id "scaffold_1_3-92"; transcript_id "scaffold_1_3-92LTR/Gypsy"; family_id "TE_00001165_LTR"; class_id "LTR/Gypsy";
scaffold_1  RepeatMasker    match   58  922 .   -   .   gene_id "scaffold_1_58-922"; transcript_id "scaffold_1_58-922DNA/DTM"; family_id "TE_00001112"; class_id "DNA/DTM";
scaffold_1  RepeatMasker    match   898 3063    .   -   .   gene_id "scaffold_1_898-3063"; transcript_id "scaffold_1_898-3063LTR/Gypsy"; family_id "TE_00001249_INT"; class_id "LTR/Gypsy";
scaffold_1  RepeatMasker    match   3064    4030    .   +   .   gene_id "scaffold_1_3064-4030"; transcript_id "scaffold_1_3064-4030LTR/unknown"; family_id "TE_00000892_INT"; class_id "LTR/unknown";
scaffold_1  RepeatMasker    match   4031    4343    .   -   .   gene_id "scaffold_1_4031-4343"; transcript_id "scaffold_1_4031-4343LTR/Gypsy"; family_id "TE_00001249_INT"; class_id "LTR/Gypsy";

Is my TE gtf file formatted wrong? Do you have any suggestions on how to move forward from here? I also have attached the original gff3 file I used to make this gtf file. Thank you in advance! ZradcleanRm1.gff3.zip

olivertam commented 6 months ago

Hi,

Thank you for your interest in the software. I think the issue is that the third column is match, whereas the program expects it to be exon (to mimic gene GTF nomenclature). I might change the gene_id to what is currently family_id, and change the family_id to the portion after the / in the class_id. You can keep the transcript_id the same. So in your example, it would look like this:

scaffold_1  RepeatMasker    exon    3   92  .   -   .   gene_id "TE_00001165_LTR"; transcript_id "scaffold_1_3-92LTR/Gypsy"; family_id "Gypsy"; class_id "LTR";
scaffold_1  RepeatMasker    exon    58  922 .   -   .   gene_id "TE_00001112"; transcript_id "scaffold_1_58-922DNA/DTM"; family_id "DTM"; class_id "DNA";
scaffold_1  RepeatMasker    exon    898 3063    .   -   .   gene_id "TE_00001249_INT"; transcript_id "scaffold_1_898-3063LTR/Gypsy"; family_id "Gypsy"; class_id "LTR";
scaffold_1  RepeatMasker    exon    3064    4030    .   +   .   gene_id "TE_00000892_INT"; transcript_id "scaffold_1_3064-4030LTR/unknown"; family_id "unknown"; class_id "LTR";
scaffold_1  RepeatMasker    exon    4031    4343    .   -   .   gene_id "TE_00001249_INT"; transcript_id "scaffold_1_4031-4343LTR/Gypsy"; family_id "Gypsy"; class_id "LTR";

Let us know if you still encounter any issue, and we can try to troubleshoot further.

Thanks.

xxu092 commented 6 months ago

Hi Oliver,

Thank you very much for your suggestion, I modified TE gtf file as below but I am still getting the same error.

scaffold_1  RepeatMasker    exon    3   92  .   -   .   gene_id "TE_00001165_LTR"; transcript_id "scaffold_1_3-92LTR/Gypsy"; family_id "Gypsy"; class_id "LTR";
scaffold_1  RepeatMasker    exon    58  922 .   -   .   gene_id "TE_00001112"; transcript_id "scaffold_1_58-922DNA/DTM"; family_id "DTM"; class_id "DNA";
scaffold_1  RepeatMasker    exon    898 3063    .   -   .   gene_id "TE_00001249_INT"; transcript_id "scaffold_1_898-3063LTR/Gypsy"; family_id "Gypsy"; class_id "LTR";
scaffold_1  RepeatMasker    exon    3064    4030    .   +   .   gene_id "TE_00000892_INT"; transcript_id "scaffold_1_3064-4030LTR/unknown"; family_id "unknown"; class_id "LTR";

This is the full log file.

/rhome/xxu092/.conda/envs/TEtranscripts/bin/TEcount:4: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
  __import__('pkg_resources').run_script('TEtranscripts==2.2.3', 'TEcount')
INFO  @ Mon, 18 Dec 2023 15:44:54: 
# ARGUMENTS LIST:
# name = Zrad
# BAM file = /rhome/xxu092/bigdata/TEtranscripts/results/ZradAligned.out.bam
# GTF file = /rhome/xxu092/bigdata/TEtranscripts/Geneanno/Zoophthora_radicans_ATCC_208865.gtf 
# TE file = /rhome/xxu092/bigdata/TEtranscripts/TEanno/Zradclean.rm.gtf 
# multi-mapper mode = multi 
# stranded = no 
# number of iteration = 100
# Alignments grouped by read ID = True

INFO  @ Mon, 18 Dec 2023 15:44:54: Processing GTF files ... 

INFO  @ Mon, 18 Dec 2023 15:44:54: Building gene index ....... 

INFO  @ Mon, 18 Dec 2023 15:45:18: Done building gene index ...... 

INFO  @ Mon, 18 Dec 2023 15:45:23: Building TE index ....... 

Error in building TE index 
olivertam commented 6 months ago

Hi,

Could you provide your modified GTF? We can try to troubleshoot it.

Thanks.

xxu092 commented 6 months ago

Zradclean.rm.zip Thank you for your reply. This is my modified GTF file. Thank you so much for your time!

olivertam commented 6 months ago

Hi,

I opened your GTF, and noticed that it's truncated (the last line is incomplete).

scaffold_55     RepeatMasker    exon    299660  299874  .       -       .       gene_id "TE_00001121_INT"; transcript_id "scaffold_55_299660-299874LTR/Gypsy"; family_id "Gypsy"; class_id "LTR";
scaffold_55     RepeatMasker    exon    299875  300373  .       +       .       gene_id "TE_00000983_INT"; transcript_id "scaffold_55_2998

Could you confirm that you sent the full file to me?

Thanks.

xxu092 commented 6 months ago

My apologies, the file I sent you was not the full file. I am attaching the full file here. Thank you so much. Zradclean.rm.zip

olivertam commented 6 months ago

Hi,

Thank you for your GTF. I noticed that there are a lot of overlapping annotations, which is not something that we typically see for most RepeatMasker annotations. This is probably what is throwing the error, as the index builder is not expecting two different TE annotation for the same genomic region. E.g.

scaffold_1      RepeatMasker    exon    3       92      .       -       .       gene_id "TE_00001165_LTR"; transcript_id "scaffold_1_3-92LTR/Gypsy"; family_id "Gypsy"; class_id "LTR";
scaffold_1      RepeatMasker    exon    58      922     .       -       .       gene_id "TE_00001112"; transcript_id "scaffold_1_58-922DNA/DTM"; family_id "DTM"; class_id "DNA";
scaffold_1      RepeatMasker    exon    898     3063    .       -       .       gene_id "TE_00001249_INT"; transcript_id "scaffold_1_898-3063LTR/Gypsy"; family_id "Gypsy"; class_id "LTR";

I have heard that RepeatMasker now attempts to generate nested repeat annotations. In those cases, you would definitely want to process those more carefully, as not the entire genomic interval belongs to the annotated TE. It might be best if you could remove those overlapping annotations, or at least try to trim them to not overlap.

Thanks.

xxu092 commented 6 months ago

Hi Oliver,

Thank you so much for your suggestions. I have processed the repeatmasker file to remove all overlapping repeats but I am still getting the same error. This is the log file. Can the error come from pkg_resources was deprecated? I have also attached my updated gtf file. I appreciate your help.

/rhome/xxu092/.conda/envs/TEtranscripts/bin/TEcount:4: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
  __import__('pkg_resources').run_script('TEtranscripts==2.2.3', 'TEcount')
INFO  @ Tue, 19 Dec 2023 14:55:08: 
# ARGUMENTS LIST:
# name = Zrad
# BAM file = /rhome/xxu092/bigdata/TEtranscripts/results/ZradAligned.out.bam
# GTF file = /rhome/xxu092/bigdata/TEtranscripts/Geneanno/Zoophthora_radicans_ATCC_208865.gtf 
# TE file = /rhome/xxu092/bigdata/TEtranscripts/TEanno/Zradclean.edta.gtf 
# multi-mapper mode = multi 
# stranded = no 
# number of iteration = 100
# Alignments grouped by read ID = True

INFO  @ Tue, 19 Dec 2023 14:55:08: Processing GTF files ... 

INFO  @ Tue, 19 Dec 2023 14:55:08: Building gene index ....... 

INFO  @ Tue, 19 Dec 2023 14:55:36: Done building gene index ...... 

INFO  @ Tue, 19 Dec 2023 14:55:41: Building TE index ....... 

Error in building TE index 

Zradclean.edta.gtf.zip

olivertam commented 6 months ago

Hi,

Thank you for your patience. I took a close look at the GTF, and found the following lines which looks suspicious:

$ grep -e "e+" Zradclean.edta.gtf
scaffold_95     EDTA    exon    299544  3e+05   1057    +       .       gene_id "TE_00000715_LTR"; transcript_id "scaffold_95_299544-3e+05LTR/Gypsy"; family_id "Gypsy"; class_id "LTR";
scaffold_126    EDTA    exon    599359  6e+05   4458    +       .       gene_id "TE_00001200_LTR"; transcript_id "scaffold_126_599359-6e+05LTR/Gypsy"; family_id "Gypsy"; class_id "LTR";
scaffold_157    EDTA    exon    499658  5e+05   1271    +       .       gene_id "TE_00000781"; transcript_id "scaffold_157_499658-5e+05DNA/DTM"; family_id "DTM"; class_id "DNA";
scaffold_174    EDTA    exon    599720  6e+05   762     -       .       gene_id "TE_00000934"; transcript_id "scaffold_174_599720-6e+05DNA/DTH"; family_id "DTH"; class_id "DNA";
scaffold_271    EDTA    exon    499537  5e+05   9592    -       .       gene_id "TE_00001102_INT"; transcript_id "scaffold_271_499537-5e+05LTR/Gypsy"; family_id "Gypsy"; class_id "LTR";
scaffold_679    EDTA    exon    2e+05   201482  6883    +       .       gene_id "TE_00000252"; transcript_id "scaffold_679_2e+05-201482LTR/unknown"; family_id "unknown"; class_id "LTR";

It appears that the start/end position was converted to scientific notation. This definitely causes issues with indexer. Once those lines were removed, the error is gone. Actually, it might be worth checking if the original GTF also had these lines, and whether removing them allows the GTF to work.

Thanks again for your patience, and all the best.

github-actions[bot] commented 5 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days