Closed xxu092 closed 5 months ago
Hi,
Thank you for your interest in the software.
I think the issue is that the third column is match
, whereas the program expects it to be exon
(to mimic gene GTF nomenclature).
I might change the gene_id
to what is currently family_id
, and change the family_id
to the portion after the /
in the class_id
. You can keep the transcript_id
the same.
So in your example, it would look like this:
scaffold_1 RepeatMasker exon 3 92 . - . gene_id "TE_00001165_LTR"; transcript_id "scaffold_1_3-92LTR/Gypsy"; family_id "Gypsy"; class_id "LTR";
scaffold_1 RepeatMasker exon 58 922 . - . gene_id "TE_00001112"; transcript_id "scaffold_1_58-922DNA/DTM"; family_id "DTM"; class_id "DNA";
scaffold_1 RepeatMasker exon 898 3063 . - . gene_id "TE_00001249_INT"; transcript_id "scaffold_1_898-3063LTR/Gypsy"; family_id "Gypsy"; class_id "LTR";
scaffold_1 RepeatMasker exon 3064 4030 . + . gene_id "TE_00000892_INT"; transcript_id "scaffold_1_3064-4030LTR/unknown"; family_id "unknown"; class_id "LTR";
scaffold_1 RepeatMasker exon 4031 4343 . - . gene_id "TE_00001249_INT"; transcript_id "scaffold_1_4031-4343LTR/Gypsy"; family_id "Gypsy"; class_id "LTR";
Let us know if you still encounter any issue, and we can try to troubleshoot further.
Thanks.
Hi Oliver,
Thank you very much for your suggestion, I modified TE gtf file as below but I am still getting the same error.
scaffold_1 RepeatMasker exon 3 92 . - . gene_id "TE_00001165_LTR"; transcript_id "scaffold_1_3-92LTR/Gypsy"; family_id "Gypsy"; class_id "LTR";
scaffold_1 RepeatMasker exon 58 922 . - . gene_id "TE_00001112"; transcript_id "scaffold_1_58-922DNA/DTM"; family_id "DTM"; class_id "DNA";
scaffold_1 RepeatMasker exon 898 3063 . - . gene_id "TE_00001249_INT"; transcript_id "scaffold_1_898-3063LTR/Gypsy"; family_id "Gypsy"; class_id "LTR";
scaffold_1 RepeatMasker exon 3064 4030 . + . gene_id "TE_00000892_INT"; transcript_id "scaffold_1_3064-4030LTR/unknown"; family_id "unknown"; class_id "LTR";
This is the full log file.
/rhome/xxu092/.conda/envs/TEtranscripts/bin/TEcount:4: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
__import__('pkg_resources').run_script('TEtranscripts==2.2.3', 'TEcount')
INFO @ Mon, 18 Dec 2023 15:44:54:
# ARGUMENTS LIST:
# name = Zrad
# BAM file = /rhome/xxu092/bigdata/TEtranscripts/results/ZradAligned.out.bam
# GTF file = /rhome/xxu092/bigdata/TEtranscripts/Geneanno/Zoophthora_radicans_ATCC_208865.gtf
# TE file = /rhome/xxu092/bigdata/TEtranscripts/TEanno/Zradclean.rm.gtf
# multi-mapper mode = multi
# stranded = no
# number of iteration = 100
# Alignments grouped by read ID = True
INFO @ Mon, 18 Dec 2023 15:44:54: Processing GTF files ...
INFO @ Mon, 18 Dec 2023 15:44:54: Building gene index .......
INFO @ Mon, 18 Dec 2023 15:45:18: Done building gene index ......
INFO @ Mon, 18 Dec 2023 15:45:23: Building TE index .......
Error in building TE index
Hi,
Could you provide your modified GTF? We can try to troubleshoot it.
Thanks.
Zradclean.rm.zip Thank you for your reply. This is my modified GTF file. Thank you so much for your time!
Hi,
I opened your GTF, and noticed that it's truncated (the last line is incomplete).
scaffold_55 RepeatMasker exon 299660 299874 . - . gene_id "TE_00001121_INT"; transcript_id "scaffold_55_299660-299874LTR/Gypsy"; family_id "Gypsy"; class_id "LTR";
scaffold_55 RepeatMasker exon 299875 300373 . + . gene_id "TE_00000983_INT"; transcript_id "scaffold_55_2998
Could you confirm that you sent the full file to me?
Thanks.
My apologies, the file I sent you was not the full file. I am attaching the full file here. Thank you so much. Zradclean.rm.zip
Hi,
Thank you for your GTF. I noticed that there are a lot of overlapping annotations, which is not something that we typically see for most RepeatMasker annotations. This is probably what is throwing the error, as the index builder is not expecting two different TE annotation for the same genomic region. E.g.
scaffold_1 RepeatMasker exon 3 92 . - . gene_id "TE_00001165_LTR"; transcript_id "scaffold_1_3-92LTR/Gypsy"; family_id "Gypsy"; class_id "LTR";
scaffold_1 RepeatMasker exon 58 922 . - . gene_id "TE_00001112"; transcript_id "scaffold_1_58-922DNA/DTM"; family_id "DTM"; class_id "DNA";
scaffold_1 RepeatMasker exon 898 3063 . - . gene_id "TE_00001249_INT"; transcript_id "scaffold_1_898-3063LTR/Gypsy"; family_id "Gypsy"; class_id "LTR";
I have heard that RepeatMasker now attempts to generate nested repeat annotations. In those cases, you would definitely want to process those more carefully, as not the entire genomic interval belongs to the annotated TE. It might be best if you could remove those overlapping annotations, or at least try to trim them to not overlap.
Thanks.
Hi Oliver,
Thank you so much for your suggestions. I have processed the repeatmasker file to remove all overlapping repeats but I am still getting the same error. This is the log file. Can the error come from pkg_resources was deprecated? I have also attached my updated gtf file. I appreciate your help.
/rhome/xxu092/.conda/envs/TEtranscripts/bin/TEcount:4: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
__import__('pkg_resources').run_script('TEtranscripts==2.2.3', 'TEcount')
INFO @ Tue, 19 Dec 2023 14:55:08:
# ARGUMENTS LIST:
# name = Zrad
# BAM file = /rhome/xxu092/bigdata/TEtranscripts/results/ZradAligned.out.bam
# GTF file = /rhome/xxu092/bigdata/TEtranscripts/Geneanno/Zoophthora_radicans_ATCC_208865.gtf
# TE file = /rhome/xxu092/bigdata/TEtranscripts/TEanno/Zradclean.edta.gtf
# multi-mapper mode = multi
# stranded = no
# number of iteration = 100
# Alignments grouped by read ID = True
INFO @ Tue, 19 Dec 2023 14:55:08: Processing GTF files ...
INFO @ Tue, 19 Dec 2023 14:55:08: Building gene index .......
INFO @ Tue, 19 Dec 2023 14:55:36: Done building gene index ......
INFO @ Tue, 19 Dec 2023 14:55:41: Building TE index .......
Error in building TE index
Hi,
Thank you for your patience. I took a close look at the GTF, and found the following lines which looks suspicious:
$ grep -e "e+" Zradclean.edta.gtf
scaffold_95 EDTA exon 299544 3e+05 1057 + . gene_id "TE_00000715_LTR"; transcript_id "scaffold_95_299544-3e+05LTR/Gypsy"; family_id "Gypsy"; class_id "LTR";
scaffold_126 EDTA exon 599359 6e+05 4458 + . gene_id "TE_00001200_LTR"; transcript_id "scaffold_126_599359-6e+05LTR/Gypsy"; family_id "Gypsy"; class_id "LTR";
scaffold_157 EDTA exon 499658 5e+05 1271 + . gene_id "TE_00000781"; transcript_id "scaffold_157_499658-5e+05DNA/DTM"; family_id "DTM"; class_id "DNA";
scaffold_174 EDTA exon 599720 6e+05 762 - . gene_id "TE_00000934"; transcript_id "scaffold_174_599720-6e+05DNA/DTH"; family_id "DTH"; class_id "DNA";
scaffold_271 EDTA exon 499537 5e+05 9592 - . gene_id "TE_00001102_INT"; transcript_id "scaffold_271_499537-5e+05LTR/Gypsy"; family_id "Gypsy"; class_id "LTR";
scaffold_679 EDTA exon 2e+05 201482 6883 + . gene_id "TE_00000252"; transcript_id "scaffold_679_2e+05-201482LTR/unknown"; family_id "unknown"; class_id "LTR";
It appears that the start/end position was converted to scientific notation. This definitely causes issues with indexer. Once those lines were removed, the error is gone. Actually, it might be worth checking if the original GTF also had these lines, and whether removing them allows the GTF to work.
Thanks again for your patience, and all the best.
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days
Hello!
Thank you for building this amazing tool. I am very excited to try it out. When I try to run TEcount I keep getting error saying
Error in building TE index
I think I formatted my TE gtf file correctly, this is the snippetIs my TE gtf file formatted wrong? Do you have any suggestions on how to move forward from here? I also have attached the original gff3 file I used to make this gtf file. Thank you in advance! ZradcleanRm1.gff3.zip