mhammell-laboratory / TEtranscripts

A package for including transposable elements in differential enrichment analysis of sequencing datasets.
http://hammelllab.labsites.cshl.edu/software/#TEtranscripts
GNU General Public License v3.0
228 stars 30 forks source link

TE GTF format error #205

Open Kkk-2001 opened 1 month ago

Kkk-2001 commented 1 month ago

when i tried to build a TE index,there always show me that

INFO @ Thu, 24 Oct 2024 11:14:53: Building TE index .......

chrA01 EDTA DNA/Helitron 1241 1375 . - . gene_id "nbis-gene-1"; transcript_id "TE_00008421"; ID "TE_00008421"; Method "homology"; Parent "nbis-gene-1"; TE GTF format error! There is no annotation at line 1. Error in building TE index fig and here is my part of TE.gtf TE_all2 - part1.gtf.txt

hoping for your answer thanks

Kkk-2001 commented 1 month ago

i have solved this problem by using the makeTEgtf.pl however there is a new problem below when finished buliding index,it shown this message: Reading sample files ...

[E::idx_find_and_load] Could not retrieve index file for '/home///RNA-seq-data/bam-maping/6h1/6h1.bam' **NOT COMPLETE*** If the BAM file is sorted by coordinates, please specify --sortByPos and re-run! Error: 0 [Exception type: SystemExit, raised in TEtranscripts:320] 1

hoping for your answer

olivertam commented 1 month ago

Hi,

Thank you for your interest in the software. What aligner are you using? If you're using HiSat2, you might need to specify --no-mixed and --no-discordant to remove discordant alignments (see here). If this does not resolve your issue, please provide your TEtranscripts command line, the full header from your BAM file (samtools view -H) and the first 10 lines of your gene and TE GTF files.

Thanks.

Kkk-2001 commented 4 weeks ago

i uesd bowtie2 to align at first time when i turned to use STAR to align with default parameter,it also shown that [E::idx_find_and_load] Could not retrieve index file for '/home/wjb-z2/K/RNA-seq-data/bam-maping/6h1/6h1Aligned.out.bam' but it still keep running and solve the problem itself,and i finished the following steps

however a error appear in the following procedure 1 i'm wondering whether this problem is caused by the server,so i decided to run it again the problem below didn't solved: LTR/Copiastency! 3000769..3005760#LTR/Copia:chrA01:3000769..3005760#LTR/Copia:LTR/Copia Error: 1 [Exception type: SystemExit, raised in TEindex.py:380]

olivertam commented 4 weeks ago

Hi,

I suspect that there are issues with the TE GTF, where the name has both exclamation marks and #, which are typically not well liked by algorithms. If these could be removed, then hopefully that issue will be resolved.

Thanks.

Kkk-2001 commented 4 weeks ago

hi, i have checked my TE gtf file with head - 2 i think there might be something wrong with it, but this file is present by the script makeTEgtf.pl i'm wondering whether to input the TE gtf file like genome GTF file shown below 3 hoping for your answer thank you

olivertam commented 4 weeks ago

Hi,

You're right that the TE GTF is weirdly formatted, with a weird line break after the family_id value. I suspect that that might be in the input file (perhaps a non-Unix line break). Could you provide the first 10 lines of the input that you used for the makeTEgtf.pl script?

Thanks

Kkk-2001 commented 3 weeks ago

hi, I changed the in put file,and it works; however it shown me that TE inconsistency how can i solve this problem watching for your answer,thanks TE inconsistency! 3000769..3005760#LTR/Copia:chrA01:3000769..3005760#LTR/Copia:LTR/Copia:LTR/Copia Error: 1 [Exception type: SystemExit, raised in TEindex.py:380] 1

olivertam commented 3 weeks ago

Hi,

Could you provide either the input file that you used, or the GTF output so we can troubleshoot?

Thanks.

Kkk-2001 commented 3 weeks ago

TE0.zip here is my gtf file,thanks for your patient and hoping for your answer

olivertam commented 3 weeks ago

Hi,

There are many entries in your GTF file that has a very long (and unusual) name:

chrA01  2       exon    3000770 3005760 .       -       .       gene_id "chrA01:3000769..3005760#LTR/Copia"; transcript_id "chrA01:3000769..3005760#LTR/Copia"; family_id "LTR/Copia"; class_id "LTR/Copia"; gene_name "chrA01:3000769..3005760#LTR/Copia:TE";
chrA01  2       exon    3888961 3893525 .       +       .       gene_id "chrA01:3888960..3893525#LTR/Copia"; transcript_id "chrA01:3888960..3893525#LTR/Copia"; family_id "LTR/Copia"; class_id "LTR/Copia"; gene_name "chrA01:3888960..3893525#LTR/Copia:TE";
chrA01  2       exon    4578354 4583306 .       +       .       gene_id "chrA01:4578353..4583306#LTR/Copia"; transcript_id "chrA01:4578353..4583306#LTR/Copia"; family_id "LTR/Copia"; class_id "LTR/Copia"; gene_name "chrA01:4578353..4583306#LTR/Copia:TE";
chrA01  2       exon    6028762 6033776 .       +       .       gene_id "chrA01:6028761..6033776#LTR/Copia"; transcript_id "chrA01:6028761..6033776#LTR/Copia"; family_id "LTR/Copia"; class_id "LTR/Copia"; gene_name "chrA01:6028761..6033776#LTR/Copia:TE";

In comparison, most of the other lines are like this:

chrA01  2       exon    47714   47826   .       -       .       gene_id "TE_00005857_LTR"; transcript_id "TE_00005857_LTR"; family_id "LTR/Copia"; class_id "LTR/Copia"; gene_name "TE_00005857_LTR:TE";
chrA01  2       exon    52520   53156   .       -       .       gene_id "TE_00007443_INT"; transcript_id "TE_00007443_INT"; family_id "LTR/Copia"; class_id "LTR/Copia"; gene_name "TE_00007443_INT:TE";
chrA01  2       exon    120436  120703  .       +       .       gene_id "TE_00003717_INT"; transcript_id "TE_00003717_INT"; family_id "LTR/Copia"; class_id "LTR/Copia"; gene_name "TE_00003717_INT:TE";
chrA01  2       exon    155115  155431  .       +       .       gene_id "TE_00009250_INT"; transcript_id "TE_00009250_INT"; family_id "LTR/Copia"; class_id "LTR/Copia"; gene_name "TE_00009250_INT:TE";
chrA01  2       exon    162906  163063  .       -       .       gene_id "TE_00005013_INT"; transcript_id "TE_00005013_INT"; family_id "LTR/Copia"; class_id "LTR/Copia"; gene_name "TE_00005013_INT:TE";
chrA01  2       exon    198903  199001  .       +       .       gene_id "TE_00006238_INT"; transcript_id "TE_00006238_INT"; family_id "LTR/Copia"; class_id "LTR/Copia"; gene_name "TE_00006238_INT:TE";
chrA01  2       exon    344116  345913  .       -       .       gene_id "TE_00003018_INT"; transcript_id "TE_00003018_INT"; family_id "LTR/Copia"; class_id "LTR/Copia"; gene_name "TE_00003018_INT:TE";

It appears that the original input might have these lines formatted differently from the others (were they merged from another prediction program?), and thus cause issues with the TE GTF when being processed.

Looking at the first example, it appears that the chrA01:3000769..3005760#LTR/Copia is also represented by three other overlapping entries:

chrA01  2       exon    3000770 3000938 .       -       .       gene_id "TE_00008904_LTR"; transcript_id "TE_00008904_LTR"; family_id "LTR/Copia"; class_id "LTR/Copia"; gene_name "TE_00008904_LTR:TE";
chrA01  2       exon    3000770 3005760 .       -       .       gene_id "chrA01:3000769..3005760#LTR/Copia"; transcript_id "chrA01:3000769..3005760#LTR/Copia"; family_id "LTR/Copia"; class_id "LTR/Copia"; gene_name "chrA01:3000769..3005760#LTR/Copia:TE";
chrA01  2       exon    3000940 3005590 .       -       .       gene_id "TE_00008840_INT"; transcript_id "TE_00008840_INT"; family_id "LTR/Copia"; class_id "LTR/Copia"; gene_name "TE_00008840_INT:TE";
chrA01  2       exon    3005592 3005760 .       -       .       gene_id "TE_00008904_LTR"; transcript_id "TE_00008904_LTR_dup1"; family_id "LTR/Copia"; class_id "LTR/Copia"; gene_name "TE_00008904_LTR:TE";

You could either remove the entry that is throwing the error, if you prefer to keep the "intact" entry, then I would recommend renaming the gene_id and transcript_id to remove symbols such as #, : and ...

Alternatively, if you want to provide the original input, we can quickly check to see if they should be processed as two separate inputs before merging into a single TE GTF.

Thanks.

Kkk-2001 commented 3 weeks ago

Thank you very much for your suggestion, the problem was resolved after removing the duplicates But the final result file is only treatment_vs_control.cntTable, treatment_vs_control_DESeq.R, treatment_vs_control_gene_TE_analysis.txt treatment_vs_control_sigdiff_geneTE.txt is missing. The error information is as follows ![新建会话 - wjb-z2@node2~ - Xshell 7 (Free for Home_School) 2024_10_29 14_07_37](https://github.com/user-attachments/assets/9d9fd415-b367-4a2d-97cd-7c8f8dc9d508) hoping for your answer ps:the result files are below result.zip

olivertam commented 3 weeks ago

Hi,

If you scroll down to the TE annotations, you will find that there is a "hidden" line-break in the output. This is typically introduced by Mac or Windows, as they use a different line break character than Unix. A quick fix for your cntTable file is:

$ sed -i 's/\r//g' 6_vs_0.cntTable

You will probably need to do the same to your GTF file, since that's where the error was propagated from.

Thanks.

Kkk-2001 commented 3 weeks ago

hello thanks for your help and patient,finally i completed the analysis i am wondering that whether the number in the result flie is the FPKM of TE or other value(like TPM) thanks for your help again and there is another question that why are there 900,000 TE in the gff and gtf file and only 10,000 TE in the results gff.zip gtf.zip result.zip is this caused by [E::idx_find_and_load] Could not retrieve index file for '/home/wjb-z2/Kongfanzhe/RNA-seq-data/bam-maping/0h2/0h2Aligned.out.bam'? the bam file is produced by STAR

olivertam commented 3 weeks ago

Hi,

TEtranscripts aggregates counts from different copies of a TE subfamily to enable better statistical analysis. This is done by combining counts based on the gene_id. Thus you would not get a count for each line of the GTF. The cntTable file contains the raw count. The gene_TE_analysis.txt contains the output from DESeq2, which is normalized mean (done by DESeq2) across all samples (baseMean), log 2 fold change (log2FoldChange), the standard error on fold change (lfcse), the raw p value (pval) and BH-corrected p value (padj). The sigdiff_gene_TE.txt contains the hits that were significant based on the significance threshold used (default is 0.05).

Thanks

Kkk-2001 commented 3 weeks ago

hi! Sorry to bother you again, i'm wondering that if I only need TE information but not genes, can I output only information about TE? after checking the output file, i found that the results were concentrated on some chromosomes and not at all on others I wonder if it's caused by the quantity of genes and TEs is over the limit and make the result incomplete

olivertam commented 3 weeks ago

Hi,

If you just want the TE output, you can just grep for lines with a :. I'm not sure what you mean by the results being concentrated on some chromosomes. Please be aware that the TE results are aggregated by gene_id, which typically contain copies from multiple chromosomes, so it's not clear how you would have determine which chromosomes were "concentrated".

Thanks.