Open Kkk-2001 opened 1 month ago
i have solved this problem by using the makeTEgtf.pl however there is a new problem below when finished buliding index,it shown this message: Reading sample files ...
[E::idx_find_and_load] Could not retrieve index file for '/home///RNA-seq-data/bam-maping/6h1/6h1.bam' **NOT COMPLETE*** If the BAM file is sorted by coordinates, please specify --sortByPos and re-run! Error: 0 [Exception type: SystemExit, raised in TEtranscripts:320]
hoping for your answer
Hi,
Thank you for your interest in the software.
What aligner are you using? If you're using HiSat2, you might need to specify --no-mixed
and --no-discordant
to remove discordant alignments (see here).
If this does not resolve your issue, please provide your TEtranscripts
command line, the full header from your BAM file (samtools view -H
) and the first 10 lines of your gene and TE GTF files.
Thanks.
i uesd bowtie2 to align at first time when i turned to use STAR to align with default parameter,it also shown that [E::idx_find_and_load] Could not retrieve index file for '/home/wjb-z2/K/RNA-seq-data/bam-maping/6h1/6h1Aligned.out.bam' but it still keep running and solve the problem itself,and i finished the following steps
however a error appear in the following procedure i'm wondering whether this problem is caused by the server,so i decided to run it again the problem below didn't solved: LTR/Copiastency! 3000769..3005760#LTR/Copia:chrA01:3000769..3005760#LTR/Copia:LTR/Copia Error: 1 [Exception type: SystemExit, raised in TEindex.py:380]
Hi,
I suspect that there are issues with the TE GTF, where the name has both exclamation marks and #
, which are typically not well liked by algorithms. If these could be removed, then hopefully that issue will be resolved.
Thanks.
hi, i have checked my TE gtf file with head - i think there might be something wrong with it, but this file is present by the script makeTEgtf.pl i'm wondering whether to input the TE gtf file like genome GTF file shown below hoping for your answer thank you
Hi,
You're right that the TE GTF is weirdly formatted, with a weird line break after the family_id
value. I suspect that that might be in the input file (perhaps a non-Unix line break).
Could you provide the first 10 lines of the input that you used for the makeTEgtf.pl
script?
Thanks
hi, I changed the in put file,and it works; however it shown me that TE inconsistency how can i solve this problem watching for your answer,thanks TE inconsistency! 3000769..3005760#LTR/Copia:chrA01:3000769..3005760#LTR/Copia:LTR/Copia:LTR/Copia Error: 1 [Exception type: SystemExit, raised in TEindex.py:380]
Hi,
Could you provide either the input file that you used, or the GTF output so we can troubleshoot?
Thanks.
TE0.zip here is my gtf file,thanks for your patient and hoping for your answer
Hi,
There are many entries in your GTF file that has a very long (and unusual) name:
chrA01 2 exon 3000770 3005760 . - . gene_id "chrA01:3000769..3005760#LTR/Copia"; transcript_id "chrA01:3000769..3005760#LTR/Copia"; family_id "LTR/Copia"; class_id "LTR/Copia"; gene_name "chrA01:3000769..3005760#LTR/Copia:TE";
chrA01 2 exon 3888961 3893525 . + . gene_id "chrA01:3888960..3893525#LTR/Copia"; transcript_id "chrA01:3888960..3893525#LTR/Copia"; family_id "LTR/Copia"; class_id "LTR/Copia"; gene_name "chrA01:3888960..3893525#LTR/Copia:TE";
chrA01 2 exon 4578354 4583306 . + . gene_id "chrA01:4578353..4583306#LTR/Copia"; transcript_id "chrA01:4578353..4583306#LTR/Copia"; family_id "LTR/Copia"; class_id "LTR/Copia"; gene_name "chrA01:4578353..4583306#LTR/Copia:TE";
chrA01 2 exon 6028762 6033776 . + . gene_id "chrA01:6028761..6033776#LTR/Copia"; transcript_id "chrA01:6028761..6033776#LTR/Copia"; family_id "LTR/Copia"; class_id "LTR/Copia"; gene_name "chrA01:6028761..6033776#LTR/Copia:TE";
In comparison, most of the other lines are like this:
chrA01 2 exon 47714 47826 . - . gene_id "TE_00005857_LTR"; transcript_id "TE_00005857_LTR"; family_id "LTR/Copia"; class_id "LTR/Copia"; gene_name "TE_00005857_LTR:TE";
chrA01 2 exon 52520 53156 . - . gene_id "TE_00007443_INT"; transcript_id "TE_00007443_INT"; family_id "LTR/Copia"; class_id "LTR/Copia"; gene_name "TE_00007443_INT:TE";
chrA01 2 exon 120436 120703 . + . gene_id "TE_00003717_INT"; transcript_id "TE_00003717_INT"; family_id "LTR/Copia"; class_id "LTR/Copia"; gene_name "TE_00003717_INT:TE";
chrA01 2 exon 155115 155431 . + . gene_id "TE_00009250_INT"; transcript_id "TE_00009250_INT"; family_id "LTR/Copia"; class_id "LTR/Copia"; gene_name "TE_00009250_INT:TE";
chrA01 2 exon 162906 163063 . - . gene_id "TE_00005013_INT"; transcript_id "TE_00005013_INT"; family_id "LTR/Copia"; class_id "LTR/Copia"; gene_name "TE_00005013_INT:TE";
chrA01 2 exon 198903 199001 . + . gene_id "TE_00006238_INT"; transcript_id "TE_00006238_INT"; family_id "LTR/Copia"; class_id "LTR/Copia"; gene_name "TE_00006238_INT:TE";
chrA01 2 exon 344116 345913 . - . gene_id "TE_00003018_INT"; transcript_id "TE_00003018_INT"; family_id "LTR/Copia"; class_id "LTR/Copia"; gene_name "TE_00003018_INT:TE";
It appears that the original input might have these lines formatted differently from the others (were they merged from another prediction program?), and thus cause issues with the TE GTF when being processed.
Looking at the first example, it appears that the chrA01:3000769..3005760#LTR/Copia
is also represented by three other overlapping entries:
chrA01 2 exon 3000770 3000938 . - . gene_id "TE_00008904_LTR"; transcript_id "TE_00008904_LTR"; family_id "LTR/Copia"; class_id "LTR/Copia"; gene_name "TE_00008904_LTR:TE";
chrA01 2 exon 3000770 3005760 . - . gene_id "chrA01:3000769..3005760#LTR/Copia"; transcript_id "chrA01:3000769..3005760#LTR/Copia"; family_id "LTR/Copia"; class_id "LTR/Copia"; gene_name "chrA01:3000769..3005760#LTR/Copia:TE";
chrA01 2 exon 3000940 3005590 . - . gene_id "TE_00008840_INT"; transcript_id "TE_00008840_INT"; family_id "LTR/Copia"; class_id "LTR/Copia"; gene_name "TE_00008840_INT:TE";
chrA01 2 exon 3005592 3005760 . - . gene_id "TE_00008904_LTR"; transcript_id "TE_00008904_LTR_dup1"; family_id "LTR/Copia"; class_id "LTR/Copia"; gene_name "TE_00008904_LTR:TE";
You could either remove the entry that is throwing the error, if you prefer to keep the "intact" entry, then I would recommend renaming the gene_id
and transcript_id
to remove symbols such as #
, :
and ..
.
Alternatively, if you want to provide the original input, we can quickly check to see if they should be processed as two separate inputs before merging into a single TE GTF.
Thanks.
Thank you very much for your suggestion, the problem was resolved after removing the duplicates But the final result file is only treatment_vs_control.cntTable, treatment_vs_control_DESeq.R, treatment_vs_control_gene_TE_analysis.txt treatment_vs_control_sigdiff_geneTE.txt is missing. The error information is as follows ![新建会话 - wjb-z2@node2~ - Xshell 7 (Free for Home_School) 2024_10_29 14_07_37](https://github.com/user-attachments/assets/9d9fd415-b367-4a2d-97cd-7c8f8dc9d508) hoping for your answer ps:the result files are below result.zip
Hi,
If you scroll down to the TE annotations, you will find that there is a "hidden" line-break in the output. This is typically introduced by Mac or Windows, as they use a different line break character than Unix.
A quick fix for your cntTable
file is:
$ sed -i 's/\r//g' 6_vs_0.cntTable
You will probably need to do the same to your GTF file, since that's where the error was propagated from.
Thanks.
hello thanks for your help and patient,finally i completed the analysis i am wondering that whether the number in the result flie is the FPKM of TE or other value(like TPM) thanks for your help again and there is another question that why are there 900,000 TE in the gff and gtf file and only 10,000 TE in the results gff.zip gtf.zip result.zip is this caused by [E::idx_find_and_load] Could not retrieve index file for '/home/wjb-z2/Kongfanzhe/RNA-seq-data/bam-maping/0h2/0h2Aligned.out.bam'? the bam file is produced by STAR
Hi,
TEtranscripts
aggregates counts from different copies of a TE subfamily to enable better statistical analysis. This is done by combining counts based on the gene_id
. Thus you would not get a count for each line of the GTF.
The cntTable
file contains the raw count. The gene_TE_analysis.txt
contains the output from DESeq2
, which is normalized mean (done by DESeq2
) across all samples (baseMean
), log 2 fold change (log2FoldChange
), the standard error on fold change (lfcse
), the raw p value (pval
) and BH-corrected p value (padj
). The sigdiff_gene_TE.txt
contains the hits that were significant based on the significance threshold used (default is 0.05).
Thanks
hi! Sorry to bother you again, i'm wondering that if I only need TE information but not genes, can I output only information about TE? after checking the output file, i found that the results were concentrated on some chromosomes and not at all on others I wonder if it's caused by the quantity of genes and TEs is over the limit and make the result incomplete
Hi,
If you just want the TE output, you can just grep
for lines with a :
.
I'm not sure what you mean by the results being concentrated on some chromosomes. Please be aware that the TE results are aggregated by gene_id
, which typically contain copies from multiple chromosomes, so it's not clear how you would have determine which chromosomes were "concentrated".
Thanks.
when i tried to build a TE index,there always show me that
INFO @ Thu, 24 Oct 2024 11:14:53: Building TE index .......
chrA01 EDTA DNA/Helitron 1241 1375 . - . gene_id "nbis-gene-1"; transcript_id "TE_00008421"; ID "TE_00008421"; Method "homology"; Parent "nbis-gene-1"; TE GTF format error! There is no annotation at line 1. Error in building TE index and here is my part of TE.gtf TE_all2 - part1.gtf.txt
hoping for your answer thanks