oushujun / EDTA

Extensive de-novo TE Annotator
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1905-y
GNU General Public License v3.0

CrossmatchSearchEngine::parseOutput issue #479

Open cai1991 opened 4 months ago

cai1991 commented 4 months ago

Hi Shujun,

Thanks a lot for developing this great tool. I installed EDTA using: mamba env create -f EDTA_2.2.x.yml

EDTA works well on the test data. But for my genome (a plant, genome size ~600 Mb), I encountered warnings/errors such as "CrossmatchSearchEngine::parseOutput: Unable to open results file: " and "SINE/NA not found in the TE_SO database". Please see the detailed information below. All the output files were produced. Do these warnings affect the results, and could you please help me figure this out? Thanks a lot in advance.

Best regards, Chengcheng

my command:

#!/bin/bash

genome=bro.LA105.7gaps.chr.newID.fa
cds=T24.chr.cds.fasta
threads=48

/data3/caicc/Softwares/50/miniconda3/envs/EDTA2/bin/perl /data3/caicc/Softwares/50/EDTA/EDTA-master/EDTA.pl --genome $genome --cds $cds --anno 1 --threads $threads --overwrite 1 --sensitive 1

The log file:


#########################################################
##### Extensive de-novo TE Annotator (EDTA) v2.2.1  #####
##### Shujun Ou (shujun.ou.1@gmail.com)             #####
#########################################################

Parameters: --genome bro.LA105.7gaps.chr.newID.fa --cds T24.chr.cds.fasta --anno 1 --threads 48 --overwrite 1 --sensitive 1 --debug 1

Tue Jun 25 21:49:51 CST 2024    Dependency checking:
                All passed!

    A CDS file T24.chr.cds.fasta is provided via --cds. Please make sure this is the DNA sequence of coding regions only.

Tue Jun 25 21:49:59 CST 2024    Obtain raw TE libraries using various structure-based programs: 
Tue Jun 25 21:49:59 CST 2024    EDTA_raw: Check dependencies, prepare working directories.

Tue Jun 25 21:50:02 CST 2024    Start to find LTR candidates.

Tue Jun 25 21:50:02 CST 2024    Identify LTR retrotransposon candidates from scratch.

Tue Jun 25 23:10:51 CST 2024    Finish finding LTR candidates.

Tue Jun 25 23:10:51 CST 2024    Start to find SINE candidates.

Wed Jun 26 00:53:30 CST 2024    Finish finding SINE candidates.

Wed Jun 26 00:53:30 CST 2024    Start to find LINE candidates.

Wed Jun 26 00:53:30 CST 2024    Identify LINE retrotransposon candidates from scratch.

Wed Jun 26 22:22:12 CST 2024    Finish finding LINE candidates.

Wed Jun 26 22:22:12 CST 2024    Start to find TIR candidates.

Wed Jun 26 22:22:12 CST 2024    Identify TIR candidates from scratch.

Species: others
Thu Jun 27 00:52:18 CST 2024    Finish finding TIR candidates.

Thu Jun 27 00:52:18 CST 2024    Start to find Helitron candidates.

Thu Jun 27 00:52:18 CST 2024    Identify Helitron candidates from scratch.

Thu Jun 27 04:49:40 CST 2024    Finish finding Helitron candidates.

Thu Jun 27 04:49:40 CST 2024    Execution of EDTA_raw.pl is finished!

Thu Jun 27 04:49:40 CST 2024    Obtain raw TE libraries finished.
                All intact TEs found by EDTA: 
                    bro.LA105.7gaps.chr.newID.fa.mod.EDTA.intact.raw.fa 
                    bro.LA105.7gaps.chr.newID.fa.mod.EDTA.intact.raw.gff3

Thu Jun 27 04:49:40 CST 2024    Perform EDTA advance filtering for raw TE candidates and generate the stage 1 library: 

CrossmatchSearchEngine::parseOutput: Unable to open results file: /data3/caicc/BR/03BolGapFree/01hifiasm/LA105Cell1/05repeats/02EDTA/bro.LA105.7gaps.chr.newID.fa.mod.EDTA.combine/RM_3972217.ThuJun270451122024/bro.LA105.7gaps.chr.newID.fa.mod.LTR.raw.fa.HQ_batch-131.cat : No such file or directory at /data3/caicc/Softwares/50/miniconda3/envs/EDTA2/share/RepeatMasker/CrossmatchSearchEngine.pm line 552.
CrossmatchSearchEngine::parseOutput: Unable to open results file: /data3/caicc/BR/03BolGapFree/01hifiasm/LA105Cell1/05repeats/02EDTA/bro.LA105.7gaps.chr.newID.fa.mod.EDTA.combine/RM_4026357.ThuJun270453562024/bro.LA105.7gaps.chr.newID.fa.mod.TIR.intact.raw.fa_batch-13.cat : No such file or directory at /data3/caicc/Softwares/50/miniconda3/envs/EDTA2/share/RepeatMasker/CrossmatchSearchEngine.pm line 552.
CrossmatchSearchEngine::parseOutput: Unable to open results file: /data3/caicc/BR/03BolGapFree/01hifiasm/LA105Cell1/05repeats/02EDTA/bro.LA105.7gaps.chr.newID.fa.mod.EDTA.combine/RM_4156448.ThuJun270502322024/bro.LA105.7gaps.chr.newID.fa.mod.LTR.intact.raw.fa_batch-2.cat : No such file or directory at /data3/caicc/Softwares/50/miniconda3/envs/EDTA2/share/RepeatMasker/CrossmatchSearchEngine.pm line 552.
CrossmatchSearchEngine::parseOutput: Unable to open results file: /data3/caicc/BR/03BolGapFree/01hifiasm/LA105Cell1/05repeats/02EDTA/bro.LA105.7gaps.chr.newID.fa.mod.EDTA.combine/RM_4156448.ThuJun270502322024/bro.LA105.7gaps.chr.newID.fa.mod.LTR.intact.raw.fa_batch-8.cat : No such file or directory at /data3/caicc/Softwares/50/miniconda3/envs/EDTA2/share/RepeatMasker/CrossmatchSearchEngine.pm line 552.
CrossmatchSearchEngine::parseOutput: Unable to open results file: /data3/caicc/BR/03BolGapFree/01hifiasm/LA105Cell1/05repeats/02EDTA/bro.LA105.7gaps.chr.newID.fa.mod.EDTA.combine/RM_4156448.ThuJun270502322024/bro.LA105.7gaps.chr.newID.fa.mod.LTR.intact.raw.fa_batch-11.cat : No such file or directory at /data3/caicc/Softwares/50/miniconda3/envs/EDTA2/share/RepeatMasker/CrossmatchSearchEngine.pm line 552.
CrossmatchSearchEngine::parseOutput: Unable to open results file: /data3/caicc/BR/03BolGapFree/01hifiasm/LA105Cell1/05repeats/02EDTA/bro.LA105.7gaps.chr.newID.fa.mod.EDTA.combine/RM_4156448.ThuJun270502322024/bro.LA105.7gaps.chr.newID.fa.mod.LTR.intact.raw.fa_batch-39.cat : No such file or directory at /data3/caicc/Softwares/50/miniconda3/envs/EDTA2/share/RepeatMasker/CrossmatchSearchEngine.pm line 552.
CrossmatchSearchEngine::parseOutput: Unable to open results file: /data3/caicc/BR/03BolGapFree/01hifiasm/LA105Cell1/05repeats/02EDTA/bro.LA105.7gaps.chr.newID.fa.mod.EDTA.combine/RM_4156448.ThuJun270502322024/bro.LA105.7gaps.chr.newID.fa.mod.LTR.intact.raw.fa_batch-93.cat : No such file or directory at /data3/caicc/Softwares/50/miniconda3/envs/EDTA2/share/RepeatMasker/CrossmatchSearchEngine.pm line 552.
CrossmatchSearchEngine::parseOutput: Unable to open results file: /data3/caicc/BR/03BolGapFree/01hifiasm/LA105Cell1/05repeats/02EDTA/bro.LA105.7gaps.chr.newID.fa.mod.EDTA.combine/RM_4156448.ThuJun270502322024/bro.LA105.7gaps.chr.newID.fa.mod.LTR.intact.raw.fa_batch-114.cat : No such file or directory at /data3/caicc/Softwares/50/miniconda3/envs/EDTA2/share/RepeatMasker/CrossmatchSearchEngine.pm line 552.
CrossmatchSearchEngine::parseOutput: Unable to open results file: /data3/caicc/BR/03BolGapFree/01hifiasm/LA105Cell1/05repeats/02EDTA/bro.LA105.7gaps.chr.newID.fa.mod.EDTA.combine/RM_4156448.ThuJun270502322024/bro.LA105.7gaps.chr.newID.fa.mod.LTR.intact.raw.fa_batch-182.cat : No such file or directory at /data3/caicc/Softwares/50/miniconda3/envs/EDTA2/share/RepeatMasker/CrossmatchSearchEngine.pm line 552.
CrossmatchSearchEngine::parseOutput: Unable to open results file: /data3/caicc/BR/03BolGapFree/01hifiasm/LA105Cell1/05repeats/02EDTA/bro.LA105.7gaps.chr.newID.fa.mod.EDTA.combine/RM_4156448.ThuJun270502322024/bro.LA105.7gaps.chr.newID.fa.mod.LTR.intact.raw.fa_batch-223.cat : No such file or directory at /data3/caicc/Softwares/50/miniconda3/envs/EDTA2/share/RepeatMasker/CrossmatchSearchEngine.pm line 552.
CrossmatchSearchEngine::parseOutput: Unable to open results file: /data3/caicc/BR/03BolGapFree/01hifiasm/LA105Cell1/05repeats/02EDTA/bro.LA105.7gaps.chr.newID.fa.mod.EDTA.combine/RM_4156448.ThuJun270502322024/bro.LA105.7gaps.chr.newID.fa.mod.LTR.intact.raw.fa_batch-239.cat : No such file or directory at /data3/caicc/Softwares/50/miniconda3/envs/EDTA2/share/RepeatMasker/CrossmatchSearchEngine.pm line 552.
CrossmatchSearchEngine::parseOutput: Unable to open results file: /data3/caicc/BR/03BolGapFree/01hifiasm/LA105Cell1/05repeats/02EDTA/bro.LA105.7gaps.chr.newID.fa.mod.EDTA.combine/RM_4156448.ThuJun270502322024/bro.LA105.7gaps.chr.newID.fa.mod.LTR.intact.raw.fa_batch-334.cat : No such file or directory at /data3/caicc/Softwares/50/miniconda3/envs/EDTA2/share/RepeatMasker/CrossmatchSearchEngine.pm line 552.
Thu Jun 27 05:12:08 CST 2024    EDTA advance filtering finished.

Thu Jun 27 05:12:08 CST 2024    Perform EDTA final steps to generate a non-redundant comprehensive TE library.

                Filter RepeatModeler results that are ignored in the raw step.

Thu Jun 27 05:12:48 CST 2024    Clean up TE-related sequences in the CDS file with TEsorter.

                Remove CDS-related sequences in the EDTA library.

                Remove CDS-related sequences in intact TEs.

SINE/NA not found in the TE_SO database, it will not be used to rename sequences in the final annotation.
tRNA/NA not found in the TE_SO database, it will not be used to rename sequences in the final annotation.
SINE/NA not found in the TE_SO database, it will not be used to rename sequences in the final annotation.
SINE/NA not found in the TE_SO database, it will not be used to rename sequences in the final annotation.
SINE/NA not found in the TE_SO database, it will not be used to rename sequences in the final annotation.
SINE/NA not found in the TE_SO database, it will not be used to rename sequences in the final annotation.
SINE/NA not found in the TE_SO database, it will not be used to rename sequences in the final annotation.
SINE/NA not found in the TE_SO database, it will not be used to rename sequences in the final annotation.
SINE/NA not found in the TE_SO database, it will not be used to rename sequences in the final annotation.
SINE/NA not found in the TE_SO database, it will not be used to rename sequences in the final annotation.
SINE/NA not found in the TE_SO database, it will not be used to rename sequences in the final annotation.
SINE/NA not found in the TE_SO database, it will not be used to rename sequences in the final annotation.
Thu Jun 27 05:31:35 CST 2024    EDTA final stage finished! You may check out:
                The final EDTA TE library: bro.LA105.7gaps.chr.newID.fa.mod.EDTA.TElib.fa
Thu Jun 27 05:31:35 CST 2024    Perform post-EDTA analysis for whole-genome annotation:

Thu Jun 27 05:31:35 CST 2024    Homology-based annotation of TEs using bro.LA105.7gaps.chr.newID.fa.mod.EDTA.TElib.fa from scratch.

Thu Jun 27 06:29:33 CST 2024    TE annotation using the EDTA library has finished! Check out:
                Whole-genome TE annotation (total TE: 57.21%): bro.LA105.7gaps.chr.newID.fa.mod.EDTA.TEanno.gff3
                Whole-genome TE annotation summary: bro.LA105.7gaps.chr.newID.fa.mod.EDTA.TEanno.sum
                Whole-genome TE divergence plot: bro.LA105.7gaps.chr.newID.fa.mod_divergence_plot.pdf
                Whole-genome TE density plot: bro.LA105.7gaps.chr.newID.fa.mod.EDTA.TEanno.density_plots.pdf
                Low-threshold TE masking for MAKER gene annotation (masked: 27.87%): bro.LA105.7gaps.chr.newID.fa.mod.MAKER.masked

Thu Jun 27 06:29:34 CST 2024    Evaluate the level of inconsistency for whole-genome TE annotation:

Thu Jun 27 06:34:02 CST 2024    Evaluation of TE annotation finished! Check out these files:

                Overall: bro.LA105.7gaps.chr.newID.fa.mod.EDTA.TE.fa.stat.all.sum
                Nested: bro.LA105.7gaps.chr.newID.fa.mod.EDTA.TE.fa.stat.nested.sum
                Non-nested: bro.LA105.7gaps.chr.newID.fa.mod.EDTA.TE.fa.stat.redun.sum

                If you want to learn more about the formatting and information of these files, please visit:
                    https://github.com/oushujun/EDTA/wiki/Making-sense-of-EDTA-usage-and-outputs---Q&A
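
To see at a glance which RepeatMasker inputs are affected, the "Unable to open results file" warnings in a log like the one above can be tallied per input library. This is only a minimal sketch: it assumes every warning line follows the format shown above (a missing file named `<library>_batch-<N>.cat`), and the log file name `EDTA.log` and the `EDTA_LOG` variable are hypothetical.

```shell
#!/bin/bash
# Tally "Unable to open results file" warnings per RepeatMasker input library.
# Assumption: each warning names a missing file of the form
# <library>_batch-<N>.cat, as in the log above.
summarize_missing_cat() {
    grep 'Unable to open results file' "$1" |
        sed -E 's#.*/([^/ ]+)_batch-[0-9]+\.cat .*#\1#' |
        sort | uniq -c
}

# "EDTA.log" is a hypothetical name; point EDTA_LOG at wherever the EDTA
# stdout/stderr was saved.
log=${EDTA_LOG:-EDTA.log}
if [ -f "$log" ]; then
    summarize_missing_cat "$log"
fi
```

For the log above this would report one missing batch for `LTR.raw.fa.HQ`, one for `TIR.intact.raw.fa`, and ten for `LTR.intact.raw.fa`. The tally only shows where batches were lost; it does not say whether the final annotation is affected.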

oushujun commented 1 month ago

Dear Chengcheng,

Sorry for the long delay. EDTA configures RepeatMasker to use the rmblast engine. I haven't used the CrossmatchSearchEngine before. Are you aware of any special configurations?

Thanks! Shujun

cai1991 commented 1 month ago

Dear Shujun,

Sorry for the late response. I have not figured it out yet, as I was occupied with other things.

Another thing I would like to mention is that different runs on the same genome with the same parameters seem to produce very different outputs, especially for the Copia and Gypsy LTRs. Please see the attached figure. I ran EDTA on my genome five times and obtained different results each time. The Copia and Gypsy ratio varies a lot between some runs. I don't know whether this is caused by the issues above. My EDTA version is v2.2.1.

Best regards, Chengcheng

[Figure: inconsistent results for different runs]
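
A run-to-run comparison like this can also be put into numbers by summing annotated base pairs per feature type (column 3) in each run's `*.EDTA.TEanno.gff3`. The sketch below relies only on the standard GFF3 column layout; the `run*/` directory names are hypothetical, and the assumption that Copia and Gypsy elements carry those words in column 3 should be checked against the actual files. Overlapping features are counted twice, so the totals are a rough comparison, not a replacement for the `TEanno.sum` file.

```shell
#!/bin/bash
# Sum feature lengths (end - start + 1) per column-3 type in a GFF3 file.
# Overlapping features are counted more than once.
gff3_bp_by_type() {
    awk -F '\t' '!/^#/ && NF >= 5 { bp[$3] += $5 - $4 + 1 }
                 END { for (t in bp) printf "%s\t%.0f\n", t, bp[t] }' "$1" | sort
}

# Hypothetical layout: one directory per EDTA run (run1/, run2/, ...).
for gff in run*/bro.LA105.7gaps.chr.newID.fa.mod.EDTA.TEanno.gff3; do
    [ -f "$gff" ] || continue
    echo "== $gff"
    # Assumption: Copia/Gypsy elements are labelled as such in column 3.
    gff3_bp_by_type "$gff" | grep -E 'Copia|Gypsy' || true
done
```

If the per-type totals differ between runs by more than a few percent, that would support the impression given by the figure.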

oushujun commented 1 month ago

Did the same issue occur in all five runs? I have also noticed that LTR performance in maize is inferior to that of previous versions, but I am unsure how prevalent this is.

Shujun

cai1991 commented 1 month ago

Yes, each time the same issue happens.

Best, Chengcheng

oushujun commented 1 month ago

Please check your default $ENV and make sure there is no other version of RepeatMasker masking the conda version. The conda version should use rmblastn as the search engine.

Shujun
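
One way to carry out this check is to list every `RepeatMasker` executable on `PATH` in lookup order and see whether the first one sits inside the active conda environment. This is a sketch under assumptions: the helper names `find_on_path` and `first_hit_is_under` are made up for illustration, and `CONDA_PREFIX` is only set when a conda environment has been activated.

```shell
#!/bin/bash
# Print every executable named "$1" found on PATH, in lookup order.
# The function body runs in a subshell so IFS and "set -f" do not leak out.
find_on_path() (
    set -f          # no globbing while PATH is split on ':'
    IFS=:
    for dir in $PATH; do
        if [ -n "$dir" ] && [ -f "$dir/$1" ] && [ -x "$dir/$1" ]; then
            printf '%s\n' "$dir/$1"
        fi
    done
)

# Succeed only if the first hit for "$1" lives under the prefix "$2".
first_hit_is_under() {
    case "$(find_on_path "$1" | head -n 1)" in
        "$2"/*) return 0 ;;
        *)      return 1 ;;
    esac
}

find_on_path RepeatMasker
if [ -n "${CONDA_PREFIX:-}" ]; then
    if first_hit_is_under RepeatMasker "$CONDA_PREFIX"; then
        echo "OK: the first RepeatMasker on PATH is inside $CONDA_PREFIX"
    else
        echo "WARNING: a RepeatMasker outside $CONDA_PREFIX comes first on PATH"
    fi
fi
```

The command in the original post calls the environment's perl by its absolute path. If the environment was not activated beforehand, `PATH` may not start with that environment's `bin` directory, so a system-wide RepeatMasker could be picked up instead. This is one possibility worth ruling out, not a confirmed cause.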