oushujun / EDTA

Extensive de-novo TE Annotator
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1905-y
GNU General Public License v3.0

EDTA crashed after no SINE found #427

Closed by CongLiu37 2 months ago

CongLiu37 commented 4 months ago

Hello,

I am using EDTA v2.2.0 to process my insect genomes. The command looks like this:

EDTA.pl --genome ${genome.fa} --species others --step all --overwrite 0 --sensitive 1 --anno 1 --threads 30 --cds ${rep.fna}

The program crashed after failing to find SINEs:

Thu  1 Feb 23:29:39 JST 2024    EDTA_raw: Check dependencies, prepare working directories.

Thu  1 Feb 23:29:41 JST 2024    Start to find LTR candidates.

Thu  1 Feb 23:29:41 JST 2024    Identify LTR retrotransposon candidates from scratch.

Fri  2 Feb 00:09:12 JST 2024    Finish finding LTR candidates.

Fri  2 Feb 00:09:12 JST 2024    Start to find SINE candidates.

cp: cannot stat 'genome.fa.mod.SINE.raw.fa': No such file or directory
Error: SINE results not found!

ERROR: Raw SINE results not found in genome.fa.mod.EDTA.raw/genome.fa.mod.SINE.raw.fa
    If you believe the program is working properly, this may be caused by the lack of SINEs in your genome.

This might make some sense, as RepeatModeler+RepeatMasker estimated a low SINE load in my genomes (<5% in most cases, generally 1.5%-3%). So I am wondering whether there is any way to finish the EDTA pipeline even if no SINE is found in the genome?

Sincerely,

Cong

oushujun commented 4 months ago

That's abnormal. In 2.2.0, it's allowed to have 0 SINE or LINE found. Maybe you were using a slightly older version. Do you see anything in the raw/SINE folder?

Shujun

CongLiu37 commented 4 months ago

I am using EDTA v2.2.0 installed by mamba:

$ EDTA.pl

#########################################################
##### Extensive de-novo TE Annotator (EDTA) v2.2.0  #####
##### Shujun Ou (shujun.ou.1@gmail.com)             #####
#########################################################

Parameters: 

At least 1 parameter is required:
1) Input fasta file: --genome

This is the Extensive de-novo TE Annotator that generates a high-quality
structure-based TE library. Usage:

There is basically nothing in raw/SINE:

$ ls genome.fa.mod.EDTA.raw/SINE/
genome.fa.mod

Sincerely,

Cong

oushujun commented 4 months ago

Please pull the GitHub version instead, thanks!

Shujun

colindaven commented 4 months ago

Will it work if you add --force 1 to your command, which adds the rice (I think) repeats?

CongLiu37 commented 4 months ago

Hello,

I tried pulling EDTA from GitHub while keeping all dependencies in mamba, but the test still failed:

$ EDTA.pl --genome genome.fa --cds genome.cds.fa --curatedlib ../database/rice7.0.0.liban --exclude genome.exclude.bed --overwrite 1 --sensitive 1 --anno 1 --threads 10

#########################################################
##### Extensive de-novo TE Annotator (EDTA) v2.2.0  #####
##### Shujun Ou (shujun.ou.1@gmail.com)             #####
#########################################################

Parameters: --genome genome.fa --cds genome.cds.fa --curatedlib ../database/rice7.0.0.liban --exclude genome.exclude.bed --overwrite 1 --sensitive 1 --anno 1 --threads 10

Fri 16 Feb 00:24:41 JST 2024    Dependency checking:
                All passed!

    A custom library ../database/rice7.0.0.liban is provided via --curatedlib. Please make sure this is a manually curated library but not machine generated.

    A CDS file genome.cds.fa is provided via --cds. Please make sure this is the DNA sequence of coding regions only.

    A BED file is provided via --exclude. Regions specified by this file will be excluded from TE annotation and masking.

Fri 16 Feb 00:24:42 JST 2024    Obtain raw TE libraries using various structure-based programs: 
Fri 16 Feb 00:24:42 JST 2024    EDTA_raw: Check dependencies, prepare working directories.

Fri 16 Feb 00:24:43 JST 2024    Start to find LTR candidates.

Fri 16 Feb 00:24:43 JST 2024    Identify LTR retrotransposon candidates from scratch.

Warning: LOC list genome.fa.mod.ltrTE.veryfalse is empty.
Fri 16 Feb 00:25:16 JST 2024    Finish finding LTR candidates.

Fri 16 Feb 00:25:16 JST 2024    Start to find SINE candidates.

cp: cannot stat 'genome.fa.mod.SINE.raw.fa': No such file or directory
Error: SINE results not found!

ERROR: Raw SINE results not found in genome.fa.mod.EDTA.raw/genome.fa.mod.SINE.raw.fa
    If you believe the program is working properly, this may be caused by the lack of SINEs in your genome. 

I also tried --force 1. The test finished with warnings:

$ EDTA.pl --genome genome.fa --cds genome.cds.fa --curatedlib ../database/rice7.0.0.liban --exclude genome.exclude.bed --overwrite 1 --sensitive 1 --anno 1 --threads 10 --force 1

#########################################################
##### Extensive de-novo TE Annotator (EDTA) v2.2.0  #####
##### Shujun Ou (shujun.ou.1@gmail.com)             #####
#########################################################

Parameters: --genome genome.fa --cds genome.cds.fa --curatedlib ../database/rice7.0.0.liban --exclude genome.exclude.bed --overwrite 1 --sensitive 1 --anno 1 --threads 10 --force 1

Fri 16 Feb 00:29:29 JST 2024    Dependency checking:
                All passed!

    A custom library ../database/rice7.0.0.liban is provided via --curatedlib. Please make sure this is a manually curated library but not machine generated.

    A CDS file genome.cds.fa is provided via --cds. Please make sure this is the DNA sequence of coding regions only.

    A BED file is provided via --exclude. Regions specified by this file will be excluded from TE annotation and masking.

Fri 16 Feb 00:29:30 JST 2024    Obtain raw TE libraries using various structure-based programs: 
Fri 16 Feb 00:29:30 JST 2024    EDTA_raw: Check dependencies, prepare working directories.

Fri 16 Feb 00:29:31 JST 2024    Start to find LTR candidates.

Fri 16 Feb 00:29:31 JST 2024    Identify LTR retrotransposon candidates from scratch.

Warning: LOC list genome.fa.mod.ltrTE.veryfalse is empty.
Fri 16 Feb 00:30:04 JST 2024    Finish finding LTR candidates.

Fri 16 Feb 00:30:04 JST 2024    Start to find SINE candidates.

cp: cannot stat 'genome.fa.mod.SINE.raw.fa': No such file or directory
Error: SINE results not found!

cat: genome.fa.mod.TIR.intact.raw.bed: No such file or directory
cat: genome.fa.mod.Helitron.intact.raw.bed: No such file or directory
Fri 16 Feb 00:30:04 JST 2024    Obtain raw TE libraries finished.
                All intact TEs found by EDTA: 
                    genome.fa.mod.EDTA.intact.raw.fa 
                    genome.fa.mod.EDTA.intact.raw.gff3

Fri 16 Feb 00:30:04 JST 2024    Perform EDTA advance filtering for raw TE candidates and generate the stage 1 library: 

Warning: No repetitive sequences were detected in genome.fa.mod.LTR.raw.fa

Warning: No repetitive sequences were detected in genome.fa.mod.SINE.raw.fa
Fri 16 Feb 00:35:07 JST 2024    EDTA advance filtering finished.

Fri 16 Feb 00:35:07 JST 2024    Perform EDTA final steps to generate a non-redundant comprehensive TE library.

cp: cannot stat '../genome.fa.mod.EDTA.raw/genome.fa.mod.RM2.fa': No such file or directory
                Skipping the RepeatModeler results (--sensitive 0).
                Run EDTA.pl --step final --sensitive 1 if you want to add RepeatModeler results.

Fri 16 Feb 00:35:08 JST 2024    Clean up TE-related sequences in the CDS file with TEsorter.

                Remove CDS-related sequences in the EDTA library.

                Remove CDS-related sequences in intact TEs.

Fri 16 Feb 00:39:23 JST 2024    Combine the high-quality TE library rice7.0.0.liban with the EDTA library:

Fri 16 Feb 00:41:42 JST 2024    EDTA final stage finished! You may check out:
                The final EDTA TE library: genome.fa.mod.EDTA.TElib.fa
                Family names of intact TEs have been updated by rice7.0.0.liban: genome.fa.mod.EDTA.intact.gff3
                Comparing to the provided library, EDTA found these novel TEs: genome.fa.mod.EDTA.TElib.novel.fa
                The provided library has been incorporated into the final library: genome.fa.mod.EDTA.TElib.fa

Fri 16 Feb 00:41:42 JST 2024    Perform post-EDTA analysis for whole-genome annotation:

Fri 16 Feb 00:41:42 JST 2024    Homology-based annotation of TEs using genome.fa.mod.EDTA.TElib.fa from scratch.

Fri 16 Feb 00:42:04 JST 2024    TE annotation using the EDTA library has finished! Check out:
                Whole-genome TE annotation (total TE: 29.83%): genome.fa.mod.EDTA.TEanno.gff3
                Whole-genome TE annotation summary: genome.fa.mod.EDTA.TEanno.sum
                Low-threshold TE masking for MAKER gene annotation (masked: 15.63%): genome.fa.mod.MAKER.masked

Fri 16 Feb 00:42:04 JST 2024    Evaluate the level of inconsistency for whole-genome TE annotation:

Fri 16 Feb 00:42:18 JST 2024    Evaluation of TE annotation finished! Check out these files:

                Overall: genome.fa.mod.EDTA.TE.fa.stat.all.sum
                Nested: genome.fa.mod.EDTA.TE.fa.stat.nested.sum
                Non-nested: genome.fa.mod.EDTA.TE.fa.stat.redun.sum

                If you want to learn more about the formatting and information of these files, please visit:
                    https://github.com/oushujun/EDTA/wiki/Making-sense-of-EDTA-usage-and-outputs---Q&A

The results look OK?

$ ls -l 
total 15238
-rw-r--r-- 1 c-liu bourguignonuni 1000014 Feb 15 18:29 Alyrata.test.fa
-rw-r--r-- 1 c-liu bourguignonuni 1000009 Feb 15 18:29 Col.test.fa
-rw-r--r-- 1 c-liu bourguignonuni  199787 Feb 15 18:29 genome.cds.fa
-rw-r--r-- 1 c-liu bourguignonuni      38 Feb 15 18:29 genome.cds.list
-rw-r--r-- 1 c-liu bourguignonuni   61399 Feb 15 18:29 genome.exclude.bed
-rw-r--r-- 1 c-liu bourguignonuni 1000007 Feb 15 18:29 genome.fa
-rw-r--r-- 1 c-liu bourguignonuni 1000007 Feb 16 00:29 genome.fa.mod
drwxr-sr-x 2 c-liu bourguignonuni    4096 Feb 16 00:42 genome.fa.mod.EDTA.anno
drwxr-sr-x 3 c-liu bourguignonuni  131072 Feb 16 00:35 genome.fa.mod.EDTA.combine
drwxr-sr-x 3 c-liu bourguignonuni    4096 Feb 16 00:41 genome.fa.mod.EDTA.final
-rw-r--r-- 1 c-liu bourguignonuni 2787953 Feb 16 00:41 genome.fa.mod.EDTA.intact.fa
-rw-r--r-- 1 c-liu bourguignonuni    5040 Feb 16 00:41 genome.fa.mod.EDTA.intact.gff3
drwxr-sr-x 7 c-liu bourguignonuni    4096 Feb 16 00:30 genome.fa.mod.EDTA.raw
-rw-r--r-- 1 c-liu bourguignonuni  109850 Feb 16 00:42 genome.fa.mod.EDTA.TEanno.gff3
-rw-r--r-- 1 c-liu bourguignonuni   18759 Feb 16 00:42 genome.fa.mod.EDTA.TEanno.sum
-rw-r--r-- 1 c-liu bourguignonuni 5306510 Feb 16 00:41 genome.fa.mod.EDTA.TElib.fa
-rw-r--r-- 1 c-liu bourguignonuni       0 Feb 16 00:40 genome.fa.mod.EDTA.TElib.novel.fa
-rw-r--r-- 1 c-liu bourguignonuni 1000007 Feb 16 00:42 genome.fa.mod.MAKER.masked
-rw-r--r-- 1 c-liu bourguignonuni 1000010 Feb 15 18:29 Ler.test.fa
-rw-r--r-- 1 c-liu bourguignonuni     543 Feb 15 18:29 memo
-rw-r--r-- 1 c-liu bourguignonuni     996 Feb 15 18:29 README.txt
lrwxrwxrwx 1 c-liu bourguignonuni      73 Feb 16 00:12 rice7.0.0.liban -> /bucket/.mabuya/BourguignonU/Cong/Softwares/EDTA/database/rice7.0.0.liban

However, I do not understand why it makes sense to add rice TEs to distant genomes. In my case I am working with insects that do not have much ecological interaction with rice, and it seems people working with prokaryotes are also using --force 1 (say #405?). Could you please explain this option in a bit more detail? @oushujun

Sincerely,

Cong


WuSir312 commented 3 months ago

Hello, thanks for your nice EDTA. I am using EDTA v2.2.0 to analyze an insect genome. However, some insects have no SINEs, as also reported in this paper (https://doi.org/10.1186/s12915-021-01158-2). How can I finish the EDTA run? Should I try --force 1? Sincerely,

ShuangXiong Wu

CongLiu37 commented 3 months ago

Hello @WuSir312

I am running EDTA with --force 1 and --sensitive 1 for my insect genomes. I manually checked the *.TEanno.sum for a few genomes for which EDTA has already finished, and the results look normal: LINE/SINE are found, the total TE load looks acceptable, and the proportion of LINE looks reasonable.

Sincerely,

Cong

oushujun commented 3 months ago

Hello Cong and Shuangxiong,

If you are pretty sure that your genome does not have SINE/LINE or any of the TE types EDTA recognizes, using --force 1 will make sense because EDTA will use rice TE libraries to skip the step and allow EDTA to finish. Using rice sequences likely won't impact your existing TEs because they are probably very dissimilar, which means the rice sequences will do nothing except help you finish the EDTA execution. But if you know that your species has the TE type but EDTA didn't have it annotated due to programmatic errors, using --force 1 will not make sense.
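Before deciding between the two cases above, it can help to see which raw libraries are actually missing or empty. The following is a minimal sketch, not part of EDTA: the function name raw_lib_status and its output format are hypothetical, and it assumes the raw libraries follow the <genome>.mod.EDTA.raw/<genome>.mod.<TYPE>.raw.fa naming that appears in the logs in this thread.

```shell
#!/usr/bin/env bash
# Sketch: report whether a raw TE library is missing, empty, or populated.
# Assumption: the path layout is inferred from the log messages in this thread;
# raw_lib_status is a hypothetical helper, not an EDTA command.

raw_lib_status() {
    local genome="$1"    # genome file name as passed to --genome, e.g. genome.fa
    local te_type="$2"   # TE type label used in the file name, e.g. SINE or LTR
    local lib="${genome}.mod.EDTA.raw/${genome}.mod.${te_type}.raw.fa"

    if [ ! -e "$lib" ]; then
        echo "${te_type}: missing"
    elif [ ! -s "$lib" ]; then
        echo "${te_type}: empty"
    else
        # Count FASTA header lines to get the number of candidate sequences.
        echo "${te_type}: $(grep -c '^>' "$lib") sequences"
    fi
}
```

Run from the EDTA working directory, for example raw_lib_status genome.fa SINE. A "missing" file after a run that logged no tool errors points toward an installation or environment problem rather than a genome without SINEs, in which case --force 1 would only hide the problem.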

Thanks, Shujun

yyliang12 commented 2 months ago

Hi Dr. Shujun,

Thanks for developing such a great program!

Lately I've also encountered the "SINE results not found!" problem while annotating TE sequences in a pineapple genome with either v2.2.0 or v2.2.1. Here are the errors:

cp: cannot stat 'P1_hap1_FINAL.fasta.mod.SINE.raw.fa': No such file or directory
Error: SINE results not found!

ERROR: Raw SINE results not found in P1_hap1_FINAL.fasta.mod.EDTA.raw/P1_hap1_FINAL.fasta.mod.SINE.raw.fa
    If you believe the program is working properly, this may be caused by the lack of SINEs in your genome.

But it is strange, because just days before I succeeded in annotating the same genome, restricted to the chromosome-level sequences, using the EDTA.pl v2.2.0 pipeline installed by mamba.

I've also tried annotating only SINE repeats with EDTA_raw.pl --type sine, and surprisingly the program finished without errors. Here's the output:

.
├── EDTA_SINE.log
├── P1_hap1_FINAL.fasta -> /home/yanyang_liang/ProgramFiles/2024/03_Aco_Annotation/00_Data/01_Genome/P1_hap1_FINAL.fasta
├── P1_hap1_FINAL.fasta.mod
└── P1_hap1_FINAL.fasta.mod.EDTA.raw
    ├── Helitron
    ├── LINE
    ├── LTR
    ├── P1_hap1_FINAL.fasta.mod.SINE.raw.fa
    ├── SINE
    │   ├── HMM_out
    │   ├── P1_hap1_FINAL.fasta_bbb805cef30611ee9c7590e2ba919692-matches.fasta
    │   ├── P1_hap1_FINAL.fasta.mod -> ../../P1_hap1_FINAL.fasta.mod
    │   ├── P1_hap1_FINAL.fasta.mod.AnnoSINE.raw.fa
    │   ├── P1_hap1_FINAL.fasta.mod.AnnoSINE.raw.fa.cln
    │   ├── P1_hap1_FINAL.fasta.mod.AnnoSINE.raw.fa.cln.cleanup
    │   ├── P1_hap1_FINAL.fasta.mod.AnnoSINE.raw.fa.cln.list
    │   ├── P1_hap1_FINAL.fasta.mod.AnnoSINE.raw.fa.dirt.list
    │   ├── P1_hap1_FINAL.fasta.mod.AnnoSINE.raw.fa.rexdb.cls.lib
    │   ├── P1_hap1_FINAL.fasta.mod.AnnoSINE.raw.fa.rexdb.cls.pep
    │   ├── P1_hap1_FINAL.fasta.mod.AnnoSINE.raw.fa.rexdb.cls.tsv
    │   ├── P1_hap1_FINAL.fasta.mod.AnnoSINE.raw.fa.rexdb.dom.faa
    │   ├── P1_hap1_FINAL.fasta.mod.AnnoSINE.raw.fa.rexdb.dom.gff3
    │   ├── P1_hap1_FINAL.fasta.mod.AnnoSINE.raw.fa.rexdb.domtbl
    │   ├── P1_hap1_FINAL.fasta.mod.AnnoSINE.raw.fa.rexdb.dom.tsv
    │   ├── P1_hap1_FINAL.fasta.mod.SINE.raw.fa
    │   ├── Seed_SINE.fa
    │   ├── Step1_extend_tsd_input_1.fa
    │   ├── Step1_extend_tsd_input_2.fa
    │   ├── Step1_extend_tsd_input.fa
    │   ├── Step2_extend_blast_input.fa
    │   ├── Step2_extend_blast_input_rename.fa
    │   ├── Step2_tsd_output.fa
    │   ├── Step2_tsd.txt
    │   ├── Step3_blast_output.out
    │   ├── Step3_blast_output.out.fa
    │   ├── Step3_blast_output.paf
    │   ├── Step3_blast_process_output.fa
    │   ├── Step4_rna_input.fasta
    │   ├── Step4_rna_output.fasta
    │   ├── Step4_rna_output.fasta.2.5.7.80.10.10.2000.dat
    │   ├── Step4_rna_output.out
    │   ├── Step5_trf_output.fasta
    │   ├── Step6_irf_input.fasta
    │   ├── Step6_irf_input.fasta.2.3.5.80.10.20.500000.10000.dat
    │   ├── Step6_irf_output.fasta
    │   ├── Step7_cluster_output.fasta
    │   └── Step7_cluster_output.fasta.clstr
    └── TIR

7 directories, 41 files

I've checked that there are over a hundred sequences in the file Seed_SINE.fa:

file          format  type  num_seqs  sum_len  min_len  avg_len  max_len
Seed_SINE.fa  FASTA   DNA        122   29,941       98    245.4      755

Do you have any suggestions for solving this problem?

Best, Yanyang.

yyliang12 commented 2 months ago

Hi Dr. Shujun,

I think I may have found the answer to this problem. After I added export PATH="$HOME/miniconda3/envs/EDTA/bin:$PATH" to my script, EDTA ran through SINE annotation properly. So it is possible that some environment variable affected the EDTA pipeline.
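A small sketch of the PATH fix described above, with a check that the shell really picks up the executable from the intended environment. The helper names prepend_to_path and resolves_inside are hypothetical, and the environment directory in the commented example is an assumption that will differ between installations.

```shell
#!/usr/bin/env bash
# Sketch: put an environment's bin directory first on PATH and confirm
# which executable the shell will run. Both helpers are hypothetical.

prepend_to_path() {
    # Always prepend, so this directory wins over any earlier PATH entry.
    export PATH="$1:$PATH"
}

resolves_inside() {
    # Succeeds only if command $1 resolves to a file under directory $2.
    local resolved
    resolved="$(command -v "$1")" || return 1
    case "$resolved" in
        "$2"/*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example (not executed here; the directory is an assumed install location):
#   prepend_to_path "$HOME/miniconda3/envs/EDTA/bin"
#   resolves_inside EDTA.pl "$HOME/miniconda3/envs/EDTA/bin" \
#       || echo "EDTA.pl is being picked up from somewhere else" >&2
```

Running the same check for the other tools the pipeline calls can show whether a job script is mixing executables from two different environments.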

Thanks, Yanyang.

oushujun commented 2 months ago

Hi Yanyang,

I am glad you found the solution.

The line of code that reports the error is

die "Error: SINE results not found!\n\n" unless -e "$genome.EDTA.raw/$genome.SINE.raw.fa";

It should work even if your genome file contains a path because this code block handles the path:

my $genome_file = basename($genome);
`ln -s $genome $genome_file` unless -e $genome_file;
$genome = $genome_file;
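For readers less familiar with Perl, the two snippets above can be paraphrased in shell roughly as follows. This is only an illustrative sketch: the function names are hypothetical, and the real pipeline does this in Perl, not through these helpers.

```shell
#!/usr/bin/env bash
# Sketch: shell paraphrase of the Perl path handling and existence check.
# link_genome_here and sine_raw_exists are hypothetical names.

link_genome_here() {
    local genome="$1"
    local genome_file
    genome_file="$(basename "$genome")"
    # Symlink the genome into the working directory unless a file of
    # that name is already there, then work with the bare file name.
    [ -e "$genome_file" ] || ln -s "$genome" "$genome_file"
    echo "$genome_file"
}

sine_raw_exists() {
    # Mirrors the condition guarding the "SINE results not found!" error.
    local genome="$1"
    [ -e "${genome}.EDTA.raw/${genome}.SINE.raw.fa" ]
}
```

Because the check uses only the bare file name, it is resolved relative to the current working directory, not relative to wherever the original genome file lives.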

Thanks, Shujun