oushujun / EDTA

Extensive de-novo TE Annotator
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1905-y
GNU General Public License v3.0

${Species_name}.fa.mod.LTR.raw.fa-${Species_name}.fa.mod.TIR.raw.fa.fa" not found! #239

Closed yywyaoyaowu closed 2 years ago

yywyaoyaowu commented 2 years ago

Hi Shujun, when we ran EDTA, we got this error: "No such file or directory at /public1/home/sc61338/01_software/anaconda3/envs/EDTA/share/EDTA/util/TE_purifier.pl line 103. Input file "Solanum_macrocarpon.fa.mod.LTR.raw.fa-Solanum_macrocarpon.fa.mod.TIR.raw.fa.fa" not found!"

Do you have any suggestions? Thanks very much!

Yaoyao

baozg commented 2 years ago

Hi, @oushujun

The actual error is as follows:

RepeatMasker version 4.1.0
Search Engine: NCBI/RMBLAST [ 2.10.0+ ]
Master RepeatMasker Database: 
/software/RepeatMasker/RepeatMasker/Libraries/RepeatMaskerLib.embl ( Complete Database: CONS-Dfam_3.1 )
Custom Repeat Library: test.fa.mod.TIR.raw.fa
analyzing file test.fa.mod.LTR.raw.fa
FastaDB::_cleanIndexAndCompact(): Fasta file contains a sequence identifier which is too long ( max id length = 50 )
 at /software/RepeatMasker/RepeatMasker/RepeatMasker line 792.

### LTR.raw.fa
>HiC_scaffold_1:29510396..29512839_LTR#LTR/unknown
TGTTG

### TIR.raw.fa
>HiC_scaffold_1:520069..520309#MITE/DTT TSD:TA_TA
CTCCCTCCG

Although the FASTA sequence names are shorter than 15 characters, the run still fails at the RepeatMasker step. Maybe errors should be written to a dedicated log file so that users can debug them? There are many 2>/dev/null redirections throughout the EDTA pipeline, and they discard all error messages from the other software.
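For anyone who wants to check their own intermediate files, here is a minimal Python sketch that lists identifiers longer than the 50-character limit quoted in the RepeatMasker message above. It is not part of EDTA. The function name find_long_ids is made up for illustration, and it assumes the identifier is the header text after ">" up to the first whitespace, which may not match exactly how RepeatMasker measures it.

```python
import io

MAX_ID_LEN = 50  # limit quoted in the RepeatMasker error message


def find_long_ids(fasta_lines, max_len=MAX_ID_LEN):
    """Return (identifier, length) pairs whose identifier exceeds max_len.

    Assumption: the identifier is the header text after '>' up to the
    first whitespace character.
    """
    too_long = []
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            seq_id = fields[0] if fields else ""
            if len(seq_id) > max_len:
                too_long.append((seq_id, len(seq_id)))
    return too_long


# Two headers in the style shown in this thread: the first is 49
# characters long, the second is 52.
example = io.StringIO(
    ">HiC_scaffold_1:29510396..29512839_LTR#LTR/unknown\nTGTTG\n"
    ">HiC_scaffold_10:295103966..295128396_LTR#LTR/unknown\nTGTTG\n"
)
for seq_id, length in find_long_ids(example):
    print(length, seq_id)
```

To scan a real file, pass an open file handle instead of the in-memory example, e.g. find_long_ids(open("test.fa.mod.LTR.raw.fa")).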

oushujun commented 2 years ago

It seems that, with today's longer sequences, TE coordinates have become too long to fit the RepeatMasker requirement. In this case, there must be sequence IDs such as HiC_scaffold_10:295103966..295128396_LTR#LTR/unknown that exceed the 50-character limit imposed by RepeatMasker. To avoid cases like this, I will change the maximum sequence ID length to 13 characters, so that TEs identified from sequences of up to 999.999999 Mb still fit within the RepeatMasker naming requirement. A simple fix for your case is to replace the string HiC_scaffold with, e.g., HiCScf in your genome assembly and rerun EDTA from scratch.
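The renaming can be done with any text tool. As one possible sketch in Python (the helper names shorten_header and shorten_fasta are hypothetical, not EDTA utilities), the code below rewrites only header lines that start with the old prefix and leaves sequence lines untouched. The input and output paths are placeholders you would supply yourself.

```python
def shorten_header(line, old="HiC_scaffold", new="HiCScf"):
    """Swap a long name prefix in a FASTA header line for a shorter one.

    Lines that are not headers starting with '>' + old are returned unchanged.
    """
    if line.startswith(">" + old):
        return ">" + new + line[1 + len(old):]
    return line


def shorten_fasta(in_path, out_path, old="HiC_scaffold", new="HiCScf"):
    """Copy a FASTA file line by line, shortening matching header prefixes."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(shorten_header(line, old, new))


# Quick check on a single header line
print(shorten_header(">HiC_scaffold_10\n"), end="")
```

With this prefix, HiC_scaffold_10 becomes HiCScf_10, which is 9 characters and so within the 13-character limit mentioned above. If other input files refer to the old sequence names, they would presumably need the same renaming.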

Best, Shujun