oushujun / EDTA

Extensive de-novo TE Annotator
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1905-y
GNU General Public License v3.0

Provide --curatedlib and encounter error: RepeatMasker results not found #220

Closed zmz1988 closed 2 years ago

zmz1988 commented 3 years ago

Dear developers,

Thanks for developing this awesome tool! I tried to run it recently and encountered this error message: 'ERROR: RepeatMasker results not found in necat_ragtag.fasta.mod.out!'

My command is: 'perl /home/EDTA/EDTA.pl --genome necat_ragtag.fasta --cds Araport11_genes.202106.cds.fasta --curatedlib TAIR10_TE.fasta --overwrite 1 --sensitive 1 --evaluate 1 --threads 10 --anno 1'

Please see below in part of the log file:

'2021-09-10 04:21:53,060 -INFO- generating gene anntations
2021-09-10 04:21:56,808 -INFO- 648 sequences classified by HMM
2021-09-10 04:21:56,809 -INFO- see protein domain sequences in Araport11_genes.202106.cds.fasta.code.rexdb.dom.faa and annotation gff3 file in Araport11_genes.202106.cds.fasta.code.rexdb.dom.gff3
2021-09-10 04:21:56,809 -INFO- classifying the unclassified sequences by searching against the classified ones
2021-09-10 04:21:58,380 -INFO- using the 80-80-80 rule
2021-09-10 04:21:58,380 -INFO- run CMD: makeblastdb -in ./tmp/pass1_classified.fa -dbtype nucl
2021-09-10 04:21:58,529 -INFO- run CMD: blastn -query ./tmp/pass1_unclassified.fa -db ./tmp/pass1_classified.fa -out ./tmp/pass1_unclassified.fa.blastout -outfmt '6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen qcovs qcovhsp sstrand' -num_threads 10
2021-09-10 04:22:09,237 -INFO- 111 sequences classified in pass 2
2021-09-10 04:22:09,237 -INFO- total 759 sequences classified.
2021-09-10 04:22:09,237 -INFO- see classified sequences in Araport11_genes.202106.cds.fasta.code.rexdb.cls.tsv
2021-09-10 04:22:09,237 -INFO- writing library for RepeatMasker in Araport11_genes.202106.cds.fasta.code.rexdb.cls.lib
2021-09-10 04:22:10,929 -INFO- writing classified protein domains in Araport11_genes.202106.cds.fasta.code.rexdb.cls.pep
2021-09-10 04:22:11,425 -INFO- Summary of classifications:
Order     Superfamily    # of Sequences  # of Clade Sequences  # of Clades  # of full Domains
LTR       Bel-Pao        7               0                     0            0
LTR       Copia          97              75                    13           0
LTR       Gypsy          167             125                   16           0
LTR       Retrovirus     1               0                     0            0
LTR       mixture        6               0                     0            0
Penelope  unknown        4               0                     0            0
LINE      unknown        119             0                     0            0
TIR       EnSpm_CACTA    2               0                     0            0
TIR       MuDR_Mutator   64              0                     0            0
TIR       PIF_Harbinger  17              0                     0            0
TIR       hAT            43              0                     0            0
Helitron  unknown        3               0                     0            0
Maverick  unknown        224             0                     0            0
mixture   mixture        5               0                     0            0
2021-09-10 04:22:11,427 -INFO- Pipeline done.
2021-09-10 04:22:11,427 -INFO- cleaning the temporary directory ./tmp

    Reclassify sequence based on RepeatMasker .out file
    The RM.out file is generated using a library with family classification to mask the seq.fa file.
            perl classify_by_lib_RM.pl -seq seq.fa -RM seq.fa.out -cov 80 -len 80 -iden 80

    Replace strings in the target with a conversion list (old, new)
            perl rename_by_list.pl target list mode[0|1]
                    mode = 0, generic replace (slow)
                    mode = 1, specific for gff3 files

mv: cannot stat 'necat_ragtag.fasta.mod.EDTA.intact.fa.rename': No such file or directory
Fri 10 Sep 04:31:23 BST 2021
Homology-based annotation of TEs using necat_ragtag.fasta.mod.EDTA.TElib.fa from scratch.

ERROR: RepeatMasker results not found in necat_ragtag.fasta.mod.out!'

I ran the test data in the EDTA folder, and everything went well, producing all the annotation outputs. But the run with my own data seems to get stuck at the annotation step, even though I have RepeatMasker and RepeatModeler installed. I'm not sure what went wrong. Could you please help?

Thanks!

oushujun commented 3 years ago

Try to rename your sequences. Or reduce parameters one by one until you find which parameter is giving you issues.

zmz1988 commented 3 years ago

Thanks! I tried using a shorter file name, but the error in the annotation step still occurred, even though EDTA reported that its final step had finished.

So I tried to rerun just the annotation step, as I already have the final TE library. In this case, if I use perl /home/EDTA/EDTA.pl --genome necat_ragtag.fasta --cds Araport11_genes.202106.cds.fasta --curatedlib TAIR10_TE.fasta --threads 10 --anno 1 --step anno, will EDTA automatically pick up the .mod.EDTA.TElib.fa file for annotation?

oushujun commented 3 years ago

If you are working on an Arabidopsis genome, I suggest starting from scratch, either by removing all EDTA-generated files or by working in a new folder. For a small genome, attempting to recover from previous errors, and then chasing the follow-on problems that causes, will only delay your analysis further.

Please use shorter sequence names, not file names.

zmz1988 commented 3 years ago

Thanks! I tried (1) shortening the sequence names to only three letters, (2) running EDTA on a reduced FASTA file (containing only one chromosome), and (3) reducing the parameters to only --anno and --threads: perl /home/EDTA/EDTA.pl --genome necat_ragtag.fasta --cds Araport11_genes.202106.cds.fasta --curatedlib TAIR10_TE.fasta --threads 10 --anno 1. But the problem remains: EDTA still can't find 'necat_ragtag.fasta.mod.EDTA.intact.fa.rename', and I still get the same error message.

I'm wondering what could cause the .fasta.mod.EDTA.intact.fa.rename file to fail to build?

oushujun commented 3 years ago

Can you share a small sequence sample that reproduces the issue?

Thanks! Shujun

zmz1988 commented 3 years ago

Sure. I have already sent it to you by email. Could you please take a look? Thanks in advance!

oushujun commented 3 years ago

Thanks for sharing your files. It appears that the curated library you provided is not a TE library but rather the full set of TEs annotated in the genome. This is not recommended and is not what this option is designed for: --curatedlib should only be given non-redundant exemplar sequences. Nevertheless, this is not the direct cause of your issue.

The real cause is the naming of these sequences, which does not follow the RepeatMasker naming convention. You can check the libraries included in EDTA/database and mimic their naming format. For example, the sequence >AT1TE52125|-|15827287|15838845|ATHILA2|LTR/Gypsy|11559 bp can be formatted as >AT1TE52125#LTR/Gypsy. If you don't know the classification of a sequence (in which case you probably should not include it in the curated library), you can use something as generic as Unknown_00001#unknown/unknown.
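
The renaming described above could be scripted along these lines. This is only a minimal sketch, not part of EDTA: it assumes every header follows the same pipe-delimited layout as the one example above (ID first, classification in the sixth field), and the function names and the `CLASS_FIELD` constant are made up for illustration. It does nothing about redundancy in the library.

```python
# Assumed input layout, inferred from the single example header in this thread:
#   >ID|strand|start|end|family|Class/Superfamily|length bp
CLASS_FIELD = 5  # zero-based index of the "LTR/Gypsy" field (an assumption)


def to_repeatmasker_header(header: str) -> str:
    """Convert one pipe-delimited header to '>name#Class/Superfamily'."""
    text = header.strip().lstrip(">")
    # Leave headers that already look like 'name#classification' untouched.
    if "#" in text and "|" not in text:
        return ">" + text
    fields = text.split("|")
    name = fields[0]
    if len(fields) > CLASS_FIELD and fields[CLASS_FIELD]:
        classification = fields[CLASS_FIELD]
    else:
        # Generic placeholder for sequences without a known classification.
        classification = "unknown/unknown"
    return f">{name}#{classification}"


def rename_library(in_path: str, out_path: str) -> None:
    """Rewrite every header line of a FASTA file; sequence lines are copied as-is."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith(">"):
                dst.write(to_repeatmasker_header(line) + "\n")
            else:
                dst.write(line)


if __name__ == "__main__":
    example = ">AT1TE52125|-|15827287|15838845|ATHILA2|LTR/Gypsy|11559 bp"
    print(to_repeatmasker_header(example))
```

To convert a whole file, something like `rename_library("TAIR10_TE.fasta", "TAIR10_TE.renamed.fasta")` would do, where the output file name is hypothetical. It is worth comparing a few of the resulting headers against the libraries in EDTA/database before using the file.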

Shujun

zmz1988 commented 3 years ago

Thanks a lot, @oushujun! I should've checked the TE file more carefully!

I got the TE file from this issue [https://github.com/oushujun/EDTA/issues/198], and thought I could probably use it directly. I will try to change the TE naming in this file as you suggested!

Many thanks again!!!

oushujun commented 3 years ago

You may want to look for a true Ath TE database; the file you used is not a TE library. I think there is one somewhere in TAIR or Araport, or at least the Repbase version is close enough.

Shujun
