oushujun / EDTA

Extensive de-novo TE Annotator
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1905-y
GNU General Public License v3.0

Use docker image to drive GitHub version EDTA #409

Open Wanjie-Feng opened 6 months ago

Wanjie-Feng commented 6 months ago

Hi Shujun, thank you for developing this tool! I ran into some problems while running the test file. Here is the command I ran:

perl ../EDTA.pl --genome genome.fa --overwrite 1 --sensitive 1 --anno 1 --evaluate 1 --threads 80

The problematic output is as follows:

Mon Dec 11 19:51:52 CST 2023    Start to find LTR candidates.

Mon Dec 11 19:51:52 CST 2023    Identify LTR retrotransposon candidates from scratch.

awk: cannot open genome.fa.mod.retriever.scn.extend.fa.rexdb.cls.tsv (No such file or directory)
Warning: LOC list genome.fa.mod.ltrTE.veryfalse is empty.
Mon Dec 11 19:52:11 CST 2023    Finish finding LTR candidates.
Mon Dec 11 19:52:54 CST 2023    Perform EDTA advance filtering for raw TE candidates and generate the stage 1 library: 

No such file or directory at /Data6/wanjie/MP/01hifiasm_hifi_ont_hic/02contig/07mapping_2_NCBI/Gp03/Repeatannotation/testEDTA/EDTA/util/TE_purifier.pl line 108.

        Input file "genome.fa.mod.LTR.raw.fa-genome.fa.mod.TIR.raw.fa.fa" not found!

        Usage: perl cleanup_tandem.pl -f sample.fa [options] > sample.cln.fa 
        Options:
                -misschar       [n|l]   Define the letter representing unknown sequences; default: n. l: recognize lower case letters
                -Nscreen        [0|1]   Enable (1) or disable (0) the -nc parameter; default: 1
                -nc             [int]   Ambuguous sequence len cutoff; discard the entire sequence if > this number; default: 0
                -nr             [0-1]   Ambuguous sequence percentage cutoff; discard the entire sequence if > this number; default: 1
                -minlen         [int]   Minimum sequence length filter after clean up; default: 100 (bp)
                -maxlen         [int]   Maximum sequence length filter after clean up; default: 25000 (bp)
                -cleanN         [0|1]   Retain (0) or remove (1) the -misschar taget in output sequence; default: 0
                -cleanT         [0|1]   Remove entire seq. if any terminal seq (20bp) has 15bp of N (1); disabled by default (0).
                -minrm          [int]   The minimum length of -misschar to be removed if -cleanN 1; default: 1.
                -trf            [0|1]   Enable (1) or disable (0) tandem repeat finder (trf); default: 1
                -trf_path       path    Path to the trf program

The TE1 file genome.fa.mod.LTR.raw.fa.HQ is not found or it's empty!

        A script to purify a TE library based on another TE file containing the target contaminant.
        This is to use the richness difference between TE1 and TE2. Real contaminants in TE1 is rare but rich in TE2.
                Usage: perl TE_purifier.pl -TE1 [fasta] -TE2 [fasta]
                options:        -TE1    [fasta] The file to be purified.
                                -TE2    [fasta] The file that mainly consists of TE1 contaminants.
                                -lower  [0|1]   Mask contaminants in TE1 with lowercase letters (1, default) or Ns (0).
                                -minlen [int]   The shortest length (bp) of sequence matches to be considered. Default: 50
                                -miniden        [int]   The minimum identity (%) to be considered a real match. Default: 60
                                -mindiff        [float] The minimum fold difference in richness between TE1 and TE2 for a 
                                                        sequence to be considered as real to TE1.
                                -repeatmasker   [path]  The directory containing RepeatMasker (default: read from ENV)
                                -blastplus      [path]  The directory containing Blastn (default: read from ENV)
                                -threads        [int]   Number of theads to run this script
                                -help|-h        Display this help info

        Input file "genome.fa.mod.LTR.raw.fa.HQ-genome.fa.mod.Helitron.raw.fa.fa" not found!

        Usage: perl cleanup_tandem.pl -f sample.fa [options] > sample.cln.fa 
        Options:
                -misschar       [n|l]   Define the letter representing unknown sequences; default: n. l: recognize lower case letters
                -Nscreen        [0|1]   Enable (1) or disable (0) the -nc parameter; default: 1
                -nc             [int]   Ambuguous sequence len cutoff; discard the entire sequence if > this number; default: 0
                -nr             [0-1]   Ambuguous sequence percentage cutoff; discard the entire sequence if > this number; default: 1
                -minlen         [int]   Minimum sequence length filter after clean up; default: 100 (bp)
                -maxlen         [int]   Maximum sequence length filter after clean up; default: 25000 (bp)
                -cleanN         [0|1]   Retain (0) or remove (1) the -misschar taget in output sequence; default: 0
                -cleanT         [0|1]   Remove entire seq. if any terminal seq (20bp) has 15bp of N (1); disabled by default (0).
                -minrm          [int]   The minimum length of -misschar to be removed if -cleanN 1; default: 1.
                -trf            [0|1]   Enable (1) or disable (0) tandem repeat finder (trf); default: 1
                -trf_path       path    Path to the trf program

No such file or directory at /Data6/wanjie/MP/01hifiasm_hifi_ont_hic/02contig/07mapping_2_NCBI/Gp03/Repeatannotation/testEDTA/EDTA/util/TE_purifier.pl line 108.

        Input file "genome.fa.mod.Helitron.raw.fa-genome.fa.mod.TIR.raw.fa.fa" not found!

        Usage: perl cleanup_tandem.pl -f sample.fa [options] > sample.cln.fa 
        Options:
                -misschar       [n|l]   Define the letter representing unknown sequences; default: n. l: recognize lower case letters
                -Nscreen        [0|1]   Enable (1) or disable (0) the -nc parameter; default: 1
                -nc             [int]   Ambuguous sequence len cutoff; discard the entire sequence if > this number; default: 0
                -nr             [0-1]   Ambuguous sequence percentage cutoff; discard the entire sequence if > this number; default: 1
                -minlen         [int]   Minimum sequence length filter after clean up; default: 100 (bp)
                -maxlen         [int]   Maximum sequence length filter after clean up; default: 25000 (bp)
                -cleanN         [0|1]   Retain (0) or remove (1) the -misschar taget in output sequence; default: 0
                -cleanT         [0|1]   Remove entire seq. if any terminal seq (20bp) has 15bp of N (1); disabled by default (0).
                -minrm          [int]   The minimum length of -misschar to be removed if -cleanN 1; default: 1.
                -trf            [0|1]   Enable (1) or disable (0) tandem repeat finder (trf); default: 1
                -trf_path       path    Path to the trf program

The TE1 file genome.fa.mod.Helitron.raw.fa.HQ is not found or it's empty!

        A script to purify a TE library based on another TE file containing the target contaminant.
        This is to use the richness difference between TE1 and TE2. Real contaminants in TE1 is rare but rich in TE2.
                Usage: perl TE_purifier.pl -TE1 [fasta] -TE2 [fasta]
                options:        -TE1    [fasta] The file to be purified.
                                -TE2    [fasta] The file that mainly consists of TE1 contaminants.
                                -lower  [0|1]   Mask contaminants in TE1 with lowercase letters (1, default) or Ns (0).
                                -minlen [int]   The shortest length (bp) of sequence matches to be considered. Default: 50
                                -miniden        [int]   The minimum identity (%) to be considered a real match. Default: 60
                                -mindiff        [float] The minimum fold difference in richness between TE1 and TE2 for a 
                                                        sequence to be considered as real to TE1.
                                -repeatmasker   [path]  The directory containing RepeatMasker (default: read from ENV)
                                -blastplus      [path]  The directory containing Blastn (default: read from ENV)
                                -threads        [int]   Number of theads to run this script
                                -help|-h        Display this help info

        Input file "genome.fa.mod.Helitron.raw.fa.HQ-genome.fa.mod.LTR.raw.fa.fa" not found!

        Usage: perl cleanup_tandem.pl -f sample.fa [options] > sample.cln.fa 
        Options:
                -misschar       [n|l]   Define the letter representing unknown sequences; default: n. l: recognize lower case letters
                -Nscreen        [0|1]   Enable (1) or disable (0) the -nc parameter; default: 1
                -nc             [int]   Ambuguous sequence len cutoff; discard the entire sequence if > this number; default: 0
                -nr             [0-1]   Ambuguous sequence percentage cutoff; discard the entire sequence if > this number; default: 1
                -minlen         [int]   Minimum sequence length filter after clean up; default: 100 (bp)
                -maxlen         [int]   Maximum sequence length filter after clean up; default: 25000 (bp)
                -cleanN         [0|1]   Retain (0) or remove (1) the -misschar taget in output sequence; default: 0
                -cleanT         [0|1]   Remove entire seq. if any terminal seq (20bp) has 15bp of N (1); disabled by default (0).
                -minrm          [int]   The minimum length of -misschar to be removed if -cleanN 1; default: 1.
                -trf            [0|1]   Enable (1) or disable (0) tandem repeat finder (trf); default: 1
                -trf_path       path    Path to the trf program

No such file or directory at /Data6/wanjie/MP/01hifiasm_hifi_ont_hic/02contig/07mapping_2_NCBI/Gp03/Repeatannotation/testEDTA/EDTA/util/TE_purifier.pl line 108.

        Input file "genome.fa.mod.TIR.raw.fa-genome.fa.mod.LTR.raw.fa.fa" not found!

        Usage: perl cleanup_tandem.pl -f sample.fa [options] > sample.cln.fa 
        Options:
                -misschar       [n|l]   Define the letter representing unknown sequences; default: n. l: recognize lower case letters
                -Nscreen        [0|1]   Enable (1) or disable (0) the -nc parameter; default: 1
                -nc             [int]   Ambuguous sequence len cutoff; discard the entire sequence if > this number; default: 0
                -nr             [0-1]   Ambuguous sequence percentage cutoff; discard the entire sequence if > this number; default: 1
                -minlen         [int]   Minimum sequence length filter after clean up; default: 100 (bp)
                -maxlen         [int]   Maximum sequence length filter after clean up; default: 25000 (bp)
                -cleanN         [0|1]   Retain (0) or remove (1) the -misschar taget in output sequence; default: 0
                -cleanT         [0|1]   Remove entire seq. if any terminal seq (20bp) has 15bp of N (1); disabled by default (0).
                -minrm          [int]   The minimum length of -misschar to be removed if -cleanN 1; default: 1.
                -trf            [0|1]   Enable (1) or disable (0) tandem repeat finder (trf); default: 1
                -trf_path       path    Path to the trf program

The TE1 file genome.fa.mod.TIR.raw.fa.HQ is not found or it's empty!

        A script to purify a TE library based on another TE file containing the target contaminant.
        This is to use the richness difference between TE1 and TE2. Real contaminants in TE1 is rare but rich in TE2.
                Usage: perl TE_purifier.pl -TE1 [fasta] -TE2 [fasta]
                options:        -TE1    [fasta] The file to be purified.
                                -TE2    [fasta] The file that mainly consists of TE1 contaminants.
                                -lower  [0|1]   Mask contaminants in TE1 with lowercase letters (1, default) or Ns (0).
                                -minlen [int]   The shortest length (bp) of sequence matches to be considered. Default: 50
                                -miniden        [int]   The minimum identity (%) to be considered a real match. Default: 60
                                -mindiff        [float] The minimum fold difference in richness between TE1 and TE2 for a 
                                                        sequence to be considered as real to TE1.
                                -repeatmasker   [path]  The directory containing RepeatMasker (default: read from ENV)
                                -blastplus      [path]  The directory containing Blastn (default: read from ENV)
                                -threads        [int]   Number of theads to run this script
                                -help|-h        Display this help info

        Input file "genome.fa.mod.TIR.raw.fa.HQ-genome.fa.mod.Helitron.raw.fa.fa" not found!

        Usage: perl cleanup_tandem.pl -f sample.fa [options] > sample.cln.fa 
        Options:
                -misschar       [n|l]   Define the letter representing unknown sequences; default: n. l: recognize lower case letters
                -Nscreen        [0|1]   Enable (1) or disable (0) the -nc parameter; default: 1
                -nc             [int]   Ambuguous sequence len cutoff; discard the entire sequence if > this number; default: 0
                -nr             [0-1]   Ambuguous sequence percentage cutoff; discard the entire sequence if > this number; default: 1
                -minlen         [int]   Minimum sequence length filter after clean up; default: 100 (bp)
                -maxlen         [int]   Maximum sequence length filter after clean up; default: 25000 (bp)
                -cleanN         [0|1]   Retain (0) or remove (1) the -misschar taget in output sequence; default: 0
                -cleanT         [0|1]   Remove entire seq. if any terminal seq (20bp) has 15bp of N (1); disabled by default (0).
                -minrm          [int]   The minimum length of -misschar to be removed if -cleanN 1; default: 1.
                -trf            [0|1]   Enable (1) or disable (0) tandem repeat finder (trf); default: 1
                -trf_path       path    Path to the trf program

RepeatMasker version 4.1.5

WARNING: The nolow option should be used with caution.  This option
         doesn't simply filter out simple repeats and low-complexity
         annotations from the output, rather it doesn't run these
         searches at all.  The simple repeats, and low-complexity
         sequences may then be falsely annotated as fragments of
         TE families that contain short stretches of them.

Search Engine: NCBI/RMBLAST [ 2.14.1+ ]
RepeatMasker::setspecies: Could not find user specified library genome.fa.mod.LTR.raw.fa.HQ, or the file is empty.

        Input file "genome.fa.mod.TIR.Helitron.fa.stg1.raw.masked" not found!

        Usage: perl cleanup_tandem.pl -f sample.fa [options] > sample.cln.fa 
        Options:
                -misschar       [n|l]   Define the letter representing unknown sequences; default: n. l: recognize lower case letters
                -Nscreen        [0|1]   Enable (1) or disable (0) the -nc parameter; default: 1
                -nc             [int]   Ambuguous sequence len cutoff; discard the entire sequence if > this number; default: 0
                -nr             [0-1]   Ambuguous sequence percentage cutoff; discard the entire sequence if > this number; default: 1
                -minlen         [int]   Minimum sequence length filter after clean up; default: 100 (bp)
                -maxlen         [int]   Maximum sequence length filter after clean up; default: 25000 (bp)
                -cleanN         [0|1]   Retain (0) or remove (1) the -misschar taget in output sequence; default: 0
                -cleanT         [0|1]   Remove entire seq. if any terminal seq (20bp) has 15bp of N (1); disabled by default (0).
                -minrm          [int]   The minimum length of -misschar to be removed if -cleanN 1; default: 1.
                -trf            [0|1]   Enable (1) or disable (0) tandem repeat finder (trf); default: 1
                -trf_path       path    Path to the trf program

ERROR: Input sequence file is not exist!
Iteratively clean up nested TE insertions and remove redundancy.

Further info:
Each sequence will be used as query to search the entire file.
For a subject sequence containing >95% of the query sequence, the matching part in the subject will be removed.
After removal, subject sequences shorter than the threadshold will be diacarded.
The number of rounds of iterations is automatically decided (usually less than 8). User can also define this.

Usage:
perl cleanup_nested.pl -in file.fasta [options]
-in     [file]  Input sequence file in FASTA format
-cov    [float] Minimum coverage of the query sequence to be considered as nesting. Default: 0.95
-minlen [int]   Minimum length of the clean sequence to retain. Default: 80 (bp)
-miniden        [int]   Minimum identity of the clean sequence to retain. Default: 80 (%)
-clean  [int]   Clean nested sequences (1) or not (0). Default: 1
-iter   [int]   Numbers of iteration to remove redundency. Default: automatic
-blastplus [path]       Path to the blastn and makeblastdb program.
-threads|-t     [int]   Threads to run this script. Default: 4

cat: genome.fa.mod.TIR.Helitron.fa.stg1.raw.cln.cln: No such file or directory
Mon Dec 11 19:53:02 CST 2023    EDTA advance filtering finished.

Mon Dec 11 19:53:02 CST 2023    Perform EDTA final steps to generate a non-redundant comprehensive TE library:

                                Use RepeatModeler to identify any remaining TEs that are missed by structure-based methods.

cat: 'RM_*/consensi.fa': No such file or directory
                                RepeatModeler is finished, but no consensi.fa files found.

                                Skipping the CDS cleaning step (--cds [File]) since no CDS file is provided or it's empty.

Mon Dec 11 19:53:40 CST 2023    EDTA final stage finished! You may check out:
                                The final EDTA TE library: genome.fa.mod.EDTA.TElib.fa
Mon Dec 11 19:53:40 CST 2023    Perform post-EDTA analysis for whole-genome annotation:

I'm not sure if these errors have any effect. Can you give me some advice?
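
One way to gauge whether these errors matter is to check which of the intermediate files named in the log exist and are non-empty. The following is a minimal sketch, not part of EDTA: `check_outputs` is a hypothetical helper, the file names are copied from the log above, and their location (the run folder itself or a subfolder of it) is an assumption to adjust.

```shell
#!/bin/sh
# Hypothetical helper (not part of EDTA): report whether each file is
# missing, empty, or present with content.
check_outputs() {
    for f in "$@"; do
        if [ ! -e "$f" ]; then
            echo "MISSING $f"
        elif [ ! -s "$f" ]; then
            echo "EMPTY   $f"
        else
            echo "OK      $f"
        fi
    done
}

# File names copied from the log above; adjust the paths if the files
# live in a subdirectory of the run folder.
check_outputs \
    genome.fa.mod.LTR.raw.fa \
    genome.fa.mod.TIR.raw.fa \
    genome.fa.mod.Helitron.raw.fa \
    genome.fa.mod.EDTA.TElib.fa
```

If the raw LTR, TIR, and Helitron files are all missing or empty, the later "not found" messages are most likely knock-on effects of the first failed step rather than independent problems.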

oushujun commented 6 months ago

Hi, this is a rarely seen error. Can you run the pipeline on the data in the /test folder to check whether it was installed correctly?
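
A quick complementary check is whether the external programs the pipeline calls are visible on `PATH`. This is only a sketch under assumptions: `missing_tools` is a hypothetical helper, and the program list is limited to names that appear in the log above, so it is not the complete dependency list of EDTA.

```shell
#!/bin/sh
# Hypothetical helper: print every named program that is not on PATH.
# Returns non-zero if at least one program is missing.
missing_tools() {
    status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            status=1
        fi
    done
    return "$status"
}

# Program names taken from the log above; EDTA depends on more tools
# than these, so a clean result here does not prove a complete install.
if missing=$(missing_tools perl awk RepeatMasker RepeatModeler trf blastn makeblastdb); then
    echo "All listed programs were found on PATH."
else
    echo "Not found on PATH:"
    echo "$missing"
fi
```

Running this in the same shell (or conda environment) that launches `EDTA.pl` matters, because `PATH` can differ between environments.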

Shujun

oushujun commented 5 months ago

@WAN-f12 any luck?

Wanjie-Feng commented 5 months ago

@oushujun
Thank you for following up. I later changed how I run it: I used the Docker image together with the latest version of the EDTA.pl file to process my data, and it ran without any problems.

oushujun commented 5 months ago

Can you please share your method here? Seems very creative!

Shujun


Wanjie-Feng commented 5 months ago

I first pulled the Docker image (version 2.0) and then cloned the latest version of the EDTA repository with git. That way I rely on the environment inside the Docker image but run the latest version of EDTA. I was inspired by this note of yours: "Once the conda environment is set up, you can use it to drive other versions of EDTA. For example, if you have the EDTA v1.9.6 installed via conda, you may git clone the latest version, activate the v1.9.6 conda env, then specify the path to the freshly cloned EDTA to use it."

oushujun commented 5 months ago

@WAN-f12 Can you please share your commands doing so? Thanks!
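
The exact commands were not posted in this thread. A minimal sketch of what such a setup might look like is given below, under loudly stated assumptions: the image name and tag `oushujun/edta:2.0.0`, the `/in` mount point, and the helper `edta_docker_cmd` are all illustrative, and whether the image's entrypoint allows calling `perl` directly has not been verified here. The script only prints the `docker run` command so that nothing is pulled or executed by accident.

```shell
#!/bin/sh
# Sketch only: image tag, mount point, and paths are assumptions.
IMAGE="${IMAGE:-oushujun/edta:2.0.0}"   # assumed tag for the "version 2.0" image
WORKDIR="${WORKDIR:-$PWD}"              # host folder holding genome.fa and the EDTA clone

# Print the docker command that would run the cloned EDTA.pl inside the image.
# The host folder is mounted at /in, so the clone is visible as /in/EDTA.
edta_docker_cmd() {
    genome="$1"
    threads="$2"
    echo "docker run --rm -v $WORKDIR:/in -w /in $IMAGE" \
         "perl /in/EDTA/EDTA.pl --genome $genome --threads $threads"
}

# Preparation steps, commented out so the sketch has no side effects:
#   git clone https://github.com/oushujun/EDTA.git "$WORKDIR/EDTA"
#   docker pull "$IMAGE"
# Then review and run the command printed below.
edta_docker_cmd genome.fa 10
```

The idea mirrors the conda advice quoted above: the container supplies the dependencies, while the script that is executed comes from the fresh clone on the mounted host folder.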