oushujun / EDTA

Extensive de-novo TE Annotator
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1905-y
GNU General Public License v3.0

Trouble with EDTA advance filtering #280

Closed MrbrilliantLL closed 2 years ago

MrbrilliantLL commented 2 years ago

Hello Shujun,

I was running EDTA v2.0.1 on a maize genome with the following command:

    lib=/data/songlab/new_Mustart/gene_annotation/ab-ini/Db/NAM_TElib.fasta
    genome=/data/songlab/Mu/w22/w22_unmapped.fasta
    genome_out=/data/songlab/new_Mustart/TE/w22/w22_unmapped.fasta.out
    cds=/data/songlab/new_Mustart/TE/w22/Zm-W22-REFERENCE-NRGENE-2.0_Zm00004b.1.cds.fa
    EDTA.pl --genome $genome --species Maize -t 32 \
        --anno 1 --rmout $genome_out --curatedlib $lib --cds $cds \
        --repeatmasker /data/songlab/tools/micromamba/envs/repeatmasker/bin/RepeatMasker

and got the following error: `Sat Jul  9 12:18:26 CST 2022 Perform EDTA advance filtering for raw TE candidates and generate the stage 1 library:

No such file or directory at /data/songlab/tools/micromamba/envs/EDTA/share/EDTA/util/TE_purifier.pl line 105.

Input file "w22_unmapped.fasta.mod.LTR.raw.fa-w22_unmapped.fasta.mod.TIR.raw.fa.fa" not found!

    Usage: perl cleanup_tandem.pl -f sample.fa [options] > sample.cln.fa 
Options:
    -misschar   [n|l]   Define the letter representing unknown sequences; default: n. l: recognize lower case letters
    -Nscreen    [0|1]   Enable (1) or disable (0) the -nc parameter; default: 1
    -nc     [int]   Ambuguous sequence len cutoff; discard the entire sequence if > this number; default: 0
    -nr     [0-1]   Ambuguous sequence percentage cutoff; discard the entire sequence if > this number; default: 1
    -minlen     [int]   Minimum sequence length filter after clean up; default: 100 (bp)
    -maxlen     [int]   Maximum sequence length filter after clean up; default: 25000 (bp)
    -cleanN     [0|1]   Retain (0) or remove (1) the -misschar taget in output sequence; default: 0
    -cleanT     [0|1]   Remove entire seq. if any terminal seq (20bp) has 15bp of N (1); disabled by default (0).
    -minrm      [int]   The minimum length of -misschar to be removed if -cleanN 1; default: 1.
    -trf        [0|1]   Enable (1) or disable (0) tandem repeat finder (trf); default: 1
    -trf_path   path    Path to the trf program

The TE1 file w22_unmapped.fasta.mod.LTR.raw.fa.HQ is not found or it's empty!

A script to purify a TE library based on another TE file containing the target contaminant.
This is to use the richness difference between TE1 and TE2. Real contaminants in TE1 is rare but rich in TE2.
    Usage: perl TE_purifier.pl -TE1 [fasta] -TE2 [fasta]
    options:    -TE1    [fasta] The file to be purified.
            -TE2    [fasta] The file that mainly consists of TE1 contaminants.
            -lower  [0|1]   Mask contaminants in TE1 with lowercase letters (1, default) or Ns (0).
            -minlen [int]   The shortest length (bp) of sequence matches to be considered. Default: 50
            -miniden    [int]   The minimum identity (%) to be considered a real match. Default: 60
            -mindiff    [float] The minimum fold difference in richness between TE1 and TE2 for a 
                        sequence to be considered as real to TE1.
            -repeatmasker   [path]  The directory containing RepeatMasker (default: read from ENV)
            -blastplus  [path]  The directory containing Blastn (default: read from ENV)
            -threads    [int]   Number of theads to run this script
            -help|-h    Display this help info

Input file "w22_unmapped.fasta.mod.LTR.raw.fa.HQ-w22_unmapped.fasta.mod.Helitron.raw.fa.fa" not found!

    Usage: perl cleanup_tandem.pl -f sample.fa [options] > sample.cln.fa 
Options:
    -misschar   [n|l]   Define the letter representing unknown sequences; default: n. l: recognize lower case letters
    -Nscreen    [0|1]   Enable (1) or disable (0) the -nc parameter; default: 1
    -nc     [int]   Ambuguous sequence len cutoff; discard the entire sequence if > this number; default: 0
    -nr     [0-1]   Ambuguous sequence percentage cutoff; discard the entire sequence if > this number; default: 1
    -minlen     [int]   Minimum sequence length filter after clean up; default: 100 (bp)
    -maxlen     [int]   Maximum sequence length filter after clean up; default: 25000 (bp)
    -cleanN     [0|1]   Retain (0) or remove (1) the -misschar taget in output sequence; default: 0
    -cleanT     [0|1]   Remove entire seq. if any terminal seq (20bp) has 15bp of N (1); disabled by default (0).
    -minrm      [int]   The minimum length of -misschar to be removed if -cleanN 1; default: 1.
    -trf        [0|1]   Enable (1) or disable (0) tandem repeat finder (trf); default: 1
    -trf_path   path    Path to the trf program

No such file or directory at /data/songlab/tools/micromamba/envs/EDTA/share/EDTA/util/TE_purifier.pl line 105.

Input file "w22_unmapped.fasta.mod.Helitron.raw.fa-w22_unmapped.fasta.mod.TIR.raw.fa.fa" not found!

    Usage: perl cleanup_tandem.pl -f sample.fa [options] > sample.cln.fa 
Options:
    -misschar   [n|l]   Define the letter representing unknown sequences; default: n. l: recognize lower case letters
    -Nscreen    [0|1]   Enable (1) or disable (0) the -nc parameter; default: 1
    -nc     [int]   Ambuguous sequence len cutoff; discard the entire sequence if > this number; default: 0
    -nr     [0-1]   Ambuguous sequence percentage cutoff; discard the entire sequence if > this number; default: 1
    -minlen     [int]   Minimum sequence length filter after clean up; default: 100 (bp)
    -maxlen     [int]   Maximum sequence length filter after clean up; default: 25000 (bp)
    -cleanN     [0|1]   Retain (0) or remove (1) the -misschar taget in output sequence; default: 0
    -cleanT     [0|1]   Remove entire seq. if any terminal seq (20bp) has 15bp of N (1); disabled by default (0).
    -minrm      [int]   The minimum length of -misschar to be removed if -cleanN 1; default: 1.
    -trf        [0|1]   Enable (1) or disable (0) tandem repeat finder (trf); default: 1
    -trf_path   path    Path to the trf program

The TE1 file w22_unmapped.fasta.mod.Helitron.raw.fa.HQ is not found or it's empty!

A script to purify a TE library based on another TE file containing the target contaminant.
This is to use the richness difference between TE1 and TE2. Real contaminants in TE1 is rare but rich in TE2.
    Usage: perl TE_purifier.pl -TE1 [fasta] -TE2 [fasta]
    options:    -TE1    [fasta] The file to be purified.
            -TE2    [fasta] The file that mainly consists of TE1 contaminants.
            -lower  [0|1]   Mask contaminants in TE1 with lowercase letters (1, default) or Ns (0).
            -minlen [int]   The shortest length (bp) of sequence matches to be considered. Default: 50
            -miniden    [int]   The minimum identity (%) to be considered a real match. Default: 60
            -mindiff    [float] The minimum fold difference in richness between TE1 and TE2 for a 
                        sequence to be considered as real to TE1.
            -repeatmasker   [path]  The directory containing RepeatMasker (default: read from ENV)
            -blastplus  [path]  The directory containing Blastn (default: read from ENV)
            -threads    [int]   Number of theads to run this script
            -help|-h    Display this help info

Input file "w22_unmapped.fasta.mod.Helitron.raw.fa.HQ-w22_unmapped.fasta.mod.LTR.raw.fa.fa" not found!

    Usage: perl cleanup_tandem.pl -f sample.fa [options] > sample.cln.fa 
Options:
    -misschar   [n|l]   Define the letter representing unknown sequences; default: n. l: recognize lower case letters
    -Nscreen    [0|1]   Enable (1) or disable (0) the -nc parameter; default: 1
    -nc     [int]   Ambuguous sequence len cutoff; discard the entire sequence if > this number; default: 0
    -nr     [0-1]   Ambuguous sequence percentage cutoff; discard the entire sequence if > this number; default: 1
    -minlen     [int]   Minimum sequence length filter after clean up; default: 100 (bp)
    -maxlen     [int]   Maximum sequence length filter after clean up; default: 25000 (bp)
    -cleanN     [0|1]   Retain (0) or remove (1) the -misschar taget in output sequence; default: 0
    -cleanT     [0|1]   Remove entire seq. if any terminal seq (20bp) has 15bp of N (1); disabled by default (0).
    -minrm      [int]   The minimum length of -misschar to be removed if -cleanN 1; default: 1.
    -trf        [0|1]   Enable (1) or disable (0) tandem repeat finder (trf); default: 1
    -trf_path   path    Path to the trf program

No such file or directory at /data/songlab/tools/micromamba/envs/EDTA/share/EDTA/util/TE_purifier.pl line 105.

Input file "w22_unmapped.fasta.mod.TIR.raw.fa-w22_unmapped.fasta.mod.LTR.raw.fa.fa" not found!

    Usage: perl cleanup_tandem.pl -f sample.fa [options] > sample.cln.fa 
Options:
    -misschar   [n|l]   Define the letter representing unknown sequences; default: n. l: recognize lower case letters
    -Nscreen    [0|1]   Enable (1) or disable (0) the -nc parameter; default: 1
    -nc     [int]   Ambuguous sequence len cutoff; discard the entire sequence if > this number; default: 0
    -nr     [0-1]   Ambuguous sequence percentage cutoff; discard the entire sequence if > this number; default: 1
    -minlen     [int]   Minimum sequence length filter after clean up; default: 100 (bp)
    -maxlen     [int]   Maximum sequence length filter after clean up; default: 25000 (bp)
    -cleanN     [0|1]   Retain (0) or remove (1) the -misschar taget in output sequence; default: 0
    -cleanT     [0|1]   Remove entire seq. if any terminal seq (20bp) has 15bp of N (1); disabled by default (0).
    -minrm      [int]   The minimum length of -misschar to be removed if -cleanN 1; default: 1.
    -trf        [0|1]   Enable (1) or disable (0) tandem repeat finder (trf); default: 1
    -trf_path   path    Path to the trf program

The TE1 file w22_unmapped.fasta.mod.TIR.raw.fa.HQ is not found or it's empty!

A script to purify a TE library based on another TE file containing the target contaminant.
This is to use the richness difference between TE1 and TE2. Real contaminants in TE1 is rare but rich in TE2.
    Usage: perl TE_purifier.pl -TE1 [fasta] -TE2 [fasta]
    options:    -TE1    [fasta] The file to be purified.
            -TE2    [fasta] The file that mainly consists of TE1 contaminants.
            -lower  [0|1]   Mask contaminants in TE1 with lowercase letters (1, default) or Ns (0).
            -minlen [int]   The shortest length (bp) of sequence matches to be considered. Default: 50
            -miniden    [int]   The minimum identity (%) to be considered a real match. Default: 60
            -mindiff    [float] The minimum fold difference in richness between TE1 and TE2 for a 
                        sequence to be considered as real to TE1.
            -repeatmasker   [path]  The directory containing RepeatMasker (default: read from ENV)
            -blastplus  [path]  The directory containing Blastn (default: read from ENV)
            -threads    [int]   Number of theads to run this script
            -help|-h    Display this help info

Input file "w22_unmapped.fasta.mod.TIR.raw.fa.HQ-w22_unmapped.fasta.mod.Helitron.raw.fa.fa" not found!

    Usage: perl cleanup_tandem.pl -f sample.fa [options] > sample.cln.fa 
Options:
    -misschar   [n|l]   Define the letter representing unknown sequences; default: n. l: recognize lower case letters
    -Nscreen    [0|1]   Enable (1) or disable (0) the -nc parameter; default: 1
    -nc     [int]   Ambuguous sequence len cutoff; discard the entire sequence if > this number; default: 0
    -nr     [0-1]   Ambuguous sequence percentage cutoff; discard the entire sequence if > this number; default: 1
    -minlen     [int]   Minimum sequence length filter after clean up; default: 100 (bp)
    -maxlen     [int]   Maximum sequence length filter after clean up; default: 25000 (bp)
    -cleanN     [0|1]   Retain (0) or remove (1) the -misschar taget in output sequence; default: 0
    -cleanT     [0|1]   Remove entire seq. if any terminal seq (20bp) has 15bp of N (1); disabled by default (0).
    -minrm      [int]   The minimum length of -misschar to be removed if -cleanN 1; default: 1.
    -trf        [0|1]   Enable (1) or disable (0) tandem repeat finder (trf); default: 1
    -trf_path   path    Path to the trf program

RepeatMasker version 4.1.2-p1
Search Engine: NCBI/RMBLAST [ 2.11.0+ ]
RepeatMasker::setspecies: Could not find user specified library w22_unmapped.fasta.mod.LTR.raw.fa.HQ, or the file is empty.

Input file "w22_unmapped.fasta.mod.TIR.Helitron.fa.stg1.raw.masked" not found!

    Usage: perl cleanup_tandem.pl -f sample.fa [options] > sample.cln.fa 
Options:
    -misschar   [n|l]   Define the letter representing unknown sequences; default: n. l: recognize lower case letters
    -Nscreen    [0|1]   Enable (1) or disable (0) the -nc parameter; default: 1
    -nc     [int]   Ambuguous sequence len cutoff; discard the entire sequence if > this number; default: 0
    -nr     [0-1]   Ambuguous sequence percentage cutoff; discard the entire sequence if > this number; default: 1
    -minlen     [int]   Minimum sequence length filter after clean up; default: 100 (bp)
    -maxlen     [int]   Maximum sequence length filter after clean up; default: 25000 (bp)
    -cleanN     [0|1]   Retain (0) or remove (1) the -misschar taget in output sequence; default: 0
    -cleanT     [0|1]   Remove entire seq. if any terminal seq (20bp) has 15bp of N (1); disabled by default (0).
    -minrm      [int]   The minimum length of -misschar to be removed if -cleanN 1; default: 1.
    -trf        [0|1]   Enable (1) or disable (0) tandem repeat finder (trf); default: 1
    -trf_path   path    Path to the trf program

ERROR: Input sequence file is not exist!

Iteratively clean up nested TE insertions and remove redundancy.

Further info: Each sequence will be used as query to search the entire file. For a subject sequence containing >95% of the query sequence, the matching part in the subject will be removed. After removal, subject sequences shorter than the threadshold will be diacarded. The number of rounds of iterations is automatically decided (usually less than 8). User can also define this.

    Usage: perl cleanup_nested.pl -in file.fasta [options]
    -in         [file]  Input sequence file in FASTA format
    -cov        [float] Minimum coverage of the query sequence to be considered as nesting. Default: 0.95
    -minlen     [int]   Minimum length of the clean sequence to retain. Default: 80 (bp)
    -miniden    [int]   Minimum identity of the clean sequence to retain. Default: 80 (%)
    -clean      [int]   Clean nested sequences (1) or not (0). Default: 1
    -iter       [int]   Numbers of iteration to remove redundency. Default: automatic
    -blastplus  [path]  Path to the blastn and makeblastdb program.
    -threads|-t [int]   Threads to run this script. Default: 4

cat: w22_unmapped.fasta.mod.TIR.Helitron.fa.stg1.raw.cln.cln: No such file or directory

Sat Jul  9 12:47:17 CST 2022 EDTA advance filtering finished.`

However, these errors did not prevent the final result (*.TEanno.sum) from being generated.

Thank you very much for your time and consideration!

oushujun commented 2 years ago

Did you finish the EDTA_raw step? For small input files, you may want to consider the --force 1 parameter when some TE types are not detected.

Shujun

MrbrilliantLL commented 2 years ago

Thank you for your quick reply!

I am using EDTA to annotate the maize genome from start to finish, and the EDTA_raw step ran successfully:

    Sat Jul  9 12:18:16 CST 2022 Execution of EDTA_raw.pl is finished!
    Sat Jul  9 12:18:26 CST 2022 Obtain raw TE libraries finished.
    All intact TEs found by EDTA:
        w22_unmapped.fasta.mod.EDTA.intact.fa
        w22_unmapped.fasta.mod.EDTA.intact.gff3

The maize genome should contain the transposon types named in the error report, so I don't think the --force 1 parameter is the right fix here.

oushujun commented 2 years ago

How large is this file, w22_unmapped.fasta? Can you list the EDTA_raw directory with file sizes?
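One way to produce the listing requested above is sketched below. It assumes the raw libraries are the `*.raw.fa` files named in the log; the function name `report_raw_sizes` and the directory passed to it are hypothetical, so point it at wherever EDTA_raw.pl wrote its output on your system.

```shell
# Print the byte size of each raw TE library in a directory and flag empty ones.
# The directory argument is an assumption: adjust it to your own output location.
report_raw_sizes() {
    dir=$1
    for f in "$dir"/*.raw.fa; do
        # If the glob did not expand, there is nothing to report.
        [ -e "$f" ] || { echo "no *.raw.fa files in $dir"; return 1; }
        size=$(wc -c < "$f" | tr -d ' ')
        if [ "$size" -eq 0 ]; then
            echo "EMPTY $f"
        else
            echo "$size $f"
        fi
    done
}

# Example call (hypothetical directory name):
# report_raw_sizes w22_unmapped.fasta.mod.EDTA.raw
```

An empty or missing raw library would be consistent with the "not found or it's empty" messages in the log, although the thread later traces the failure to a different cause.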

Thanks, Shujun

MrbrilliantLL commented 2 years ago

I found the cause of the problem: the RepeatMasker specified in the command is not in the environment that EDTA runs in, which causes an error in the advanced filtering step.

Thanks for your help, I will close this issue.

oushujun commented 2 years ago

Glad you figured it out! Do you mean the RepeatMasker you specified is not working properly under the EDTA conda environment?

Shujun

MrbrilliantLL commented 2 years ago

I used conda to install RepeatMasker into a separate environment and then passed '/envs/repeatmasker/bin/RepeatMasker' to EDTA's --repeatmasker option. This produced the errors reported above.

Later I added '/envs/repeatmasker/bin/RepeatMasker' to the environment variables in my .bashrc and removed the RepeatMasker from 'EDTA/bin', which solved the problem.

I guess the advanced filtering step still calls the RepeatMasker in the EDTA environment, not the one specified by --repeatmasker.
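A quick way to see which RepeatMasker a script will pick up when it falls back to the environment is to compare what PATH resolves to against the path you intended. This is a minimal sketch, not part of EDTA; the function name `check_tool_on_path` and the example path are hypothetical.

```shell
# Compare the executable found on PATH with the one you intend to use.
# "expected" is whatever path you would pass to --repeatmasker; this check
# is an illustration and is not something EDTA performs itself.
check_tool_on_path() {
    tool=$1
    expected=$2
    found=$(command -v "$tool") || { echo "$tool: not on PATH"; return 1; }
    if [ "$found" = "$expected" ]; then
        echo "$tool: PATH matches $expected"
    else
        echo "$tool: PATH resolves to $found, not $expected"
        return 2
    fi
}

# Example call with a placeholder path:
# check_tool_on_path RepeatMasker /path/to/envs/repeatmasker/bin/RepeatMasker
```

Note also that the TE_purifier.pl help text in the log describes its -repeatmasker option as "the directory containing RepeatMasker", whereas the command in this issue passed the path to the executable itself. Whether that mismatch contributed to the failure is not established in this thread.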