oushujun / EDTA

Extensive de-novo TE Annotator
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1905-y
GNU General Public License v3.0

TIR learner issue - yield no DNA Transposon - others like LTR or Helitron works fine #154

Closed: bioteksampath closed this issue 3 years ago

bioteksampath commented 3 years ago

Hi Shujun, my EDTA runs with both the test data and my own data don't yield any DNA TEs, but LTR and Helitron detection works fine. I'm wondering what the issue might be.

My log mentions a DRMAA library issue. Could that be the cause? Do you have any solution? Thanks.

Log report from the test data:

Tue Jan 26 19:24:08 CST 2021 Dependency checking: All passed!

    A custom library ../database/rice6.9.5.liban is provided via --curatedlib. Please make sure this is a manually curated library but not machine generated.

    A CDS file genome.cds.fa is provided via --cds. Please make sure this is the DNA sequence of coding regions only.

    A BED file is provided via --exclude. Regions specified by this file will be excluded from TE annotation and masking.

Tue Jan 26 19:24:12 CST 2021 Obtain raw TE libraries using various structure-based programs:

Tue Jan 26 19:24:12 CST 2021 EDTA_raw: Check dependencies, prepare working directories.

Tue Jan 26 19:24:16 CST 2021 Start to find LTR candidates.

Tue Jan 26 19:24:16 CST 2021 Identify LTR retrotransposon candidates from scratch.

Warning: LOC list genome.fa.mod.ltrTE.veryfalse is empty.

Tue Jan 26 19:24:52 CST 2021 Finish finding LTR candidates.

Tue Jan 26 19:24:52 CST 2021 Start to find TIR candidates.

Tue Jan 26 19:24:53 CST 2021 Identify TIR candidates from scratch.

Species: others

Tue Jan 26 19:25:57 CST 2021 Finish finding TIR candidates.

Tue Jan 26 19:25:57 CST 2021 Start to find Helitron candidates.

Tue Jan 26 19:25:57 CST 2021 Identify Helitron candidates from scratch.

Tue Jan 26 19:26:34 CST 2021 Finish finding Helitron candidates.

Tue Jan 26 19:26:34 CST 2021 Execution of EDTA_raw.pl is finished!

Tue Jan 26 19:26:35 CST 2021 Obtain raw TE libraries finished. All intact TEs found by EDTA:

    genome.fa.mod.EDTA.intact.fa
    genome.fa.mod.EDTA.intact.gff3

Tue Jan 26 19:26:35 CST 2021 Perform EDTA advcance filtering for raw TE candidates and generate the stage 1 library:

Tue Jan 26 19:27:33 CST 2021 EDTA advcance filtering finished.

Tue Jan 26 19:27:33 CST 2021 Perform EDTA final steps to generate a non-redundant comprehensive TE library:

                            Use RepeatModeler to identify any remaining TEs that are missed by structure-based methods.

2021-01-26 19:28:43,667 -WARNING- Grid computing is not available because DRMAA not configured properly: Could not find drmaa library. Please specify its full path using the environment variable DRMAA_LIBRARY_PATH
2021-01-26 19:28:43,699 -INFO- VARS: {'sequence': 'genome.fa.mod.RM.consensi.fa', 'hmm_database': 'rexdb', 'seq_type': 'nucl', 'prefix': 'genome.fa.mod.RM.consensi.fa.rexdb', 'force_write_hmmscan': False, 'processors': 10, 'tmp_dir': './tmp', 'min_coverage': 20, 'max_evalue': 0.001, 'disable_pass2': False, 'pass2_rule': '80-80-80', 'no_library': False, 'no_reverse': False, 'no_cleanup': False, 'p2_identity': 80.0, 'p2_coverage': 80.0, 'p2_length': 80.0}
2021-01-26 19:28:43,699 -INFO- checking dependencies:
2021-01-26 19:28:43,744 -INFO- hmmer 3.3.1 OK
2021-01-26 19:28:44,058 -INFO- blastn 2.10.0+ OK
2021-01-26 19:28:44,061 -INFO- check database rexdb
2021-01-26 19:28:44,061 -INFO- db path: /home/sap223/anaconda3/envs/ET/lib/python3.6/site-packages/TEsorter/database
2021-01-26 19:28:44,061 -INFO- db file: REXdb_protein_database_viridiplantae_v3.0_plus_metazoa_v3.hmm
2021-01-26 19:28:44,064 -INFO- REXdb_protein_database_viridiplantae_v3.0_plus_metazoa_v3.hmm OK
2021-01-26 19:28:44,064 -INFO- Start classifying pipeline
2021-01-26 19:28:44,139 -INFO- total 1 sequences
2021-01-26 19:28:44,140 -INFO- translating genome.fa.mod.RM.consensi.fa in six frames
/home/sap223/anaconda3/envs/ET/lib/python3.6/site-packages/Bio/Seq.py:2338: BiopythonWarning: Partial codon, len(sequence) not a multiple of three. Explicitly trim the sequence or add trailing N before translation. This may become an error in future.
BiopythonWarning,
2021-01-26 19:28:44,170 -INFO- HMM scanning against /home/sap223/anaconda3/envs/ET/lib/python3.6/site-packages/TEsorter/database/REXdb_protein_database_viridiplantae_v3.0_plus_metazoa_v3.hmm
2021-01-26 19:28:44,215 -INFO- Creating server instance (pp-1.6.4.4)
2021-01-26 19:28:44,215 -INFO- Running on Python 3.6.12 linux
2021-01-26 19:28:48,017 -INFO- pp local server started with 10 workers
2021-01-26 19:28:48,073 -INFO- Task 0 started
2021-01-26 19:28:48,074 -INFO- Task 1 started
2021-01-26 19:28:48,075 -INFO- Task 2 started
2021-01-26 19:28:48,076 -INFO- Task 3 started
2021-01-26 19:28:48,077 -INFO- Task 4 started
2021-01-26 19:28:48,077 -INFO- Task 5 started
2021-01-26 19:28:48,077 -INFO- Task 4 started
2021-01-26 19:28:48,077 -INFO- Task 5 started
2021-01-26 19:28:48,680 -INFO- generating gene anntations
2021-01-26 19:28:48,697 -INFO- 0 sequences classified by HMM
2021-01-26 19:28:48,697 -INFO- see protein domain sequences in genome.fa.mod.RM.consensi.fa.rexdb.dom.faa and annotation gff3 file in genome.fa.mod.RM.consensi.fa.rexdb.dom.gff3
2021-01-26 19:28:48,697 -WARNING- skipping pass-2 classification for zero classification in step-1
2021-01-26 19:28:48,697 -INFO- see classified sequences in genome.fa.mod.RM.consensi.fa.rexdb.cls.tsv
2021-01-26 19:28:48,698 -INFO- writing library for RepeatMasker in genome.fa.mod.RM.consensi.fa.rexdb.cls.lib
2021-01-26 19:28:48,703 -INFO- writing classified protein domains in genome.fa.mod.RM.consensi.fa.rexdb.cls.pep
2021-01-26 19:28:48,707 -INFO- Summary of classifications: Order Superfamily # of Sequences# of Clade Sequences # of Clades# of full Domains
2021-01-26 19:28:48,708 -INFO- Pipeline done.
2021-01-26 19:28:48,708 -INFO- cleaning the temporary directory ./tmp

    Input file "genome.fa.mod.RepeatModeler.raw.fa.masked" not found!

    Usage: perl cleanup_tandem.pl -f sample.fa [options] > sample.cln.fa
    Options:
            -misschar       [n|l]   Define the letter representing unknown sequences; default: n. l: recognize lower case letters
            -Nscreen        [0|1]   Enable (1) or disable (0) the -nc parameter; default: 1
            -nc             [int]   Ambuguous sequence len cutoff; discard the entire sequence if > this number; default: 0
            -nr             [0-1]   Ambuguous sequence percentage cutoff; discard the entire sequence if > this number; default: 1
            -minlen         [int]   Minimum sequence length filter after clean up; default: 100 (bp)
            -maxlen         [int]   Maximum sequence length filter after clean up; default: 25000 (bp)
            -cleanN         [0|1]   Retain (0) or remove (1) the -misschar taget in output sequence; default: 0
            -cleanT         [0|1]   Remove entire seq. if any terminal seq (20bp) has 15bp of N (1); disabled by default (0).
            -minrm          [int]   The minimum length of -misschar to be removed if -cleanN 1; default: 1.
            -trf            [0|1]   Enable (1) or disable (0) tandem repeat finder (trf); default: 1
            -trf_path       path    Path to the trf program

Tue Jan 26 19:29:08 CST 2021 Clean up TE-related sequences in the CDS file with TEsorter:

2021-01-26 19:29:09,952 -WARNING- Grid computing is not available because DRMAA not configured properly: Could not find drmaa library. Please specify its full path using the environment variable DRMAA_LIBRARY_PATH
2021-01-26 19:29:09,978 -INFO- VARS: {'sequence': 'genome.cds.fa.code', 'hmm_database': 'rexdb', 'seq_type': 'nucl', 'prefix': 'genome.cds.fa.code.rexdb', 'force_write_hmmscan': False, 'processors': 10, 'tmp_dir': './tmp', 'min_coverage': 20, 'max_evalue': 0.001, 'disable_pass2': False, 'pass2_rule': '80-80-80', 'no_library': False, 'no_reverse': False, 'no_cleanup': False, 'p2_identity': 80.0, 'p2_coverage': 80.0, 'p2_length': 80.0}
2021-01-26 19:29:09,978 -INFO- checking dependencies:
2021-01-26 19:29:10,004 -INFO- hmmer 3.3.1 OK
2021-01-26 19:29:10,320 -INFO- blastn 2.10.0+ OK
2021-01-26 19:29:10,323 -INFO- check database rexdb
2021-01-26 19:29:10,323 -INFO- db path: /home/sap223/anaconda3/envs/ET/lib/python3.6/site-packages/TEsorter/database
2021-01-26 19:29:10,323 -INFO- db file: REXdb_protein_database_viridiplantae_v3.0_plus_metazoa_v3.hmm
2021-01-26 19:29:10,324 -INFO- REXdb_protein_database_viridiplantae_v3.0_plus_metazoa_v3.hmm OK
2021-01-26 19:29:10,324 -INFO- Start classifying pipeline
2021-01-26 19:29:10,397 -INFO- total 139 sequences
2021-01-26 19:29:10,397 -INFO- translating genome.cds.fa.code in six frames
/home/sap223/anaconda3/envs/ET/lib/python3.6/site-packages/Bio/Seq.py:2338: BiopythonWarning: Partial codon, len(sequence) not a multiple of three. Explicitly trim the sequence or add trailing N before translation. This may become an error in future.
BiopythonWarning,
2021-01-26 19:29:10,675 -INFO- HMM scanning against /home/sap223/anaconda3/envs/ET/lib/python3.6/site-packages/TEsorter/database/REXdb_protein_database_viridiplantae_v3.0_plus_metazoa_v3.hmm
2021-01-26 19:29:10,737 -INFO- Creating server instance (pp-1.6.4.4)
2021-01-26 19:29:10,737 -INFO- Running on Python 3.6.12 linux
2021-01-26 19:29:14,444 -INFO- pp local server started with 10 workers
2021-01-26 19:29:14,491 -INFO- Task 0 started
2021-01-26 19:29:14,493 -INFO- Task 1 started
2021-01-26 19:29:14,493 -INFO- Task 2 started
2021-01-26 19:29:14,493 -INFO- Task 3 started
2021-01-26 19:29:14,494 -INFO- Task 4 started
2021-01-26 19:29:14,495 -INFO- Task 5 started
2021-01-26 19:29:14,495 -INFO- Task 6 started
2021-01-26 19:29:14,497 -INFO- Task 7 started
2021-01-26 19:29:14,498 -INFO- Task 8 started
2021-01-26 19:29:14,499 -INFO- Task 9 started
2021-01-26 19:29:18,022 -INFO- generating gene anntations
2021-01-26 19:29:18,056 -INFO- 2 sequences classified by HMM
2021-01-26 19:29:18,056 -INFO- see protein domain sequences in genome.cds.fa.code.rexdb.dom.faa and annotation gff3 file in genome.cds.fa.code.rexdb.dom.gff3
2021-01-26 19:29:18,056 -INFO- classifying the unclassified sequences by searching against the classified ones
2021-01-26 19:29:18,079 -INFO- using the 80-80-80 rule
2021-01-26 19:29:18,079 -INFO- run CMD: makeblastdb -in ./tmp/pass1_classified.fa -dbtype nucl
2021-01-26 19:29:18,373 -INFO- run CMD: blastn -query ./tmp/pass1_unclassified.fa -db ./tmp/pass1_classified.fa -out ./tmp/pass1_unclassified.fa.blastout -outfmt '6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen qcovs qcovhsp sstrand' -num_threads 10
2021-01-26 19:29:18,771 -INFO- 1 sequences classified in pass 2
2021-01-26 19:29:18,772 -INFO- total 3 sequences classified.
2021-01-26 19:29:18,772 -INFO- see classified sequences in genome.cds.fa.code.rexdb.cls.tsv
2021-01-26 19:29:18,772 -INFO- writing library for RepeatMasker in genome.cds.fa.code.rexdb.cls.lib
2021-01-26 19:29:18,786 -INFO- writing classified protein domains in genome.cds.fa.code.rexdb.cls.pep
2021-01-26 19:29:18,792 -INFO- Summary of classifications: Order Superfamily # of Sequences# of Clade Sequences # of Clades# of full Domains
LTR Gypsy 1 1 1 0
Maverick unknown 2 0 0 0
2021-01-26 19:29:18,792 -INFO- Pipeline done.
2021-01-26 19:29:18,792 -INFO- cleaning the temporary directory ./tmp

Remove CDS-related sequences in the EDTA library:

Tue Jan 26 19:29:40 CST 2021 Combine the high-quality TE library rice6.9.5.liban with the EDTA library:

Tue Jan 26 19:29:54 CST 2021 EDTA final stage finished! You may check out:

    The final EDTA TE library: genome.fa.mod.EDTA.TElib.fa
    Family names of intact TEs have been updated by rice6.9.5.liban: genome.fa.mod.EDTA.intact.gff3
    Comparing to the provided library, EDTA found these novel TEs: genome.fa.mod.EDTA.TElib.novel.fa
    The provided library has been incorporated into the final library: genome.fa.mod.EDTA.TElib.fa

Tue Jan 26 19:29:54 CST 2021 Perform post-EDTA analysis for whole-genome annotation:

Tue Jan 26 19:29:54 CST 2021 Homology-based annotation of TEs using genome.fa.mod.EDTA.TElib.fa from scratch.

Tue Jan 26 19:30:06 CST 2021 TE annotation using the EDTA library has finished! Check out:

    Whole-genome TE annotation (total TE: 35.78%): genome.fa.mod.EDTA.TEanno.gff3
    Whole-genome TE annotation summary: genome.fa.mod.EDTA.TEanno.sum
    Low-threshold TE masking for MAKER gene annotation (masked: 16.32%): genome.fa.mod.MAKER.masked

Tue Jan 26 19:30:06 CST 2021 Evaluate the level of inconsistency for whole-genome TE annotation (slow step):

Tue Jan 26 19:32:02 CST 2021 Evaluation of TE annotation finished! Check out these files:

                            Overall: genome.fa.mod.EDTA.TE.fa.stat.all.sum
                            Nested: genome.fa.mod.EDTA.TE.fa.stat.nested.sum
                            Non-nested: genome.fa.mod.EDTA.TE.fa.stat.redun.sum

oushujun commented 3 years ago

Hi Sam,

Your log looks fine to me. You can ignore the DRMAA warning because it does no harm to the annotation. Please check the genome.fa.mod.EDTA.TEanno.sum file for a summary of the TE annotations. If you encounter TIR-related errors, please post them here.

Best, Shujun
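
For anyone who wants to double-check whether TIR elements made it into the annotation, one option besides reading the .sum file is to tally the feature types in the GFF3 output. The following is only a minimal sketch, not part of EDTA: it relies on the standard GFF3 layout (tab-separated, feature type in column 3, comment lines starting with `#`), the helper names `count_feature_types`, `tir_like` and `report` are hypothetical, and it assumes that TIR-related feature types contain the string "TIR" in their name, which should be verified against your own output.

```python
from collections import Counter


def count_feature_types(gff3_lines):
    """Tally column 3 (feature type) of GFF3 records, skipping comments and blanks."""
    counts = Counter()
    for line in gff3_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 9:
            continue  # not a well-formed 9-column GFF3 record
        counts[fields[2]] += 1
    return counts


def tir_like(counts, keyword="TIR"):
    """Keep only types whose name contains the keyword.

    Assumption: TIR feature types carry "TIR" in their name; adjust the
    keyword if your GFF3 uses different labels.
    """
    return {t: n for t, n in counts.items() if keyword.lower() in t.lower()}


def report(path):
    """Print per-type counts for a GFF3 file and note whether any TIR-like type exists."""
    with open(path) as handle:
        counts = count_feature_types(handle)
    for feature_type, n in counts.most_common():
        print(f"{n}\t{feature_type}")
    if not tir_like(counts):
        print("No feature type containing 'TIR' was found.")
    return counts
```

Calling `report("genome.fa.mod.EDTA.TEanno.gff3")` (the file name reported in the log above) prints one count per feature type. If nothing TIR-like shows up there either, that would be worth posting in the thread together with any TIR-related error messages.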