oushujun / EDTA

Extensive de-novo TE Annotator
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1905-y
GNU General Public License v3.0

duplicate library sequences with the same id but different sequence #184

Closed zhangrengang closed 3 years ago

zhangrengang commented 3 years ago

In the final library file *.EDTA.TElib.fa, a few elements share the same ID but have different sequences, for example:

>TE_00000453_#LTR/unknown
AATTTTGTTAAATGGTTTTCATCACAATGGTGCCGATAAGTGTATGTATTCCAAATTTACAAAAGATTTTGGTGTGATTATTTGTCTCTACATAGATGACATGTTAATATTTAGCACCAATATGATTGGAATAGTTGAAACCAAAAGGTATCTCACTTCTATCTTTAAAATGAAAGATCTTGGTGAAGTGGATACAATTTTAGGTATCAAAGTTAAGAAACATAGTAGTGGCTATGCACTTAATCAGTCATATTATATTGAGAAAATGCTTGATAAGTTTAAGCATCTCAATATAAAGGAGGCTAATACCCCATTTGACTCTAGCATGAAGTTAAATGATTATTGTGATAAAGCGGTAGCACAACTAGAATATGCTAGTTATATGCAAGTTATCAAGATATACAAGCAAGCCGAATACAGATCATTGGAAGGCTATTGCAAGAGTCTTTGGTTACCTAAAAAGAACGATCGATTTGGGCTTGTTTTATTCTGATTTTCCAGTTGTGATGGAAGGATATGTGATGCAAGTTGGATAACTAGTTCGAGTGATAATATCAGAATTTATTGTTATGGCTGCTGCAGGTAAAGAAGCAGAATGGCTAAGAAATATGTTGTTGATATTAAGTTGTGGCCACAACCTATGTCAGCTATTTCTTTATACTGCGATAGTGAAGCAACTATGTCTCGAGCTTATAGTAACATTTACAATGGTAAGTCAAGACATATAAGCATTCGACATGGATATATTCGAGAGTTGATTACAAATGGGGTAATCACCATTGTCTATGTGAAGTCTGTGAATAATTTAGCGGATCCGCTCACAAAAGGACTATCTAGAGACATGGTAGAAAAACAACTAATGGAATGGGGTTGAAACCCGTTATTAAAGATACCGGTAATGGGAACCCAACTTCGGATCAACAAGAAGCTTATCTCTAAGTTTAATGGGTAATAACAAGTTACTGTTTAGTATCTGTTGGACACTGATAATTAATTTTAGACCCTATTCTGATAGTATTCAGTGTGTTCTATTACGTAAAGGAGGATGAGCGTAGGCTCTTAATGGAATTTAAAGTTCGTGTTTAATGTAATAGAGACATGTATAATTCCACCTATATGAATATAGAAGTGGTGCCGCTTTTGACAAGAGTTAGGGTTTTCTCTTGTAAATATTCATGAAAATAAGATTTTAGCACATGGCCATAATAGTGCTAAACAGTTGTAAACCTCTTTAAGAGTTTGGATAGTATTATGTGTGTAGTATCTTTTATTCTACAACAAAAGTTTTGGTTTAATCTGCGGACACCAATAACTTTAGTAGGATTCAAGTTCTAACACTAATTGAAGGTTTAAATTGCAAAATACCTTCTTGTAAGCATAATTCTATCAAGTGAAAAGACATTCATTACAAACTAGTGGGGC
>TE_00000453_#LTR/unknown
ATAGCAAGCTACTGTGGTAGAAGACAAAACAAGTATTTGTCTTGACAAGTAAGGTAGTATAGATACAAAGGATTTGTAGTAAAAATACTCCTCTTGTAATCTTTTAAACTAGTGAAAATTGTCTATCCTGGGTTTGGCTGCCCCGAAGGGTTTTTTTTTATCTTTGAAAAGTTCTTCAAAAGGTTTTCCCTTCGTAACCAAATAACTTGTTCATTTAAGTTTTCCTGCACTTATATTTATTTGGTTACTGATCAATGTTTGCAAAGTGTTAGCAGATTGCTTCTTAACATACAAGGAACAAAAAGAGACTTTCAAGTGGTATCAGAGCAAGTTCACTCATTCTAGAGTGAGATCTATTTTTCCTATTGAACATGTCAACACTGACTTCACCACCGCAATTCAACTGGTGAGAACTATGCCTACTAGAAGGTGAGGATGAAAGCTTTCCTCAAATCACTAGACGAGAGAGTTGGATCCACTTATTGTAATTGGAATAACAAAGGCTTGAATGTTATATTCATGGCAATGTCTCCTGATGAATTTAAAAGAATCTCCATATGTGAGATAGCCAAGGAAGCTTGGGATATTCTTAAGTTACTCACGAGGAAACCAAGACGGTAAAAAATTCCAAATTATGATGCTAATCTCAAGATTTGAGGAGATAAAAATGCTGGAATATGAATCTTTTAATGAGTTTTACGCAAAAATTAATGATATTGTAAATTTCAAGTTCAATTTAAGAGAAAAAATAGAGGACTCGTGAATTGTAAAGAAGATTCTAAGATCCCTACTGAAAGATTTCGACCAAAGGTAACAACCATAAAAAAAGCAAGGATCTGGACACTGTACATGTTGAAGAATTGGTAGTTCCTTACAAACTTATGAATCTACATTGCCTCGTCAAAAGAAAAGTAAGTCCATTGCACTTAAATCTATTAAAAAGATATATTACTCTTCTGATAGTGATGATCTTAATAGTGAAGATATTGCTCTCGTAGCTAGAAATTTAGAAAATTCGTGTTTAAGAAAAAGAACAACGGTAAAGATAAAAAAAGGAAAAGATTTTGCCAAAAAAAATGATTAAAAATGGAATAAAGTAAAATTGAATCTAAAGAAAGAGTTAAGTGTTTTGAATGCTCAAGATATGGTCACTTAAGAAATGAATGTCCTAATTTCAAAAGAAATAAAGGAAAAGCCCTTAATGTTACATTAAGCGATGAATCTGATTCTAAAAATTTTAATTTTCATCTGATAATGAACTTGTTTTTGTTGCTTTTCTGGTCTTGTTAATGGATGCACTGACATGAACCAAATGTCAATACATTAAGTGATTCTAACAAATACCAAACTGTTAACAAATTTAGTAATTCTAACATAAATAAAACTGTTAGCTATTCTAGCGATTTATATATTACTTTGTTATTGCCGATCATGAATTAACCTTGCAAGAAACTTATGATGACTTATGTGAAGAAGCGTGAAAGTCAGGAAATTGTTCAACAAGCTGACAACTATGGAGAACAGGAAGAACAATCTAGCCAAGGCTTTAAAGCTATCAAAAGTTAAGATATCTAGATCTCTCACCAAGGTCAACCACTTGAGATAAAGTTGATAGTGTGCAAATAAATCAAGAAACAGTTAGCACACATAAACTTGATGGACTACTAAAAGTTGAAAGGGTCAACACTGATCGTGCATGTTTATGATACACAACAGATGAAAATTCAATCAACAATACATCGGTAACCTTCTCTTTACGAAAAAATTACCTTTTTAAAAGATAAATAGAAAAGATTCCATGCTTAAAGAAGAGTCAAACAAATAAGTTGAAAGGAGTATTCTTGTTTTAATTTCAATATTAGACCAGTCTAACTCCGTTTAAATTAAGGTTTTTGTTTGTCTTGCAATACTTTGTAAACTCAAAACTAGAGACTTTCTTCATAATAGAGGAATTGCTTTATGTTTAATGAAGTTAAGCCTAATTGGGATCGGACCCATAC
TCTATTTTTCACAAGTTGTTTCTTGGGTCAAATACCTTGATGATTTTTTGGA...
oushujun commented 3 years ago

Are these from the same library or from different libraries? Within one library, the family ID (e.g., TE_00000453) of an LTR retrotransposon can appear more than once, but the part suffix (_LTR or _INT) differs because I keep the LTR regions and the internal regions as separate entries. Entries that share the same ID belong to the same family.

In your example, TE_00000453_ is the family ID and LTR/unknown is the classification, separated by #. The family ID appears to be incomplete (it lacks _LTR or _INT). You may want to check whether the run finished successfully and which version of EDTA generated it.
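
The naming scheme described above can be checked mechanically. The following is a minimal sketch, not part of EDTA: the regular expression and the names `HEADER_RE`, `TEHeader`, and `parse_header` are assumptions inferred only from the example headers in this thread (a prefix such as TE or RM, an underscore, a number, an optional _LTR or _INT part, then # and the classification). It flags IDs that end in a dangling underscore with no part name.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout, inferred from the headers quoted in this thread:
#   <prefix>_<digits>[_<LTR|INT>]#<classification>
HEADER_RE = re.compile(
    r"^>?(?P<family>[A-Za-z]+_\d+)(?P<sep>_?)(?P<part>LTR|INT)?#(?P<cls>\S+)$"
)


class TEHeader(NamedTuple):
    family: str             # e.g. "TE_00000453"
    part: Optional[str]     # "LTR", "INT", or None when no part is given
    classification: str     # e.g. "LTR/unknown"
    incomplete: bool        # True when the ID ends in "_" without a part name


def parse_header(line: str) -> TEHeader:
    """Split one FASTA header into family ID, part, and classification."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised header: {line!r}")
    part = match.group("part")
    # A separator underscore followed by no part name is the pattern reported here.
    incomplete = match.group("sep") == "_" and part is None
    return TEHeader(match.group("family"), part, match.group("cls"), incomplete)


if __name__ == "__main__":
    for header in (">TE_00000453_#LTR/unknown", ">TE_00001023_INT#LTR/Gypsy"):
        print(parse_header(header))
```

Running `parse_header` over every header line of a library and keeping the entries with `incomplete` set would list the IDs affected by the problem discussed here.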

zhangrengang commented 3 years ago

The version of EDTA is v1.9.4. The run seems to be successful as no error was raised. Not all the LTR IDs are lacking _LTR or _INT, for example:

>TE_00000818_#LTR/unknown
>TE_00000881_#LTR/unknown
>TE_00000946_#LTR/unknown
>TE_00001023_INT#LTR/Gypsy
>TE_00001098_INT#LTR/unknown
>TE_00001211_INT#LTR/Copia
>TE_00001417_INT#LTR/unknown
>TE_00001453_INT#LTR/unknown

Only a few of the IDs that lack _LTR or _INT are duplicated. These abnormal IDs seem to come from elements identified by RepeatModeler, as the IDs in the file *.EDTA.raw.fa look like this:

>RM_00000695#LTR/unknown
>RM_00001070#LTR/unknown
>RM_00001086#LTR/unknown
>RM_00001235#LTR/unknown
>TE_00000644_INT#LTR/Gypsy
>TE_00001676_LTR#LTR/Copia
>TE_00002894_LTR#LTR/Copia
>TE_00002886_INT#LTR/Gypsy
>TE_00001978_LTR#LTR/unknown
zhangrengang commented 3 years ago

I mapped the duplicated IDs by searching for the element sequences. IDs in EDTA.TElib.fa:

>TE_00000453_#LTR/unknown
>TE_00000453_#LTR/unknown
>TE_00000453#Unknown

with mapped IDs in EDTA.raw.fa:

>RM_00000771#LTR/unknown
>RM_00000772#LTR/unknown
>RM_00000773#Unknown
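
A mapping like the one above can be reproduced with a short script. This is a sketch under assumptions: the comment does not say how the sequences were searched, so exact (case-insensitive) sequence identity is assumed here, and `read_fasta` and `map_by_sequence` are hypothetical helper names rather than EDTA functions.

```python
import io
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header without '>', sequence) pairs, joining wrapped sequence lines."""
    header = None
    chunks: List[str] = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def map_by_sequence(
    library: Iterable[Tuple[str, str]], raw: Iterable[Tuple[str, str]]
) -> List[Tuple[str, str]]:
    """Pair each library header with every raw header that has an identical sequence."""
    raw_by_seq: Dict[str, List[str]] = defaultdict(list)
    for raw_id, seq in raw:
        raw_by_seq[seq.upper()].append(raw_id)
    return [
        (lib_id, raw_id)
        for lib_id, seq in library
        for raw_id in raw_by_seq.get(seq.upper(), [])
    ]


if __name__ == "__main__":
    # Toy sequences for illustration only; they are not the real elements.
    lib = io.StringIO(">TE_00000453_#LTR/unknown\nAATT\n>TE_00000453#Unknown\nGGCC\n")
    raw = io.StringIO(">RM_00000771#LTR/unknown\nAATT\n>RM_00000773#Unknown\nGGCC\n")
    for lib_id, raw_id in map_by_sequence(read_fasta(lib), read_fasta(raw)):
        print(lib_id, "<-", raw_id)
```

With real data, the two `io.StringIO` objects would be replaced by open file handles for the TElib.fa and raw.fa files. Sequences that were trimmed or merged between the two files would not match under this exact-identity assumption.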
oushujun commented 3 years ago

This should be a bug in renaming LTR candidates from RepeatModeler when they are added to the EDTA library. I will take a look at the code.

SolomiyaHn commented 3 years ago

Hi Shujun, I am having a similar issue. I am using EDTA v1.9.7. I found the following 3 LTR entries with just a trailing underscore, without an 'INT' or an 'LTR', in *EDTA.TElib.fa:

TE_00000145#Unknown
TE_00000146#Unknown
TE_00000147_#LTR/unknown
TE_00000435_#LTR/unknown
TE_00000478_#LTR/unknown
TE_00000148#Unknown
TE_00000149#Unknown

Two of the ID numbers are repeated in another fasta entry:

TE_00000435#Unknown
TE_00000478#Unknown

I also found one FASTA sequence named TE_00000558#Helitron, while the rest of the Helitrons are named like TE_00000712#DNA/Helitron. I assume this is because that one Helitron was found by RepeatModeler, but I suspect this difference in classification will cause issues in my downstream analysis.
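
One way to spot this kind of labelling inconsistency is to tally the classification string of every header, so that an outlier such as Helitron next to DNA/Helitron stands out. The sketch below is illustrative only: `classification_counts` is a hypothetical helper, and it assumes the classification is whatever follows the first # in the header.

```python
from collections import Counter
from typing import Iterable


def classification_counts(headers: Iterable[str]) -> Counter:
    """Count how often each classification label (the text after '#') occurs."""
    counts: Counter = Counter()
    for header in headers:
        header = header.strip().lstrip(">")
        if "#" not in header:
            continue  # no classification field on this line
        fields = header.split("#", 1)[1].split()
        if fields:
            counts[fields[0]] += 1
    return counts


if __name__ == "__main__":
    example = [
        ">TE_00000558#Helitron",
        ">TE_00000712#DNA/Helitron",
        ">TE_00000147_#LTR/unknown",
    ]
    for label, n in sorted(classification_counts(example).items()):
        print(label, n)
```

Labels that occur only once or twice in a large library are good candidates for a manual check before downstream analysis.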

I know you are looking into the first issue. Do you know approximately how long the bug fix will take? Would you recommend manually changing the IDs or removing the problematic entries?

I have included my log file in case you can spot any problematic errors.

Thank you, Solomiya

########################################################
##### Extensive de-novo TE Annotator (EDTA) v1.9.7  ####
##### Shujun Ou (shujun.ou.1@gmail.com)             ####
########################################################

Wed Apr 14 00:44:21 EDT 2021    Dependency checking:
                                All passed!

Wed Apr 14 00:44:35 EDT 2021    Obtain raw TE libraries using various structure-based programs:
Wed Apr 14 00:44:35 EDT 2021    EDTA_raw: Check dependencies, prepare working directories.

Wed Apr 14 00:44:46 EDT 2021    Start to find LTR candidates.

Wed Apr 14 00:44:46 EDT 2021    Existing result file Amaranthus.fasta.mod.LTR.raw.fa found!
                                Will keep this file without rerunning this module.
                                Please specify --overwrite 1 if you want to rerun this module.

Wed Apr 14 00:44:47 EDT 2021    Finish finding LTR candidates.

Wed Apr 14 00:44:47 EDT 2021    Start to find TIR candidates.

Wed Apr 14 00:44:47 EDT 2021    Existing result file Amaranthus.fasta.mod.TIR.raw.fa found!
                                Will keep this file without rerunning this module.
                                Please specify --overwrite 1 if you want to rerun this module.

Wed Apr 14 00:44:48 EDT 2021    Finish finding TIR candidates.

Wed Apr 14 00:44:48 EDT 2021    Start to find Helitron candidates.

Wed Apr 14 00:44:48 EDT 2021    Existing result file Amaranthus.fasta.mod.Helitron.raw.fa found!
                                Will keep this file without rerunning this module.
                                Please specify --overwrite 1 if you want to rerun this module.

Wed Apr 14 00:44:49 EDT 2021    Finish finding Helitron candidates.

Wed Apr 14 00:44:49 EDT 2021    Execution of EDTA_raw.pl is finished!

Wed Apr 14 00:44:49 EDT 2021    Obtain raw TE libraries finished.
                                All intact TEs found by EDTA:
                                        Amaranthus.fasta.mod.EDTA.intact.fa
                                        Amaranthus.fasta.mod.EDTA.intact.gff3

Wed Apr 14 00:44:49 EDT 2021    Perform EDTA advance filtering for raw TE candidates and generate the stage 1 library:

Wed Apr 14 03:35:03 EDT 2021    EDTA advance filtering finished.

Wed Apr 14 03:35:03 EDT 2021    Perform EDTA final steps to generate a non-redundant comprehensive TE library:
                                Use RepeatModeler to identify any remaining TEs that are missed by structure-based methods.

2021-04-15 07:24:44,322 -WARNING- Grid computing is not available because DRMAA not configured properly: Could not find drmaa library.  Please specify its full path using the environment variable DRMAA_LIBRARY_PATH
2021-04-15 07:24:44,385 -INFO- VARS: {'sequence': 'Amaranthus.fasta.mod.RM.consensi.fa', 'hmm_database': 'rexdb', 'seq_type': 'nucl', 'prefix': 'Amaranthus.fasta.mod.RM.consensi.fa.rexdb', 'force_write_hmmscan': False, 'processors': 10, 'tmp_dir': './tmp', 'min_coverage': 20, 'max_evalue': 0.001, 'disable_pass2': False, 'pass2_rule': '80-80-80', 'no_library': False, 'no_reverse': False, 'no_cleanup': False, 'p2_identity': 80.0, 'p2_coverage': 80.0, 'p2_length': 80.0}
2021-04-15 07:24:44,385 -INFO- checking dependencies:
2021-04-15 07:24:44,447 -INFO- hmmer    3.3.1   OK
2021-04-15 07:24:44,560 -INFO- blastn   2.10.0+ OK
2021-04-15 07:24:44,561 -INFO- check database rexdb
2021-04-15 07:24:44,562 -INFO- db path: ~/.conda/envs/EDTA/lib/python3.6/site-packages/TEsorter/database
2021-04-15 07:24:44,562 -INFO- db file: REXdb_protein_database_viridiplantae_v3.0_plus_metazoa_v3.hmm
2021-04-15 07:24:44,596 -INFO- REXdb_protein_database_viridiplantae_v3.0_plus_metazoa_v3.hmm    OK
2021-04-15 07:24:44,596 -INFO- Start classifying pipeline
2021-04-15 07:24:44,942 -INFO- total 1185 sequences
2021-04-15 07:24:44,943 -INFO- translating `Amaranthus.fasta.mod.RM.consensi.fa` in six frames
~/.conda/envs/EDTA/lib/python3.6/site-packages/Bio/Seq.py:2338: BiopythonWarning: Partial codon, len(sequence) not a multiple of three. Explicitly trim the sequence or add trailing N before translation. This may become an error in future.
  BiopythonWarning,
2021-04-15 07:24:46,174 -INFO- HMM scanning against `~/.conda/envs/EDTA/lib/python3.6/site-packages/TEsorter/database/REXdb_protein_database_viridiplantae_v3.0_plus_metazoa_v3.hmm`
2021-04-15 07:24:46,258 -INFO- Creating server instance (pp-1.6.4.4)
2021-04-15 07:24:46,258 -INFO- Running on Python 3.6.12 linux
2021-04-15 07:24:47,109 -INFO- pp local server started with 10 workers
2021-04-15 07:24:47,299 -INFO- Task 0 started
2021-04-15 07:24:47,301 -INFO- Task 1 started
2021-04-15 07:24:47,302 -INFO- Task 2 started
2021-04-15 07:24:47,303 -INFO- Task 3 started
2021-04-15 07:24:47,304 -INFO- Task 4 started
2021-04-15 07:24:47,304 -INFO- Task 5 started
2021-04-15 07:24:47,306 -INFO- Task 6 started
2021-04-15 07:24:47,307 -INFO- Task 7 started
2021-04-15 07:24:47,307 -INFO- Task 8 started
2021-04-15 07:24:47,308 -INFO- Task 9 started
2021-04-15 07:25:02,595 -INFO- generating gene anntations
2021-04-15 07:25:02,719 -INFO- 61 sequences classified by HMM
2021-04-15 07:25:02,719 -INFO- see protein domain sequences in `Amaranthus.fasta.mod.RM.consensi.fa.rexdb.dom.faa` and annotation gff3 file in `Amaranthus.fasta.mod.RM.consensi.fa.rexdb.dom.gff3`
2021-04-15 07:25:02,720 -INFO- classifying the unclassified sequences by searching against the classified ones
2021-04-15 07:25:02,760 -INFO- using the 80-80-80 rule
2021-04-15 07:25:02,760 -INFO- run CMD: `makeblastdb -in ./tmp/pass1_classified.fa -dbtype nucl`
2021-04-15 07:25:02,866 -INFO- run CMD: `blastn -query ./tmp/pass1_unclassified.fa -db ./tmp/pass1_classified.fa -out ./tmp/pass1_unclassified.fa.blastout -outfmt '6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen qcovs qcovhsp sstrand' -num_threads 10`
2021-04-15 07:25:03,263 -INFO- 4 sequences classified in pass 2
2021-04-15 07:25:03,264 -INFO- total 65 sequences classified.
2021-04-15 07:25:03,264 -INFO- see classified sequences in `Amaranthus.fasta.mod.RM.consensi.fa.rexdb.cls.tsv`
2021-04-15 07:25:03,264 -INFO- writing library for RepeatMasker in `Amaranthus.fasta.mod.RM.consensi.fa.rexdb.cls.lib`
2021-04-15 07:25:03,325 -INFO- writing classified protein domains in `Amaranthus.fasta.mod.RM.consensi.fa.rexdb.cls.pep`
2021-04-15 07:25:03,329 -INFO- Summary of classifications:
Order           Superfamily      # of Sequences# of Clade Sequences    # of Clades# of full Domains
LTR             Bel-Pao                       1              0              0              0
LTR             Copia                         5              5              3              0
LTR             Gypsy                         9              9              2              1
LINE            unknown                      25              0              0              0
TIR             EnSpm_CACTA                   6              0              0              0
TIR             MuDR_Mutator                 14              0              0              0
TIR             PIF_Harbinger                 1              0              0              0
TIR             hAT                           3              0              0              0
Helitron        unknown                       1              0              0              0
2021-04-15 07:25:03,329 -INFO- Pipeline done.
2021-04-15 07:25:03,329 -INFO- cleaning the temporary directory ./tmp
                                Skipping the CDS cleaning step (--cds [File]) since no CDS file is provided or it's empty.

Thu Apr 15 10:40:56 EDT 2021    EDTA final stage finished! You may check out:
                                The final EDTA TE library: Amaranthus.fasta.mod.EDTA.TElib.fa
oushujun commented 3 years ago

Hi, thanks for the report. For the moment, you may rename these duplicated IDs manually.
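
The manual renaming could be scripted along these lines. This is only a sketch: the `_dup<N>` suffix is an arbitrary convention invented for this example, not an EDTA naming rule, and `rename_duplicates` is a hypothetical helper. It treats the text before # as the ID, keeps the first occurrence unchanged, and gives each later occurrence a unique suffix.

```python
from typing import Iterable, List, Set, Tuple


def rename_duplicates(records: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Return (header, sequence) records in which every ID before '#' is unique.

    The first record with a given ID is kept as-is; later ones receive a
    "_dup<N>" suffix (a made-up convention used only in this sketch).
    """
    seen: Set[str] = set()
    renamed: List[Tuple[str, str]] = []
    for header, seq in records:
        name, sep, classification = header.partition("#")
        new_name, n = name, 1
        while new_name in seen:
            n += 1
            # Drop a dangling "_" before appending the suffix to avoid "__dup2".
            new_name = f"{name.rstrip('_')}_dup{n}"
        seen.add(new_name)
        renamed.append((new_name + sep + classification, seq))
    return renamed


if __name__ == "__main__":
    # Toy sequences for illustration only.
    records = [
        ("TE_00000453_#LTR/unknown", "AATT"),
        ("TE_00000453_#LTR/unknown", "ATAG"),
        ("TE_00000453#Unknown", "GGCC"),
    ]
    for header, seq in rename_duplicates(records):
        print(f">{header}\n{seq}")
```

Sequences are left untouched, so the renamed library can be compared against the original entry by entry.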

Best, Shujun

oushujun commented 3 years ago

@zhangrengang I have fixed the duplicated ID issue in the newly pushed version. @Ivannahnat I have also fixed the Helitron naming issue as well as other naming issues.

Thank you all for your contributions. - Shujun

C-grapes commented 3 years ago

Hi Shujun, I have a question about the latest version of EDTA. The latest version I see on https://anaconda.org/bioconda/edta/files is 1.9.6.2. The package I downloaded is noarch/edta-1.9.6-hdfd78af_2.tar.bz2, and it reports version 1.9.6 when it runs. Where can I find version 1.9.7? I also ran into an issue similar to the one the two users above reported: my TElib.fa file contains many entries like TE00000062#LTR/unknown, and it also has the Helitron naming issue that Ivannahnat described. Has this bug been fixed in version 1.9.6?

TE00000062#LTR/unknown
TE00000104#LTR/unknown
TE00000112#LTR/unknown
TE00000116#LTR/unknown
TE00000126#LTR/unknown
TE00000132#LTR/unknown
TE00000137#LTR/unknown
TE00000141#LTR/unknown
TE00000186#LTR/unknown
TE00000192#LTR/unknown
TE00000246#LTR/unknown
TE00000303#LTR/unknown
TE00000589#LTR/unknown
TE00000650#LTR/unknown
TE00000678#LTR/unknown
TE00000767#LTR/unknown
TE00000868#LTR/unknown
TE00001026#LTR/unknown
TE00001071#LTR/unknown

########################################################

Extensive de-novo TE Annotator (EDTA) v1.9.6
Shujun Ou (shujun.ou.1@gmail.com)

########################################################

Best wishes! putao

oushujun commented 3 years ago

Hi putao,

You can pull the GitHub version directly to get the latest updates. The conda environment for v1.9.6 can be used to run the GitHub version.

Best, Shujun

C-grapes commented 3 years ago

Thanks for your reply, I will try it!