oushujun / EDTA

Extensive de-novo TE Annotator
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1905-y
GNU General Public License v3.0

program seems to stop at TIR-learner step #208

Closed: svedwards closed this issue 2 years ago

svedwards commented 3 years ago

I am running EDTA on a 1 Gb bird genome but it has not gotten past the TIR-learner step after 7 days. I am using the bioconda version 1.9.6. My command is:

singularity exec --cleanenv /cvmfs/singularity.galaxyproject.org/e/d/edta:1.9.6--hdfd78af_2 EDTA.pl --genome Catbic_final.assembly.fasta --anno 1 -t 24 --evaluate 1

My outfile after 7 days is:

########################################################

Extensive de-novo TE Annotator (EDTA) v1.9.6
Shujun Ou (shujun.ou.1@gmail.com)

########################################################

Sat Jul 24 19:12:58 UTC 2021 Dependency checking: All passed!

Sat Jul 24 19:13:52 UTC 2021 Obtain raw TE libraries using various structure-based programs:

Sat Jul 24 19:13:52 UTC 2021 EDTA_raw: Check dependencies, prepare working directories.

Sat Jul 24 19:14:11 UTC 2021 Start to find LTR candidates.

Sat Jul 24 19:14:11 UTC 2021 Identify LTR retrotransposon candidates from scratch.

Sat Jul 24 19:25:18 UTC 2021 Finish finding LTR candidates.

Sat Jul 24 19:25:18 UTC 2021 Start to find TIR candidates.

Sat Jul 24 19:25:18 UTC 2021 Identify TIR candidates from scratch.

Species: others

The last file written is in this directory and was written on Jul 24:

Catbic_final.assembly.fasta.mod.EDTA.raw/TIR/Module3_New/TIR-Learner/TIR-Learner-+-scaffold_783-+-GRFmite.fa-+-p-+-toPre.fa-+-predi.fa

Any suggestions?

Scott

oushujun commented 3 years ago

Hi Scott,

Thank you for using EDTA. From your reports, I don't see any errors. Probably there are lots of TIR candidates in your bird genome and the TIR module generally takes more time to finish. If you are in doubt of whether the program executes correctly, you may test it on one of your genome's sequences.

Good luck, Shujun
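
For anyone wanting to try the single-sequence test suggested above, here is a minimal sketch of one way to pull one record out of a multi-FASTA file using only the Python standard library. The function name `extract_record` and the output file name are hypothetical, not part of EDTA; the scaffold ID in the usage note is just the one that appears in the file name reported earlier in this thread.

```python
def extract_record(fasta_in, fasta_out, seq_id):
    """Copy the one FASTA record whose ID equals seq_id into fasta_out.

    The ID is taken to be the first whitespace-delimited token after '>'.
    Returns the number of bases copied; raises KeyError if the ID is absent.
    """
    n_bases = 0
    keep = False    # True while we are inside the wanted record
    found = False   # True once the wanted header has been seen
    with open(fasta_in) as src, open(fasta_out, "w") as dst:
        for line in src:
            if line.startswith(">"):
                if found:
                    break  # wanted record is complete, stop reading
                parts = line[1:].split()
                keep = bool(parts) and parts[0] == seq_id
                found = keep
            elif keep:
                n_bases += len(line.strip())
            if keep:
                dst.write(line)
    if not found:
        raise KeyError(f"{seq_id} not found in {fasta_in}")
    return n_bases
```

A call such as `extract_record("Catbic_final.assembly.fasta", "one_scaffold.fasta", "scaffold_783")` would write a one-record file that can then be passed to `--genome` for a quick trial run. Whether a single scaffold is representative of the whole genome is a separate question.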


svedwards commented 3 years ago

Thank you, Shujun, I guess I will just be patient and may try the program on one scaffold, for example. One possibility for future releases of EDTA is to have it report updates periodically, like once a day. Like "working on ..." or "still working on...". Just so the user knows that the program is still moving. It's easy to get suspicious if no activity is visible for many days. Thanks again!

Scott
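
Until something like that exists, a rough external check is possible. The sketch below is not an EDTA feature; it simply walks a working directory and reports how long ago the newest file was modified. The helper names `newest_file` and `hours_since_last_write` are hypothetical.

```python
import os
import time


def newest_file(root):
    """Return (path, mtime) of the most recently modified file under root,
    or None if the tree contains no files."""
    newest = None
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                mtime = os.path.getmtime(path)
            except OSError:
                continue  # file disappeared between listing and stat
            if newest is None or mtime > newest[1]:
                newest = (path, mtime)
    return newest


def hours_since_last_write(root, now=None):
    """Hours elapsed since any file under root was last modified."""
    newest = newest_file(root)
    if newest is None:
        return None
    now = time.time() if now is None else now
    return (now - newest[1]) / 3600.0
```

Pointing `hours_since_last_write` at the `Catbic_final.assembly.fasta.mod.EDTA.raw` directory would give a single number to watch. A long silence does not prove a stall, since a step may compute for hours without writing anything, but a gap of several days is a reasonable cue to look closer.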

svedwards commented 3 years ago

Strange, I just ran one-tenth of the genome (~100 Mb) and it finished in 1 hour 15 minutes. The only other change I made was to copy the source genome to the local directory (previously, despite the path above, the source genome was in another directory of another user). So, maybe there was some read-write issue before. I'm now running the full genome locally and I'll let you know what happens.

svedwards commented 3 years ago

@oushujun Hi, I think part of the problem may be a memory issue. When I increased the memory per CPU on exactly the same job, it was able to complete the TIR tasks (output below). It is strange, however, that no errors are reported when memory is too low; EDTA simply appears to be stuck. Maybe something to look into.

Output when mem-per-cpu was increased from 1000 to 5000 MB:

########################################################

Extensive de-novo TE Annotator (EDTA) v1.9.6
Shujun Ou (shujun.ou.1@gmail.com)

########################################################

Wed Aug 4 08:43:06 UTC 2021 Dependency checking: All passed!

Wed Aug 4 08:44:00 UTC 2021 Obtain raw TE libraries using various structure-based programs:

Wed Aug 4 08:44:00 UTC 2021 EDTA_raw: Check dependencies, prepare working directories.

Wed Aug 4 08:44:19 UTC 2021 Start to find LTR candidates.

Wed Aug 4 08:44:19 UTC 2021 Identify LTR retrotransposon candidates from scratch.

Wed Aug 4 08:51:58 UTC 2021 Finish finding LTR candidates.

Wed Aug 4 08:51:58 UTC 2021 Start to find TIR candidates.

Wed Aug 4 08:51:58 UTC 2021 Identify TIR candidates from scratch.

Species: others

Wed Aug 4 12:36:15 UTC 2021 Finish finding TIR candidates.

Wed Aug 4 12:36:15 UTC 2021 Start to find Helitron candidates.

Wed Aug 4 12:36:15 UTC 2021 Identify Helitron candidates from scratch.

oushujun commented 3 years ago

Hey Scott,

I didn't realize it was you! Small world! Haha. (If you still remember the gallon milk acquaintance in Providence, Evolution meeting 2018)

Back to this topic: yes, I have noticed the memory stalling issue before, but it didn't occur to me in your case, sorry. Basically, the TIR module (written in Python) requires more memory than the other modules, and I have not been able to improve it. Hopefully this issue will serve as a reminder to later comers to give their run more memory when TIR stops without errors.

Wish you all the best, Shujun
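
To put the memory change in this thread into total terms, here is a small arithmetic sketch. It assumes the job was allocated 24 CPUs to match `-t 24` and that the scheduler multiplies the per-CPU setting by the number of allocated CPUs; neither is stated explicitly above, so treat the totals as illustrative. The function name is hypothetical.

```python
def total_request_mb(mem_per_cpu_mb, cpus):
    """Total memory implied by a per-CPU request (assumes memory scales
    linearly with the number of allocated CPUs)."""
    return mem_per_cpu_mb * cpus


CPUS = 24  # assumption: one allocated CPU per EDTA thread (-t 24)

stalled = total_request_mb(1000, CPUS)   # setting under which TIR hung
finished = total_request_mb(5000, CPUS)  # setting under which TIR completed

print(f"stalled run:  {stalled} MB (~{stalled / 1000:.0f} GB)")
print(f"finished run: {finished} MB (~{finished / 1000:.0f} GB)")
```

Under those assumptions the stalled run had about 24 GB in total and the successful one about 120 GB. This does not say how much the TIR step actually needs for a 1 Gb genome, only that the requirement fell somewhere between the two.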

svedwards commented 3 years ago

Hi Shujun!

I didn't realize we had crossed paths in Providence! Seems like another world, pre-covid. It's really great you have put EDTA together. I'll send you our paper once it is completed. My analysis of the 1 Gb genome finally completed successfully. Using 24 threads it took about 64 hours. I will paste the full outfile below, but let me know if you want it in the "successful" repo.

########################################################

Extensive de-novo TE Annotator (EDTA) v1.9.6
Shujun Ou (shujun.ou.1@gmail.com)

########################################################

Wed Aug 4 08:43:06 UTC 2021 Dependency checking: All passed!

Wed Aug 4 08:44:00 UTC 2021 Obtain raw TE libraries using various structure-based programs:

Wed Aug 4 08:44:00 UTC 2021 EDTA_raw: Check dependencies, prepare working directories.

Wed Aug 4 08:44:19 UTC 2021 Start to find LTR candidates.

Wed Aug 4 08:44:19 UTC 2021 Identify LTR retrotransposon candidates from scratch.

Wed Aug 4 08:51:58 UTC 2021 Finish finding LTR candidates.

Wed Aug 4 08:51:58 UTC 2021 Start to find TIR candidates.

Wed Aug 4 08:51:58 UTC 2021 Identify TIR candidates from scratch.

Species: others

Wed Aug 4 12:36:15 UTC 2021 Finish finding TIR candidates.

Wed Aug 4 12:36:15 UTC 2021 Start to find Helitron candidates.

Wed Aug 4 12:36:15 UTC 2021 Identify Helitron candidates from scratch.

Wed Aug 4 20:06:17 UTC 2021 Finish finding Helitron candidates.

Wed Aug 4 20:06:17 UTC 2021 Execution of EDTA_raw.pl is finished!

Wed Aug 4 20:06:17 UTC 2021 Obtain raw TE libraries finished.
            All intact TEs found by EDTA:
                    Catbic_final.assembly_mem.fasta.mod.EDTA.intact.fa
                    Catbic_final.assembly_mem.fasta.mod.EDTA.intact.gff3

Wed Aug 4 20:06:17 UTC 2021 Perform EDTA advcance filtering for raw TE candidates and generate the stage 1 library:

Wed Aug 4 20:15:39 UTC 2021 EDTA advcance filtering finished.

Wed Aug 4 20:15:39 UTC 2021 Perform EDTA final steps to generate a non-redundant comprehensive TE library:

            Use RepeatModeler to identify any remaining TEs that are missed by structure-based methods.

2021-08-05 20:41:02,891 -WARNING- Grid computing is not available because DRMAA not configured properly: Could not find drmaa library. Please specify its full path using the environment variable DRMAA_LIBRARY_PATH
2021-08-05 20:41:02,924 -INFO- VARS: {'sequence': 'Catbic_final.assembly_mem.fasta.mod.RM.consensi.fa', 'hmm_database': 'rexdb', 'seq_type': 'nucl', 'prefix': 'Catbic_final.assembly_mem.fasta.mod.RM.consensi.fa.rexdb', 'force_write_hmmscan': False, 'processors': 24, 'tmp_dir': './tmp', 'min_coverage': 20, 'max_evalue': 0.001, 'disable_pass2': False, 'pass2_rule': '80-80-80', 'no_library': False, 'no_reverse': False, 'no_cleanup': False, 'p2_identity': 80.0, 'p2_coverage': 80.0, 'p2_length': 80.0}
2021-08-05 20:41:02,924 -INFO- checking dependencies:
2021-08-05 20:41:03,646 -INFO- hmmer 3.3.2 OK
2021-08-05 20:41:03,713 -INFO- blastn 2.10.0+ OK
2021-08-05 20:41:03,714 -INFO- check database rexdb
2021-08-05 20:41:03,714 -INFO- db path: /usr/local/lib/python3.6/site-packages/TEsorter/database
2021-08-05 20:41:03,714 -INFO- db file: REXdb_protein_database_viridiplantae_v3.0_plus_metazoa_v3.hmm
2021-08-05 20:41:03,715 -INFO- REXdb_protein_database_viridiplantae_v3.0_plus_metazoa_v3.hmm OK
2021-08-05 20:41:03,715 -INFO- Start classifying pipeline
2021-08-05 20:41:03,760 -INFO- total 330 sequences
2021-08-05 20:41:03,760 -INFO- translating Catbic_final.assembly_mem.fasta.mod.RM.consensi.fa in six frames
/usr/local/lib/python3.6/site-packages/Bio/Seq.py:2338: BiopythonWarning: Partial codon, len(sequence) not a multiple of three. Explicitly trim the sequence or add trailing N before translation. This may become an error in future.
  BiopythonWarning,
2021-08-05 20:41:04,051 -INFO- HMM scanning against /usr/local/lib/python3.6/site-packages/TEsorter/database/REXdb_protein_database_viridiplantae_v3.0_plus_metazoa_v3.hmm
2021-08-05 20:41:04,089 -INFO- Creating server instance (pp-1.6.4.4)
2021-08-05 20:41:04,089 -INFO- Running on Python 3.6.13 linux
2021-08-05 20:41:08,463 -INFO- pp local server started with 24 workers
2021-08-05 20:41:08,496 -INFO- Task 0 started
2021-08-05 20:41:08,497 -INFO- Task 1 started
2021-08-05 20:41:08,497 -INFO- Task 2 started
2021-08-05 20:41:08,498 -INFO- Task 3 started
2021-08-05 20:41:08,498 -INFO- Task 4 started
2021-08-05 20:41:08,498 -INFO- Task 5 started
2021-08-05 20:41:08,499 -INFO- Task 6 started
2021-08-05 20:41:08,499 -INFO- Task 7 started
2021-08-05 20:41:08,500 -INFO- Task 8 started
2021-08-05 20:41:08,500 -INFO- Task 9 started
2021-08-05 20:41:08,501 -INFO- Task 10 started
2021-08-05 20:41:08,501 -INFO- Task 11 started
2021-08-05 20:41:08,502 -INFO- Task 12 started
2021-08-05 20:41:08,502 -INFO- Task 13 started
2021-08-05 20:41:08,502 -INFO- Task 14 started
2021-08-05 20:41:08,503 -INFO- Task 15 started
2021-08-05 20:41:08,503 -INFO- Task 16 started
2021-08-05 20:41:08,503 -INFO- Task 17 started
2021-08-05 20:41:08,504 -INFO- Task 18 started
2021-08-05 20:41:08,504 -INFO- Task 19 started
2021-08-05 20:41:08,504 -INFO- Task 20 started
2021-08-05 20:41:08,505 -INFO- Task 21 started
2021-08-05 20:41:08,505 -INFO- Task 22 started
2021-08-05 20:41:08,524 -INFO- Task 23 started
2021-08-05 20:41:10,317 -INFO- generating gene anntations
2021-08-05 20:41:10,358 -INFO- 19 sequences classified by HMM
2021-08-05 20:41:10,358 -INFO- see protein domain sequences in Catbic_final.assembly_mem.fasta.mod.RM.consensi.fa.rexdb.dom.faa and annotation gff3 file in Catbic_final.assembly_mem.fasta.mod.RM.consensi.fa.rexdb.dom.gff3
2021-08-05 20:41:10,358 -INFO- classifying the unclassified sequences by searching against the classified ones
2021-08-05 20:41:10,368 -INFO- using the 80-80-80 rule
2021-08-05 20:41:10,368 -INFO- run CMD: makeblastdb -in ./tmp/pass1_classified.fa -dbtype nucl
2021-08-05 20:41:10,496 -INFO- run CMD: blastn -query ./tmp/pass1_unclassified.fa -db ./tmp/pass1_classified.fa -out ./tmp/pass1_unclassified.fa.blastout -outfmt '6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen qcovs qcovhsp sstrand' -num_threads 24
2021-08-05 20:41:10,727 -INFO- 0 sequences classified in pass 2
2021-08-05 20:41:10,728 -INFO- total 19 sequences classified.
2021-08-05 20:41:10,728 -INFO- see classified sequences in Catbic_final.assembly_mem.fasta.mod.RM.consensi.fa.rexdb.cls.tsv
2021-08-05 20:41:10,728 -INFO- writing library for RepeatMasker in Catbic_final.assembly_mem.fasta.mod.RM.consensi.fa.rexdb.cls.lib
2021-08-05 20:41:10,741 -INFO- writing classified protein domains in Catbic_final.assembly_mem.fasta.mod.RM.consensi.fa.rexdb.cls.pep
2021-08-05 20:41:10,744 -INFO- Summary of classifications:
Order   Superfamily   # of Sequences   # of Clade Sequences   # of Clades   # of full Domains
LTR     Retrovirus    9                0                      0             0
LTR     mixture       1                0                      0             0
LINE    unknown       9                0                      0             0
2021-08-05 20:41:10,744 -INFO- Pipeline done.
2021-08-05 20:41:10,744 -INFO- cleaning the temporary directory ./tmp
Skipping the CDS cleaning step (--cds [File]) since no CDS file is provided or it's empty.

Thu Aug 5 20:49:56 UTC 2021 EDTA final stage finished! You may check out:
            The final EDTA TE library: Catbic_final.assembly_mem.fasta.mod.EDTA.TElib.fa

Thu Aug 5 20:49:56 UTC 2021 Perform post-EDTA analysis for whole-genome annotation:

Thu Aug 5 20:49:56 UTC 2021 Homology-based annotation of TEs using Catbic_final.assembly_mem.fasta.mod.EDTA.TElib.fa from scratch.

Thu Aug 5 21:20:18 UTC 2021 TE annotation using the EDTA library has finished! Check out:
            Whole-genome TE annotation (total TE: 7.62%): Catbic_final.assembly_mem.fasta.mod.EDTA.TEanno.gff3
            Whole-genome TE annotation summary: Catbic_final.assembly_mem.fasta.mod.EDTA.TEanno.sum
            Low-threshold TE masking for MAKER gene annotation (masked: 7.24%): Catbic_final.assembly_mem.fasta.mod.MAKER.masked

Thu Aug 5 21:20:19 UTC 2021 Evaluate the level of inconsistency for whole-genome TE annotation (slow step):

Sat Aug 7 01:11:41 UTC 2021 Evaluation of TE annotation finished! Check out these files:

            Overall: Catbic_final.assembly_mem.fasta.mod.EDTA.TE.fa.stat.all.sum
            Nested: Catbic_final.assembly_mem.fasta.mod.EDTA.TE.fa.stat.nested.sum
            Non-nested: Catbic_final.assembly_mem.fasta.mod.EDTA.TE.fa.stat.redun.sum
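
The "about 64 hours" figure can be checked against the timestamps in the outfile above. The following is a minimal sketch, assuming each stage line begins with a stamp of the form `Wed Aug 4 08:43:06 UTC 2021` as in this log; lines without such a stamp are ignored. The names `parse_stamp` and `elapsed_hours` are hypothetical.

```python
import re
from datetime import datetime

# Leading stamp such as "Wed Aug 4 08:43:06 UTC 2021"
STAMP = re.compile(r"^(\w{3} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2}) UTC (\d{4})\b")


def parse_stamp(line):
    """Return the leading timestamp of a log line, or None if it has none."""
    m = STAMP.match(line.strip())
    if not m:
        return None
    return datetime.strptime(f"{m.group(1)} {m.group(2)}", "%a %b %d %H:%M:%S %Y")


def elapsed_hours(lines):
    """Hours between the first and last stamped lines."""
    stamps = [s for s in map(parse_stamp, lines) if s is not None]
    if len(stamps) < 2:
        raise ValueError("need at least two timestamped lines")
    return (stamps[-1] - stamps[0]).total_seconds() / 3600.0


log = [
    "Wed Aug 4 08:43:06 UTC 2021 Dependency checking: All passed!",
    "Species: others",
    "Sat Aug 7 01:11:41 UTC 2021 Evaluation of TE annotation finished! Check out these files:",
]
print(f"{elapsed_hours(log):.1f} hours")
```

Applied to consecutive stage lines, the same helper shows where the time went: in this run the TIR step took under 4 hours once memory was raised, while the `--evaluate 1` step (Thu Aug 5 21:20:19 to Sat Aug 7 01:11:41) accounts for roughly 28 of the 64 hours.
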
oushujun commented 3 years ago

Hi Scott,

Yes, it feels like a different world. I am not sure if I have the confidence to go to conferences next year. I am glad you finished the bird genome annotation. And I did add a protip in the Readme to remind others about the memory trick. Looks like it finished without errors, but having only 7.62% of the genome annotated as TEs seems pretty low to me, although I barely know any bird genome biology. You may want to try different methods if this also seems low to you.

Best, Shujun