oushujun / EDTA

Extensive de-novo TE Annotator
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1905-y
GNU General Public License v3.0

PanEDTA Line Detection #424

Open sjteresi opened 4 months ago

sjteresi commented 4 months ago

Hello Shujun,

Hope you are doing well. I am writing to share an issue I had with LINE detection in panEDTA, in the hope that it helps anyone else who runs into it. It is not a bug, just something that I think folks could easily overlook. When I ran panEDTA (v2.1.0) on its own, without pre-calculating results with regular EDTA, it did not find any LINE elements in my genomes.

After doing some testing, I think this is because the panEDTA script by default calls EDTA.pl without the --sensitive 1 option. The sensitive option calls RepeatModeler. I also observed that when I ran regular EDTA.pl on a genome without the sensitive option, it did not recover any LINEs. So, to summarize, it seems that RepeatModeler was doing the heavy lifting for LINE detection in my strawberry genomes, and without it I wasn't detecting any LINEs. Jordan B, a post-doc in Pat's lab, also had this same LINE issue with some Camelina genomes.

In my case, I fixed the issue by running EDTA individually on each genome with the --sensitive 1 option, and then completing the pangenome annotation with panEDTA. That approach worked fine: LINEs were indeed included in my final annotation.

This problem only arises if users decide to use panEDTA to perform all steps of their pangenome annotation. It can easily be sidestepped if users create the individual annotations with the --sensitive 1 option first.
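For anyone who wants to script the per-genome step, here is a minimal sketch of how the commands could be assembled. The genome file names, the thread count, and the helper name `build_edta_commands` are made up for illustration; only `--genome`, `--sensitive` and `--threads` are passed, and any other options your run needs (CDS, annotation, species) are left out. The commands are only printed, not executed.

```python
from pathlib import Path


def build_edta_commands(genomes, threads=10, edta="EDTA.pl"):
    """Return one EDTA.pl argument list per genome (hypothetical helper).

    Nothing is executed here; the lists can be handed to subprocess.run
    or written into a job script once they look right.
    """
    commands = []
    for genome in genomes:
        commands.append([
            "perl", edta,
            "--genome", str(Path(genome)),
            "--sensitive", "1",          # enables the RepeatModeler step
            "--threads", str(threads),
        ])
    return commands


# Placeholder genome names, not files from this thread.
commands = build_edta_commands(["genomeA.fa", "genomeB.fa"], threads=8)
for cmd in commands:
    print(" ".join(cmd))
```

Once every genome has its own EDTA results, panEDTA can be run on top of them as described above.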

Sincerely, Scott Teresi

oushujun commented 4 months ago

Hi Scott,

Thank you for reporting this! Can you please update EDTA to 2.2.0 and test panEDTA again? There are many big changes in the new version for improved SINE/LINE annotations.

Thanks, Shujun

oushujun commented 2 months ago

Any luck?

Shujun

sjteresi commented 2 months ago

My apologies for the delayed response, Shujun. I will update EDTA this weekend or early next week and follow up.

sjteresi commented 2 months ago

Hi Shujun,

I am actually having additional trouble with EDTA 2.2.0 now that I have updated. I ran into many errors getting conda to resolve dependencies during installation, so I elected to use Singularity. That installation worked, but now when I run the genomes I get LINE result files with 0 bp, and the TIR detection fails without crashing the run. Here is a sample output:

Mon Apr 15 15:59:43 EDT 2024    Start to find LINE candidates.

Mon Apr 15 15:59:43 EDT 2024    Identify LINE retrotransposon candidates from scratch.

Tue Apr 16 07:25:21 EDT 2024    Warning: The LINE result file has 0 bp! 

Tue Apr 16 07:25:21 EDT 2024    Start to find TIR candidates.

Tue Apr 16 07:25:21 EDT 2024    Identify TIR candidates from scratch.

Species: others
find: ./TIR-Learner-+-TIRvish.gff3: No such file or directory
Traceback (most recent call last):
  File "/mnt/ufs18/rs-004/edgerpat_lab/EDTA/bin/TIR-Learner3.0/TIR-Learner3.0.py", line 80, in <module>
    TIRLearner_instance = TIRLearner(genome_file, genome_name, species, TIR_length,

The test that you included in your README mostly works. The dependencies check out, but I get a similar set of warnings:

Tue Apr 16 12:02:56 EDT 2024    Start to find LTR candidates.

Tue Apr 16 12:02:56 EDT 2024    Identify LTR retrotransposon candidates from scratch.

Warning: LOC list genome.fa.mod.ltrTE.veryfalse is empty.
Tue Apr 16 12:03:25 EDT 2024    Finish finding LTR candidates.

Tue Apr 16 12:03:25 EDT 2024    Start to find SINE candidates.

Tue Apr 16 12:04:07 EDT 2024    Warning: The SINE result file has 0 bp!

Tue Apr 16 12:04:07 EDT 2024    Start to find LINE candidates.

Tue Apr 16 12:04:07 EDT 2024    Identify LINE retrotransposon candidates from scratch.

Tue Apr 16 12:05:15 EDT 2024    Warning: The LINE result file has 0 bp!

Tue Apr 16 12:05:15 EDT 2024    Start to find TIR candidates.

Tue Apr 16 12:05:15 EDT 2024    Identify TIR candidates from scratch.

Species: others
Traceback (most recent call last):
  File "/mnt/ufs18/rs-004/edgerpat_lab/EDTA/bin/TIR-Learner3.0/TIR-Learner3.0.py", line 80, in <module>
    TIRLearner_instance = TIRLearner(genome_file, genome_name, species, TIR_length,
  File "/mnt/ufs18/rs-004/edgerpat_lab/EDTA/bin/TIR-Learner3.0/bin/main.py", line 81, in __init__
    self.execute()
  File "/mnt/ufs18/rs-004/edgerpat_lab/EDTA/bin/TIR-Learner3.0/bin/main.py", line 121, in execute
    self.execute_M4()
  File "/mnt/ufs18/rs-004/edgerpat_lab/EDTA/bin/TIR-Learner3.0/bin/main.py", line 672, in execute_M4
    self["base"] = CNN_predict.execute(self)
  File "/mnt/ufs18/rs-004/edgerpat_lab/EDTA/bin/TIR-Learner3.0/bin/CNN_predict.py", line 114, in execute
    df = predict(df, TIRLearner_instance.genome_file_path,
  File "/mnt/ufs18/rs-004/edgerpat_lab/EDTA/bin/TIR-Learner3.0/bin/CNN_predict.py", line 62, in predict
    model = load_model(path_to_model)
  File "/usr/local/lib/python3.10/site-packages/keras/src/saving/saving_api.py", line 262, in load_model
    return legacy_sm_saving_lib.load_model(
  File "/usr/local/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/usr/local/lib/python3.10/site-packages/tensorflow/python/framework/function_def_to_graph.py", line 278, in function_def_to_graph_def
    input_shape = input_shape.as_proto()
AttributeError: as_proto
Can't open ./TIR-Learner-Result/TIR-Learner_FinalAnn.fa: No such file or directory at /mnt/ufs18/rs-004/edgerpat_lab/EDTA/util/rename_tirlearner.pl line 19.
Warning: LOC list genome.fa.mod.TIR.ext30.list is empty.

Error: Error while loading sequence
Filter sequence based on TEsorter classifications. Unclassified sequences will also be output to the clean file.
    Usage: perl cleanup_misclas.pl sequence.fa.rexdb.cls.tsv
    Author: Shujun Ou (shujun.ou.1@gmail.com) 10/11/2019

mv: cannot stat 'genome.fa.mod.TIR.ext30.fa.pass.fa.dusted.cln.cln': No such file or directory
cp: cannot stat 'genome.fa.mod.TIR.ext30.fa.pass.fa.dusted.cln.cln.list': No such file or directory
cp: cannot stat 'genome.fa.mod.TIR.intact.raw.fa.anno.list': No such file or directory
Can't open ./TIR-Learner-Result/TIR-Learner_FinalAnn.gff3: No such file or directory.
Warning: The TIR result file has 0 bp!

Tue Apr 16 12:05:31 EDT 2024    Start to find Helitron candidates.

Tue Apr 16 12:05:31 EDT 2024    Identify Helitron candidates from scratch.
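In case it is useful for checking long runs, here is a small sketch that pulls the "result file has 0 bp" warnings out of a log like the one above. The regular expression is based only on the warning lines pasted in this comment, so the wording may differ in other EDTA versions; the function name `empty_te_classes` is made up.

```python
import re

# Pattern copied from the warning lines in the pasted log; this is an
# assumption about the message format, not something taken from EDTA's code.
ZERO_BP = re.compile(r"Warning: The (\w+) result file has 0 bp!")


def empty_te_classes(log_text):
    """Return the TE classes reported as empty, in order of first appearance."""
    seen = []
    for match in ZERO_BP.finditer(log_text):
        name = match.group(1)
        if name not in seen:
            seen.append(name)
    return seen


sample_log = """\
Tue Apr 16 12:04:07 EDT 2024    Warning: The SINE result file has 0 bp!
Tue Apr 16 12:05:15 EDT 2024    Warning: The LINE result file has 0 bp!
Warning: The TIR result file has 0 bp!
Tue Apr 16 12:05:31 EDT 2024    Start to find Helitron candidates.
"""

print(empty_te_classes(sample_log))
```

An empty list means no such warning was found in the text; it does not by itself show that the run succeeded.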

sjteresi commented 2 months ago

Currently re-trying with a fresh Anaconda installation and conda environment.

oushujun commented 2 months ago

The yml file should be helpful for the conda installation. I don't think the Singularity version is working at the moment.

Shujun

sjteresi commented 2 months ago

Hi Shujun,

I got the latest version of EDTA to complete the system test, and it is still running on my genomes. I will report back. For this latest upgrade I had to install bedtools and samtools on top of the conda environment. I did not see them specified in the yml file, and I was having trouble making the basic install work. Perhaps I am wrong and messed up the install, or maybe they were pre-loaded on your computing cluster and so were missed. Either way, I hope this helps!
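A quick way to check whether the two tools are visible inside the activated environment is sketched below. It only looks for executables named `bedtools` and `samtools` on the PATH and says nothing about versions; the helper name `missing_tools` is made up.

```python
import shutil


def missing_tools(tools, which=shutil.which):
    """Return the tools that cannot be found on PATH.

    `which` is injectable so the check can be exercised without the tools
    actually being installed.
    """
    return [tool for tool in tools if which(tool) is None]


missing = missing_tools(["bedtools", "samtools"])
if missing:
    print("Not found on PATH: " + ", ".join(missing))
else:
    print("bedtools and samtools are both on PATH")
```

Run it with the EDTA conda environment active, since the result depends on the PATH of the current shell.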