qjiangzhao / TEtrimmer

TEtrimmer: a novel tool to automate manual curation of transposable elements
GNU General Public License v3.0

Issue with test set #43

Closed pedroh3ringer closed 2 weeks ago

pedroh3ringer commented 2 months ago

Hi,

I tried to run TEtrimmer with the test set, using the command:

TEtrimmer --input_file test_input.fa --genome_file test_genome.fasta --output_dir test_output --num_threads 20 --classify_all

This generated the expected output directories. However, the directories within 'TEtrimmer_for_proof_curation' are empty, and the output message was:

TE Trimmer is modifying sequence names; any occurrence of '/', '-', ':', '...', '|' and empty spaces before '#' will be converted to '_'. You can find the original and modified names in the 'Sequence_name_mapping.txt' file in the output directory.

TEtrimmer detected instances of '#' in your input FASTA sequence headers. The string before '#' is denoted as the seq_name, and the string after '#' is denoted as the TE type.

Finish to generate single sequence files.

8 sequences are detected from the input file Progress: |--------------------------------------------------| 0/8 = 0.0% Complete

rnd_6_family_3291 is skipped due to blast hit number is 0

7 sequences have not been analysed. In the analysed sequences 1 are skipped. Note: not all skipped sequences can have TE Aid plot in the 'TEtrimmer_for_proof_curation' folder. In the analysed sequences 0 are identified as low copy TE.

You might find the reasons why some sequences were not analysed from the 'error_file.txt' in the 'Multiple_sequence_alignment' directory.

Less than 30% TE are classified, TEtrimmer won't classify 'Unknown' TE by classified TE.

TEtrimmer is removing sequence duplications. This might take long time when many sequencesare included into the final consensus library. Please be patient!

cd-hit-est failed for TEtrimmer_consensus.fasta with error code 1

Fatal Error: Failed to open the database file Program halted !!

The final CD-HIT-EST merge step cannot be performed. Final TE consensus library redundancy can be higher but the sensitivity is not affected. You can remove duplicated sequence by yourself.

You can choose to ignore CD-HIT-EST error. For traceback output, please refer to 'error_file.txt' in the 'Multiple_sequence_alignment' directory.

TEtrimmer is clustering TE consensus library. This can potentially take long time when many sequences exist in the consensus library. Please be patient!

Final clustering of proof curation files failed with error local variable 'sequence_info' referenced before assignment

Traceback (most recent call last):
  File "/home/pedro/miniconda3/envs/TEtrimmer/share/tetrimmer/TEtrimmer.py", line 533, in main
    sequence_info, perfect_proof, good_proof, intermediate_proof, need_check_proof)
UnboundLocalError: local variable 'sequence_info' referenced before assignment

This does not affect the final TE consensus sequences. But this can heavily complicate the TE proof curation. If you don't plan to do proof curation, you can choose to ignore this error.

Progress: |██████--------------------------------------------| 1/8 = 12.5% Complete

This message is somewhat similar to the one reported in issue #27 (https://github.com/qjiangzhao/TEtrimmer/issues/27). However, in my case the problem occurred with the test set rather than the actual data I want to analyze, so I think the cause could be different. Thanks in advance for your help!

qjiangzhao commented 2 months ago

Hi @pedroh3ringer,

You can try deleting the BLASTN database for the test genome and running it again. If the error persists, please send me your entire test output folder and I will take another look for potential problems.

Yours sincerely Jiangzhao

pedroh3ringer commented 2 months ago

Hi Jiangzhao,

Thanks a lot for the quick response! I deleted the BLASTN database for the test genome and tried again, but unfortunately I got the same message and output as above. I've attached the entire test output folder to this message.

Best, Pedro

test_output1.tar.gz

qjiangzhao commented 2 months ago

Hi @pedroh3ringer:

It seems the error is caused by the Python package "pypdf2". You can try to solve this by running mamba install conda-forge::pypdf2 or mamba update pypdf2 in your terminal.

If that doesn't help, please download the new release version from the TEtrimmer GitHub repository and run it again. The new version should provide more error information.

Yours sincerely Jiangzhao

bricoletc commented 2 weeks ago

I had the same issue @qjiangzhao @pedroh3ringer, and tracked it down to PyPDF indeed:

Traceback (most recent call last):
  File "/home/adminbrice/Softs/miniforge3/envs/tetrimmer/share/tetrimmer/boundarycrop.py", line 841, in find_boundary_and_crop
    scale_dotplot_pdf = scale_single_page_pdf(dotplot_pdf, f"{dotplot_pdf}_su.pdf", scale_ratio=2)
  File "/home/adminbrice/Softs/miniforge3/envs/tetrimmer/share/tetrimmer/functions.py", line 1988, in scale_single_page_pdf
    pdf_reader = PdfFileReader(input_pdf_path)
  File "/home/adminbrice/Softs/miniforge3/envs/tetrimmer/lib/python3.10/site-packages/PyPDF2/_reader.py", line 1974, in __init__
    deprecation_with_replacement("PdfFileReader", "PdfReader", "3.0.0")
  File "/home/adminbrice/Softs/miniforge3/envs/tetrimmer/lib/python3.10/site-packages/PyPDF2/_utils.py", line 369, in deprecation_with_replacement
    deprecation(DEPR_MSG_HAPPENED.format(old_name, removed_in, new_name))
  File "/home/adminbrice/Softs/miniforge3/envs/tetrimmer/lib/python3.10/site-packages/PyPDF2/_utils.py", line 351, in deprecation
    raise DeprecationError(msg)
PyPDF2.errors.DeprecationError: PdfFileReader is deprecated and was removed in PyPDF2 3.0.0. Use PdfReader instead.

The run on the test data worked after pip install 'PyPDF2<3.0'.

In the process of debugging this I spotted an issue here: https://github.com/qjiangzhao/TEtrimmer/blob/314a9e86fc504398c343d6f49502c2a8fc648299/tetrimmer/boundarycrop.py#L843-L849

Line 848 refers to a variable e that does not exist; you need except Exception as e on line 843. Because the reference to e itself raises an exception, the failing part is never skipped gracefully, and the code never reaches line 1088, where final_con_file is written out. That file is the missing TEtrimmer_consensus.fasta required by cd-hit.
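To illustrate the point about the unbound e, here is a minimal sketch. It is not the actual TEtrimmer code: the function names and the scale callable are hypothetical stand-ins for the scaling step and its error handler.

```python
def scale_or_skip_buggy(scale):
    """Hypothetical handler that forgets to bind the exception to a name."""
    try:
        return scale()
    except Exception:
        # `e` was never bound, so formatting this message raises NameError
        # and the "skip on failure" path is never completed.
        print(f"Scaling failed: {e}")
        return None


def scale_or_skip_fixed(scale):
    """Same handler with the exception bound via `as e`."""
    try:
        return scale()
    except Exception as e:
        # The error is reported and the caller can carry on with None.
        print(f"Scaling failed: {e}")
        return None
```

With the fixed version a failing scaling step returns None, so the rest of the function can still run. With the buggy version the handler itself raises, which matches the behaviour described above.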

Best ;)

qjiangzhao commented 2 weeks ago

Dear @bricoletc

Thanks for your feedback. Then I will close this issue.

Many thanks for your debugging. I have modified the code and will push it along with the next main update.

Yours sincerely Jiangzhao

bricoletc commented 2 weeks ago

Good news! I guess it's worth pinning pypdf on bioconda? https://github.com/bioconda/bioconda-recipes/blob/master/recipes/tetrimmer/meta.yaml Or updating the call to it in your code, to avoid this issue altogether!
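For the second option, one possible approach is a small compatibility lookup that prefers the new class name and falls back to the old one. This is only a sketch: resolve_reader_class is a hypothetical helper, and the two SimpleNamespace objects are stand-ins for the PyPDF2 module so the example runs without PyPDF2 installed. The only names taken from the traceback above are PdfFileReader and PdfReader.

```python
from types import SimpleNamespace


def resolve_reader_class(pdf_module):
    """Return the reader class from a PyPDF2-like module.

    Prefers `PdfReader` (the replacement named in the deprecation error)
    and falls back to `PdfFileReader` when only the old name exists.
    """
    reader = getattr(pdf_module, "PdfReader", None)
    if reader is None:
        reader = getattr(pdf_module, "PdfFileReader")
    return reader


# Stand-ins for an old and a new module layout (assumptions for the demo).
old_api = SimpleNamespace(PdfFileReader="old-reader")
new_api = SimpleNamespace(PdfReader="new-reader", PdfFileReader="removed")

print(resolve_reader_class(old_api))
print(resolve_reader_class(new_api))
```

In real code you would pass the imported PyPDF2 module itself, then call the returned class with the input PDF path as scale_single_page_pdf does today.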

qjiangzhao commented 2 weeks ago

Yes, many thanks again. We will update the TEtrimmer Conda package next month and will do that!