merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0

[BUG] Error when writing output files with anvi-dereplicate-genomes and pyANI #2339

Closed uguyet closed 1 month ago

uguyet commented 2 months ago

Short description of the problem

anvi-dereplicate-genomes with pyANI fails with an error when it tries to write its output files.

anvi'o version

v8-dev and v8

System info

OS: Ubuntu 22.04.4 LTS. anvi'o was installed using conda.

Detailed description of the issue

I launched the following command:

anvi-dereplicate-genomes --fasta-text-file fasta_path_pyANI.tab --program pyANI --min-alignment-fraction 0.25 --similarity-threshold 0.98 -o output_pyANI/ -T 3

and got the following output:

pyANI similarity metric ......................: calculated
Number of genomes considered .................: 31
Number of redundant genomes ..................: 11
Final number of dereplicated genomes .........: 20
Traceback (most recent call last):
  File "/home/uguyet/github/anvio/bin/anvi-dereplicate-genomes", line 118, in <module>
    derep.report()
  File "/home/uguyet/github/anvio/anvio/genomesimilarity.py", line 304, in report
    self.populate_genomes_dir()
  File "/home/uguyet/github/anvio/anvio/genomesimilarity.py", line 324, in populate_genomes_dir
    shutil.copy(src = temp_path, dst = output_path)
  File "/home/uguyet/miniconda3/envs/anvio-dev/lib/python3.10/shutil.py", line 417, in copy
    copyfile(src, dst, follow_symlinks=follow_symlinks)
  File "/home/uguyet/miniconda3/envs/anvio-dev/lib/python3.10/shutil.py", line 256, in copyfile
    with open(dst, 'wb') as fdst:
FileNotFoundError: [Errno 2] No such file or directory: 'output_pyANI/GENOMES/input_pyANI/SAM1989335.fa.fa'
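
The traceback ends in `open(dst, 'wb')`, which creates the file itself but not its parent directory, so the error means that `output_pyANI/GENOMES/input_pyANI/` did not exist when the copy ran. One possible explanation, which is an assumption and is not confirmed anywhere in this thread, is that the genome name carried a directory component (`input_pyANI/SAM1989335.fa`) that was joined onto the `GENOMES` directory. The following minimal sketch reproduces that failure mode with the standard library only; `copy_into` and the genome name are hypothetical and are not anvi'o code:

```python
import os
import shutil
import tempfile


def copy_into(src, dst, create_parents=False):
    """Copy src to dst, optionally creating dst's parent directory first."""
    if create_parents:
        os.makedirs(os.path.dirname(dst), exist_ok=True)
    shutil.copy(src=src, dst=dst)


with tempfile.TemporaryDirectory() as tmp:
    # A tiny stand-in for the temporary FASTA file.
    src = os.path.join(tmp, "SAM1989335.fa")
    with open(src, "w") as handle:
        handle.write(">contig_1\nACGT\n")

    # The output directory exists, as it would after the tool creates it.
    genomes_dir = os.path.join(tmp, "output_pyANI", "GENOMES")
    os.makedirs(genomes_dir)

    # ASSUMPTION: a genome name that contains a directory component.
    name = "input_pyANI/SAM1989335.fa"
    dst = os.path.join(genomes_dir, name + ".fa")

    try:
        copy_into(src, dst)
        failed = False
    except FileNotFoundError:
        # GENOMES/input_pyANI/ was never created, so opening dst fails.
        failed = True

    # Creating the parent directory first lets the same copy succeed.
    copy_into(src, dst, create_parents=True)
    copied = os.path.exists(dst)
```

If this is what happened, it would also be consistent with the test genomes named `genome_01`, `genome_02`, and `genome_03` not triggering the error, since those names contain no path separator.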

meren commented 1 month ago

Dear @uguyet,

Sorry about this inconvenience, and thank you for your patience. I only had a chance to look at it today, and tried to reproduce it using these genomes.

Things worked well for me with the current anvio-dev as you can see here:

$  ~/Downloads/genomes >>> anvi-dereplicate-genomes --fasta-text fasta.txt --program pyANI --min-alignment-fraction 0.25 --similarity-threshold 0.98 -o output_pyANI/ -T 3
Run mode .....................................: pyANI

CITATION
===============================================
Anvi'o will use 'PyANI' by Pritchard et al. (DOI: 10.1039/C5AY02550H) to compute
ANI. If you publish your findings, please do not forget to properly credit their
work.

[PyANI] Num threads to use ...................: 3
[PyANI] Alignment method .....................: ANIb
[PyANI] Log file path ........................: /var/folders/_1/yvhyjg5j1wl09t0cx345j4hd87vf3t/T/tmpomsn7y8o

WARNING
===============================================
THIS IS VERY IMPORTANT! You asked anvi'o to remove any hits between two genomes
if they had a full percent identity less than '0.20'. Anvi'o found 4 such
instances between the pairwise comparisons of your 3 genomes, and is about to
set all ANI scores between these instances to 0. For instance, one of your
genomes, 'genome_01', had a full percentage identity of 0.029 relative to
'genome_03', another one of your genomes, which is below your threshold, and so
the ANI scores will be ignored (set to 0) for all downstream reports you will
find in anvi'o tables and visualizations. Anvi'o kindly invites you to carefully
think about potential implications of discarding hits based on an arbitrary
alignment fraction, but does not judge you because it is not perfect either.

WARNING
===============================================
THIS IS VERY IMPORTANT! You asked anvi'o to remove any hits between two genomes
if the hit was produced by a weak alignment (which you defined as alignment
fraction less than '0.25'). Anvi'o found 4 such instances between the pairwise
comparisons of your 3 genomes, and is about to set all ANI scores between these
instances to 0. For instance, one of your genomes, 'genome_01', was 0.708
identical to 'genome_03', another one of your genomes, but the aligned fraction
of genome_01 to genome_03 was only 0.041 and was below your threshold, and so
the ANI scores will be ignored (set to 0) for all downstream reports you will
find in anvi'o tables and visualizations. Anvi'o kindly invites you to carefully
think about potential implications of discarding hits based on an arbitrary
alignment fraction, but does not judge you because it is not perfect either.

pyANI similarity metric ......................: calculated
Number of genomes considered .................: 3
Number of redundant genomes ..................: 1
Final number of dereplicated genomes .........: 2

ANI RESULTS
===============================================
* Matrix and clustering of 'alignment coverage' written to output directory
* Matrix and clustering of 'alignment lengths' written to output directory
* Matrix and clustering of 'hadamard' written to output directory
* Matrix and clustering of 'percentage identity' written to output directory
* Matrix and clustering of 'similarity errors' written to output directory
* Matrix and clustering of 'full percentage identity' written to output directory

* Cleaning up the temp directory (you can use `--debug` if you would like to keep
  it for testing purposes)

$ ~/Downloads/genomes >>> cat output_pyANI/CLUSTER_REPORT.txt
cluster         size  representative  genomes
cluster_000001  1     genome_01       genome_01
cluster_000002  2     genome_03       genome_02,genome_03

I am wondering if this is due to some Linux-specific issue.

@ahenoch, @metehaansever, since I know you're using Linux -- can either of you please download these genomes and run the following commands to see if you get the same error @uguyet got?

tar -zxvf genomes.tar.gz
cd genomes/
anvi-dereplicate-genomes --fasta-text fasta.txt \
                         --program pyANI \
                         --min-alignment-fraction 0.25 \
                         --similarity-threshold 0.98 \
                         -o output_pyANI/ \
                         -T 3

Thank you!

metehaansever commented 1 month ago

Hi @meren, I just ran it and it works successfully for me on my Ubuntu machine.

meren commented 1 month ago

Thanks, @metehaansever.

meren commented 1 month ago

My additional attempts to reproduce this failed here :( Closing it now with the hope that @uguyet will come back to us if the problem continues or is still relevant.
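
For anyone who runs into the same traceback, one quick thing to rule out is whether the names in the file passed to `--fasta-text-file` look like file paths. The sketch below is only a diagnostic idea, not part of anvi'o: it assumes the file is tab-separated with `name` and `path` columns, the helper `suspicious_names` is hypothetical, and the example rows are made up rather than taken from the reporter's `fasta_path_pyANI.tab`.

```python
import csv
import io


def suspicious_names(fasta_txt_handle):
    """Return names that contain a path separator or end in a FASTA extension."""
    # ASSUMPTION: a tab-separated file with a header row naming a 'name' column.
    reader = csv.DictReader(fasta_txt_handle, delimiter="\t")
    flagged = []
    for row in reader:
        name = row["name"]
        looks_like_path = "/" in name or "\\" in name
        has_extension = name.endswith((".fa", ".fasta"))
        if looks_like_path or has_extension:
            flagged.append(name)
    return flagged


# Hypothetical file contents, for illustration only.
example = io.StringIO(
    "name\tpath\n"
    "genome_01\tgenomes/genome_01.fa\n"
    "input_pyANI/SAM1989335.fa\tinput_pyANI/SAM1989335.fa\n"
)
flagged = suspicious_names(example)
```

With a real file, the same function can be called on `open("fasta_path_pyANI.tab")`. Any flagged name would be worth renaming to a plain identifier before rerunning the command.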