Closed natalia-rodilla closed 4 months ago
Hi @natalia-rodilla,
Thank you very much for the detailed report and the test case. I was able to reproduce your error on my system.
As you have already discerned, this seems to be a bug that occurs only when a cluster contains a single genome. This is where it fails in the genomesimilarity.py code:
if len(cluster) == 1:
    return cluster[0]
The problem is that the pick_representative_with_largest_Qscore() function, which contains this code, seems to expect the cluster variable to be a list, when in fact it is a set. Sets do not support indexing, so cluster[0] fails.
The very simple fix was to cast cluster to a list before extracting its sole element, which I implemented in commit fbc6cf30a44515733ea1659153c60fcf8baa6c1a.
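For anyone who lands here later, a minimal sketch of the failure and the fix described in this reply. It is not the actual anvi'o code: the helper name pick_sole_member is hypothetical, and the genome names are only illustrative.

```python
def pick_sole_member(cluster):
    """Hypothetical helper: return the only genome of a single-genome cluster."""
    if len(cluster) == 1:
        # list() makes indexing work whether cluster is a set, list, or tuple
        return list(cluster)[0]
    raise ValueError("expected a cluster with exactly one genome")


cluster = {"UW8_POB"}  # a set, as in the failing case

try:
    cluster[0]  # what the original code attempted
    indexing_failed = False
except TypeError:
    # sets are unordered and cannot be indexed
    indexing_failed = True

representative = pick_sole_member(cluster)
print(indexing_failed, representative)
```

An alternative with the same effect would be next(iter(cluster)), which avoids building a temporary list; the cast to a list is what the reply describes.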
This is the output I get on your test set after running with the fixed code in the development branch of anvi'o:
Run mode .....................................: fastANI
CITATION
===============================================
Anvi'o will use 'fastANI' by Jain et al. (DOI: 10.1038/s41467-018-07641-9) to
compute ANI. If you publish your findings, please do not forget to properly
credit their work.
[fastANI] Kmer size ..........................: 16
[fastANI] Fragment length ....................: 3,000
[fastANI] Min fraction of alignment ..........: 0.25
[fastANI] Num threads to use .................: 1
[fastANI] Log file path ......................: /var/folders/nc/7dlw5z2j16q3s14586qddwl8nhxpgh/T/tmpeywzl020
fastANI similarity metric ....................: calculated
Number of genomes considered .................: 7
Number of redundant genomes ..................: 5
Final number of dereplicated genomes .........: 2
ANI RESULTS
===============================================
* Matrix and clustering of 'ani' written to output directory
* Matrix and clustering of 'alignment fraction' written to output directory
* Matrix and clustering of 'mapping fragments' written to output directory
* Matrix and clustering of 'total fragments' written to output directory
* Cleaning up the temp directory (you can use `--debug` if you would like to keep
it for testing purposes)
The similarity scores output gets written, and the resulting representative genomes for each cluster in derep99Q_test10/GENOMES/ are BA_92.fa and UW8_POB.fa. :)
So if you install anvio-dev by following the instructions here: https://anvio.org/install/linux/dev/ and pull the latest commits to the repository, you will be able to run this program without this error.
Short description of the problem
anvi-dereplicate-genomes fails when some genomes are not grouped in a cluster. See discord: https://discord.com/channels/1002537821212512296/1205459293932093450
anvi'o version
Anvi'o .......................................: marie (v8)
Python .......................................: 3.10.13
Profile database .............................: 38
Contigs database .............................: 21
Pan database .................................: 16
Genome data storage ..........................: 7
Auxiliary data storage .......................: 2
Structure database ...........................: 2
Metabolic modules database ...................: 4
tRNA-seq database ............................: 2
System info
Which operating system you are using? Ubuntu 18.04.6 LTS
How did you install anvi'o? Using conda
Detailed description of the issue
I tried to run anvi-dereplicate-genomes on a collection of (66) external genomes. I expected some of them to be clustered together and others to be the only ones in their cluster. I got an error instead.
I tried using pyANI and got the same error. In neither case did I obtain the similarity results, although I had previously run anvi-compute-genome-similarity on the same set of external genomes successfully.
When I ran anvi-dereplicate-genomes with --representative-method set to length or centrality (default) on the same set of genomes, I didn't have any issues.
When I tried to subset some genomes for a reproducible example, I managed to narrow down the issue. If I test only genomes that cluster with others (in one or several clusters), it runs without problems. If I test genomes that end up alone in a cluster, the error appears, even if only one genome is not clustered together with others.
Command, messages and traceback text:
Files / commands to reproduce the issue
Command:
anvi-dereplicate-genomes -e test9-external-genomes.txt -o derep99Q_test/ --program fastANI --similarity-threshold 0.99 --representative-method Qscore
I uploaded the files here: https://drive.google.com/drive/folders/1mHq6k2pNwlzufIlT_tSDnuKJpAtlSh6i?usp=drive_link
The "problematic" genome in this case is the last one in the genomes file.