[BUG] anvi-dereplicate-genomes python? error

Short description of the problem

anvi-dereplicate-genomes fails when some genomes are not grouped in a cluster. See discord: https://discord.com/channels/1002537821212512296/1205459293932093450

anvi'o version

Anvi'o .......................................: marie (v8)
Python .......................................: 3.10.13

Profile database .............................: 38
Contigs database .............................: 21
Pan database .................................: 16
Genome data storage ..........................: 7
Auxiliary data storage .......................: 2
Structure database ...........................: 2
Metabolic modules database ...................: 4
tRNA-seq database ............................: 2

System info

Which operating system are you using? Ubuntu 18.04.6 LTS
How did you install anvi'o? Using conda

Detailed description of the issue

I tried to run anvi-dereplicate-genomes on a collection of 66 external genomes. I expected some of them to be clustered together and others to be the only member of their cluster. Instead, I got the error shown in the traceback below.

I tried using pyANI and got the same error. In neither case did I obtain the similarity results, although I had previously run anvi-compute-genome-similarity successfully on the same set of external genomes.

When I ran anvi-dereplicate-genomes setting --representative-method to length or centrality (default) on the same set of genomes, I didn't have any issues.

When I subset the genomes to build a reproducible example, I managed to narrow down the issue. If every genome I test clusters with others (in one or several clusters), the program runs without problems. If any genome ends up alone in its cluster, the error appears, even when only one genome is left unclustered.

Command, messages and traceback text:

anvi-dereplicate-genomes -e test9-external-genomes.txt -o derep99Q_test10/ --program fastANI --similarity-threshold 0.99 --representative-method Qscore
Run mode .....................................: fastANI

CITATION
===============================================
Anvi'o will use 'fastANI' by Jain et al. (DOI: 10.1038/s41467-018-07641-9) to
compute ANI. If you publish your findings, please do not forget to properly
credit their work.

[fastANI] Kmer size ..........................: 16
[fastANI] Fragment length ....................: 3,000
[fastANI] Min fraction of alignment ..........: 0.25
[fastANI] Num threads to use .................: 1
[fastANI] Log file path ......................: /tmp/tmp1nhwi5cf

fastANI similarity metric ....................: calculated
Number of genomes considered .................: 7
[09 Feb 24 16:11:21 Dereplication] All 21 pairwise comparisons have been made    ETA: None
Traceback (most recent call last):
  File "/home/bioinfoteam/anaconda3/envs/anvio-8/bin/anvi-dereplicate-genomes", line 118, in <module>
    derep.process()
  File "/home/bioinfoteam/anaconda3/envs/anvio-8/lib/python3.10/site-packages/anvio/genomesimilarity.py", line 390, in process
    self.dereplicate()
  File "/home/bioinfoteam/anaconda3/envs/anvio-8/lib/python3.10/site-packages/anvio/genomesimilarity.py", line 511, in dereplicate
    self.cluster_to_representative = self.get_representative_for_each_cluster()
  File "/home/bioinfoteam/anaconda3/envs/anvio-8/lib/python3.10/site-packages/anvio/genomesimilarity.py", line 595, in get_representative_for_each_cluster
    representative_name = self.pick_representative_with_largest_Qscore(cluster)
  File "/home/bioinfoteam/anaconda3/envs/anvio-8/lib/python3.10/site-packages/anvio/genomesimilarity.py", line 549, in pick_representative_with_largest_Qscore
    return cluster[0]
TypeError: 'set' object is not subscriptable

Files / commands to reproduce the issue

Command: anvi-dereplicate-genomes -e test9-external-genomes.txt -o derep99Q_test/ --program fastANI --similarity-threshold 0.99 --representative-method Qscore

I uploaded the files here: https://drive.google.com/drive/folders/1mHq6k2pNwlzufIlT_tSDnuKJpAtlSh6i?usp=drive_link The "problematic" genome in this case is the last one in the genomes file.

Hi @natalia-rodilla,

Thank you very much for the detailed report and the test case. I was able to reproduce your error on my system.

As you have discerned already, it seems to be a bug that happens only when there is a single genome in a given cluster. This is where it is failing in the genomesimilarity.py code:

if len(cluster) == 1:
    return cluster[0]

The problem is that the pick_representative_with_largest_Qscore() function, which contains this code, seems to expect the cluster variable to be a list when it is in fact a set, and a set cannot be indexed with [0].

The very simple fix was to convert cluster to a list before extracting its sole element, which I implemented in commit fbc6cf30a44515733ea1659153c60fcf8baa6c1a.
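For anyone curious, here is a minimal standalone sketch of the failure and of the kind of fix described above. It is not the actual anvi'o code: the helper name pick_sole_member and the genome name are made up for illustration, and only the single-genome branch is modeled.

```python
def pick_sole_member(cluster):
    """Hypothetical helper (not part of anvi'o): return the only member
    of a single-element cluster, whether it is a set or a list."""
    if len(cluster) == 1:
        # Convert to a list first, because sets do not support indexing.
        return list(cluster)[0]
    raise ValueError("expected a cluster with exactly one genome")


# A cluster holding a single genome, stored as a set (made-up name).
cluster = {"genome_A"}

# Indexing the set directly reproduces the error from the traceback.
try:
    cluster[0]
except TypeError as e:
    error_message = str(e)

# Converting to a list first works.
representative = pick_sole_member(cluster)
print(error_message)
print(representative)
```

An equivalent way to get the single element without building a list is next(iter(cluster)); both approaches work for sets and lists alike.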

This is the output I get on your test set after running with the fixed code in the development branch of anvi'o:

Run mode .....................................: fastANI

CITATION
===============================================
Anvi'o will use 'fastANI' by Jain et al. (DOI: 10.1038/s41467-018-07641-9) to
compute ANI. If you publish your findings, please do not forget to properly
credit their work.

[fastANI] Kmer size ..........................: 16
[fastANI] Fragment length ....................: 3,000
[fastANI] Min fraction of alignment ..........: 0.25
[fastANI] Num threads to use .................: 1
[fastANI] Log file path ......................: /var/folders/nc/7dlw5z2j16q3s14586qddwl8nhxpgh/T/tmpeywzl020

fastANI similarity metric ....................: calculated
Number of genomes considered .................: 7
Number of redundant genomes ..................: 5
Final number of dereplicated genomes .........: 2

ANI RESULTS
===============================================
* Matrix and clustering of 'ani' written to output directory
* Matrix and clustering of 'alignment fraction' written to output directory
* Matrix and clustering of 'mapping fragments' written to output directory
* Matrix and clustering of 'total fragments' written to output directory

* Cleaning up the temp directory (you can use `--debug` if you would like to keep
  it for testing purposes)

The similarity scores output gets written, and the resulting representative genomes for each cluster in derep99Q_test10/GENOMES/ are BA_92.fa and UW8_POB.fa. :)

So if you install anvio-dev by following the instructions at https://anvio.org/install/linux/dev/ and pull the latest commits to the repository, you will be able to run this program without this error.
