merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0

[BUG] anvio-cluster-contigs fails with CONCOCT #2154

Closed Ge0rges closed 6 months ago

Ge0rges commented 11 months ago

Short description of the problem

This issue is meant to represent the following Discord thread. I encountered this error too and decided to open an issue since nobody else had. It seems anvi'o is not interacting with CONCOCT properly.

anvi'o version

Anvi'o .......................................: marie (v8-dev)
Python .......................................: 3.10.12

Profile database .............................: 38
Contigs database .............................: 22
Pan database .................................: 17
Genome data storage ..........................: 7
Auxiliary data storage .......................: 2
Structure database ...........................: 2
Metabolic modules database ...................: 4
tRNA-seq database ............................: 2

System info

Using Rocky Linux; installed following the dev instructions on the website.

Detailed description of the issue

In my case I ran anvi-cluster-contigs -p SAMPLES-MERGED/PROFILE.db -c CONTIGS.db --driver concoct -T 80 --clusters 10 -C METABINS --just-do-it. I then got a config error from anvi'o complaining that a file was missing. I went to the log and saw:

# CMD LINE: concoct --coverage_file /tmp/tmp83as40pm/contig_coverages.txt --composition_file /tmp/tmp83as40pm/sequence_contigs.fa --basename /tmp/tmp83as40pm --threads 80 --clusters 10
/usr/local/miniconda3/envs/anvio-dev/bin/concoct:4: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
  __import__('pkg_resources').run_script('concoct==1.1.0', 'concoct')
Up and running. Check /tmp/tmp83as40pm/log.txt for progress
Traceback (most recent call last):
  File "/usr/local/miniconda3/envs/anvio-dev/bin/concoct", line 4, in <module>
    __import__('pkg_resources').run_script('concoct==1.1.0', 'concoct')
  File "/usr/local/miniconda3/envs/anvio-dev/lib/python3.10/site-packages/pkg_resources/__init__.py", line 722, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/local/miniconda3/envs/anvio-dev/lib/python3.10/site-packages/pkg_resources/__init__.py", line 1561, in run_script
    exec(code, namespace, namespace)
  File "/localdata/local/miniconda3/envs/anvio-dev/lib/python3.10/site-packages/concoct-1.1.0-py3.10-linux-x86_64.egg/EGG-INFO/scripts/concoct", line 90, in <module>
    results = main(args)
  File "/localdata/local/miniconda3/envs/anvio-dev/lib/python3.10/site-packages/concoct-1.1.0-py3.10-linux-x86_64.egg/EGG-INFO/scripts/concoct", line 37, in main
    transform_filter, pca = perform_pca(
  File "/usr/local/miniconda3/envs/anvio-dev/lib/python3.10/site-packages/concoct-1.1.0-py3.10-linux-x86_64.egg/concoct/transform.py", line 5, in perform_pca
    pca_object = PCA(n_components=nc, random_state=seed).fit(d)
  File "/usr/local/miniconda3/envs/anvio-dev/lib/python3.10/site-packages/sklearn/decomposition/_pca.py", line 435, in fit
    self._fit(X)
  File "/usr/local/miniconda3/envs/anvio-dev/lib/python3.10/site-packages/sklearn/decomposition/_pca.py", line 485, in _fit
    X = self._validate_data(
  File "/usr/local/miniconda3/envs/anvio-dev/lib/python3.10/site-packages/sklearn/base.py", line 548, in _validate_data
    self._check_feature_names(X, reset=reset)
  File "/usr/local/miniconda3/envs/anvio-dev/lib/python3.10/site-packages/sklearn/base.py", line 415, in _check_feature_names
    feature_names_in = _get_feature_names(X)
  File "/usr/local/miniconda3/envs/anvio-dev/lib/python3.10/site-packages/sklearn/utils/validation.py", line 1903, in _get_feature_names
    raise TypeError(
TypeError: Feature names are only supported if all input features have string names, but your input has ['int', 'str'] as feature name / column name types. If you want feature names to be stored and validated, you must convert them all to strings, by using X.columns = X.columns.astype(str) for example. Otherwise you can remove feature / column names from your input data, or convert them all to a non-string data type.

Files / commands to reproduce the issue

anvi-cluster-contigs -p SAMPLES-MERGED/PROFILE.db -c CONTIGS.db --driver concoct -T 80 --clusters 10 -C METABINS --just-do-it

My files are too big to share unfortunately.

Ge0rges commented 11 months ago

I confirmed this occurs in v8 as well.

Ge0rges commented 11 months ago

Ok figured this out. Turns out it is a known issue with CONCOCT due to the fact that it is no longer compatible with the latest versions of sklearn.
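
The TypeError in the log above is raised by a check on the types of the column labels handed to PCA. The following is a simplified, pure-Python illustration of that kind of check and of the string conversion the error message suggests. It is not scikit-learn's actual implementation, and the column labels and the function name label_type_names are made up for the example.

```python
def label_type_names(labels):
    """Return the sorted, de-duplicated type names of a list of column labels."""
    return sorted({type(label).__name__ for label in labels})


# Hypothetical column labels: integer positions mixed with a string name.
mixed_labels = [0, 1, 2, "coverage_sample_01"]

# Mixed label types are what the error message reports as ['int', 'str'].
print(label_type_names(mixed_labels))

# Converting every label to a string leaves a single type, which is the
# remedy the message proposes (X.columns = X.columns.astype(str) in pandas).
string_labels = [str(label) for label in mixed_labels]
print(label_type_names(string_labels))
```

Older scikit-learn releases tolerated the mixed labels, which is consistent with the downgrade reported later in this thread making the error go away.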

If CONCOCT is installed with Conda, this is not an issue because the Conda recipe caps the sklearn version. That is not the case if one follows the anvi'o instructions, however. @meren what's the best solution here? Either change the way CONCOCT is installed to use Conda, or change the anvi'o instructions to A) use a Singularity container of CONCOCT (a pain) or B) cap the sklearn version of anvi'o (probably a pain later, since CONCOCT doesn't seem to be maintained). And there's always C) do nothing but print a warning.

Ge0rges commented 11 months ago

I confirmed this by doing pip install scikit-learn==1.1.0 in my anvi'o environment. After that, anvi-cluster-contigs completes successfully.
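
A quick way to see whether an environment is likely affected is to look at the installed scikit-learn version. The sketch below uses only the standard library. The cutoff of 1.2 is an assumption drawn from this thread (1.1.0 worked, 1.2.2 did not), and the helper names are hypothetical.

```python
import re
from importlib import metadata

# Assumption from this thread: 1.1.0 works with CONCOCT 1.1.0, 1.2.2 does not.
FIRST_FAILING_SERIES = (1, 2)


def parse_version(version):
    """Turn '1.2.2' (or '1.3.0rc1') into a tuple of leading integers."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def likely_works_with_concoct(version):
    """True if the version is below the series assumed to break CONCOCT."""
    return parse_version(version) < FIRST_FAILING_SERIES


def installed_sklearn_version():
    """Return the installed scikit-learn version string, or None if absent."""
    try:
        return metadata.version("scikit-learn")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_sklearn_version()
    if found is None:
        print("scikit-learn is not installed in this environment")
    elif likely_works_with_concoct(found):
        print(f"scikit-learn {found}: below the assumed cutoff")
    else:
        print(f"scikit-learn {found}: consider pip install scikit-learn==1.1.0")
```

Run it inside the activated anvi'o conda environment so that it inspects the same packages anvi-cluster-contigs will use.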

meren commented 11 months ago

Thank you very much for looking into this, @Ge0rges. I'll take a look and see if I can come up with a workaround for this. The current version of sklearn is 1.2.2. In the worst case scenario we can require 1.1.1.

Sabrin2020 commented 10 months ago

> I confirmed this by doing pip install scikit-learn==1.1.0 in my anvi'o environment. After that, anvi-cluster-contigs completes successfully.

I did the same and CONCOCT worked fine, but I wonder whether running pip install scikit-learn==1.1.0 could break anvi'o rules somewhere else?

meren commented 10 months ago

Since you were able to do the downgrade, it means the environment is stable. If this version breaks something, you will certainly notice that :) I think you're good.

Sabrin2020 commented 10 months ago

I am getting a new error with the ecophylo workflow, which was working fine before:

RuleException:
TypeError in file /user/suga8254/.conda/envs/anvio-8/lib/python3.10/site-packages/anvio/workflows/ecophylo/Snakefile, line 358:
StringMethods.rsplit() takes from 1 to 2 positional arguments but 3 positional arguments (and 1 keyword-only argument) were given
  File "/user/suga8254/.conda/envs/anvio-8/lib/python3.10/site-packages/anvio/workflows/ecophylo/Snakefile", line 358, in __rule_process_hmm_hits
  File "/user/suga8254/.conda/envs/anvio-8/lib/python3.10/site-packages/pandas/core/strings/accessor.py", line 136, in wrapper
  File "/user/suga8254/.conda/envs/anvio-8/lib/python3.10/concurrent/futures/thread.py", line 58, in run

That is why I am wondering!

Ge0rges commented 10 months ago

Hi @Sabrin2020, can you confirm that this was caused just by changing the scikit-learn version? I.e., if you upgrade scikit-learn, does it work again?

Sabrin2020 commented 10 months ago

I just did that as a test by going back to scikit-learn==1.2.2, and nothing changed: the ecophylo error still persists.

Ge0rges commented 10 months ago

I would open a separate issue for your error, with steps to reproduce.

meren commented 10 months ago

This is weird. Under no circumstances should a change in the scikit-learn version number cause an error in the threads module of Python. These two things are probably independent :( But as a test, you can reinstall the anvi'o environment from scratch to see if you can reproduce it, @Sabrin2020.

Sabrin2020 commented 10 months ago

thanks @meren @Ge0rges I will reinstall the anvi'o environment from scratch

Sabrin2020 commented 10 months ago

@meren I reinstalled the anvi'o environment and no longer get this error: StringMethods.rsplit() takes from 1 to 2 positional arguments but 3 positional arguments (and 1 keyword-only argument) were given. I have not installed CONCOCT in the same environment yet.

Ge0rges commented 7 months ago

@meren it may be useful to add a warning about this somewhere near the CONCOCT installation instructions on the website.

meren commented 7 months ago

I agree. Since we are no longer doing a lot of genome binning in the lab, those parts of the code and documentation are at the mercy of those who are using them outside :) If someone could formulate a warning text, I could immediately put it somewhere in our installation instructions.

Ge0rges commented 7 months ago

Sure, meant to be somewhere near the CONCOCT install instructions:

Users should note that they may encounter a TypeError when running CONCOCT. Please see here for more information about this. Here's the gist of the fix: at the end of your install, and while in your conda environment, run pip install scikit-learn==1.1.0. Please let us know if this fix breaks any other part of anvi'o. As of v8 we don't think it does.

meren commented 7 months ago

Thank you @Ge0rges. I updated the installation instructions. Now there is a little note that looks like this:

[image: screenshot of the note added to the installation instructions]