vscentrum / vsc-software-stack

Central repository of easyconfigs used in the software installations on VSC clusters.

anvio #280

Closed boegel closed 6 months ago

boegel commented 6 months ago

I spent quite a bit of time trying to get anvio v8 working on top of foss/2023a, but ran into trouble because scikit-learn and pandas (in SciPy-bundle) were too new.

When using scikit-learn-1.3.1-gfbf-2023a.eb as a dependency for anvio-8-foss-2023a.eb, the "anvi-self-test --suite mini --no-interactive" sanity check command failed with "ValueError: node array from the pickle has an incompatible dtype":

```
Traceback (most recent call last):
  File "/user/gent/400/vsc40023/eb_arcaninescratch/RHEL8/skylake-ib/software/anvio/8-foss-2023a/bin/anvi-interactive", line 122, in <module>
    d = interactive.Interactive(args)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/user/gent/400/vsc40023/eb_arcaninescratch/RHEL8/skylake-ib/software/anvio/8-foss-2023a/lib/python3.11/site-packages/anvio/interactive.py", line 211, in __init__
    self.completeness = Completeness(self.contigs_db_path)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/user/gent/400/vsc40023/eb_arcaninescratch/RHEL8/skylake-ib/software/anvio/8-foss-2023a/lib/python3.11/site-packages/anvio/completeness.py", line 45, in __init__
    self.SCG_domain_predictor = scgdomainclassifier.Predict(argparse.Namespace(), run=terminal.Run(verbose=False), progress=self.progress)
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/user/gent/400/vsc40023/eb_arcaninescratch/RHEL8/skylake-ib/software/anvio/8-foss-2023a/lib/python3.11/site-packages/anvio/scgdomainclassifier.py", line 234, in __init__
    SCGDomainClassifier.__init__(self, args, run, progress)
  File "/user/gent/400/vsc40023/eb_arcaninescratch/RHEL8/skylake-ib/software/anvio/8-foss-2023a/lib/python3.11/site-packages/anvio/scgdomainclassifier.py", line 73, in __init__
    self.rf.initialize_classifier()
  File "/user/gent/400/vsc40023/eb_arcaninescratch/RHEL8/skylake-ib/software/anvio/8-foss-2023a/lib/python3.11/site-packages/anvio/learning.py", line 103, in initialize_classifier
    classifier_obj = pickle.load(open(self.classifier_object_path, 'rb'))
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "sklearn/tree/_tree.pyx", line 728, in sklearn.tree._tree.Tree.__setstate__
  File "sklearn/tree/_tree.pyx", line 1432, in sklearn.tree._tree._check_node_ndarray
ValueError: node array from the pickle has an incompatible dtype:
- expected: {'names': ['left_child', 'right_child', 'feature', 'threshold', 'impurity', 'n_node_samples', 'weighted_n_node_samples', 'missing_go_to_left'], 'formats': ['
```

When using scikit-learn 1.2.2 as an extension in anvio-8-foss-2023a.eb, the error changed to "gzip.BadGzipFile: Incorrect length of data produced":

```
Traceback (most recent call last):
  File "/user/gent/400/vsc40023/eb_arcaninescratch/RHEL8/skylake-ib/software/anvio/8-foss-2023a/bin/anvi-summarize", line 123, in <module>
    main(args)
  File "/user/gent/400/vsc40023/eb_arcaninescratch/RHEL8/skylake-ib/software/anvio/8-foss-2023a/bin/anvi-summarize", line 64, in main
    summary = summarizer.ProfileSummarizer(args)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/user/gent/400/vsc40023/eb_arcaninescratch/RHEL8/skylake-ib/software/anvio/8-foss-2023a/lib/python3.11/site-packages/anvio/summarizer.py", line 707, in __init__
    DatabasesMetaclass.__init__(self, self.args, self.run, self.progress)
  File "/user/gent/400/vsc40023/eb_arcaninescratch/RHEL8/skylake-ib/software/anvio/8-foss-2023a/lib/python3.11/site-packages/anvio/dbops.py", line 3778, in __init__
    ProfileSuperclass.__init__(self, self.args, self.run, self.progress)
  File "/user/gent/400/vsc40023/eb_arcaninescratch/RHEL8/skylake-ib/software/anvio/8-foss-2023a/lib/python3.11/site-packages/anvio/dbops.py", line 3014, in __init__
    self.init_gene_level_coverage_stats_dicts(outliers_threshold=outliers_threshold,
  File "/user/gent/400/vsc40023/eb_arcaninescratch/RHEL8/skylake-ib/software/anvio/8-foss-2023a/lib/python3.11/site-packages/anvio/dbops.py", line 3191, in init_gene_level_coverage_stats_dicts
    self.init_split_coverage_values_per_nt_dict(split_names)
  File "/user/gent/400/vsc40023/eb_arcaninescratch/RHEL8/skylake-ib/software/anvio/8-foss-2023a/lib/python3.11/site-packages/anvio/dbops.py", line 3254, in init_split_coverage_values_per_nt_dict
    self.split_coverage_values_per_nt_dict[split_name] = self.split_coverage_values.get(split_name)
                                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/user/gent/400/vsc40023/eb_arcaninescratch/RHEL8/skylake-ib/software/anvio/8-foss-2023a/lib/python3.11/site-packages/anvio/auxiliarydataops.py", line 149, in get
    coverage_array = utils.convert_binary_blob_to_numpy_array(blob, dtype=self.coverage_dtype)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/user/gent/400/vsc40023/eb_arcaninescratch/RHEL8/skylake-ib/software/anvio/8-foss-2023a/lib/python3.11/site-packages/anvio/utils.py", line 782, in convert_binary_blob_to_numpy_array
    return np.frombuffer(gzip.decompress(blob), dtype=dtype)
                         ^^^^^^^^^^^^^^^^^^^^^
  File "/user/gent/400/vsc40023/eb_arcaninescratch/RHEL8/skylake-ib/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/gzip.py", line 614, in decompress
    raise BadGzipFile("Incorrect length of data produced")
gzip.BadGzipFile: Incorrect length of data produced
```
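
For context, this message comes from the check that Python's gzip module does on the length recorded at the end of a gzip stream. The stdlib-only sketch below triggers the same exception type by deliberately corrupting that length field. `corrupt_length_field` is a hypothetical helper made up for illustration, and this is not meant as a diagnosis of what actually happened to the blob in the anvio database.

```python
import gzip
import struct


def corrupt_length_field(blob, delta=1):
    """Return a copy of a gzip blob whose trailing length field is off by `delta`.

    The last 4 bytes of a gzip member hold the uncompressed size (mod 2**32),
    stored as a little-endian unsigned integer.
    """
    (size,) = struct.unpack("<I", blob[-4:])
    return blob[:-4] + struct.pack("<I", (size + delta) & 0xFFFFFFFF)


payload = bytes(range(16))
blob = gzip.compress(payload)

# An intact blob round-trips without problems.
assert gzip.decompress(blob) == payload

# A blob whose recorded length disagrees with the decompressed data is rejected.
try:
    gzip.decompress(corrupt_length_field(blob))
except gzip.BadGzipFile as err:
    print(f"rejected: {err}")
```

In other words, the exception says that the decompressed data does not match the length stored alongside it; it does not say which component produced the mismatch.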

My best guess is that this error is caused by a pandas version that is too recent (2.0.3, as included in SciPy-bundle v2023.07, instead of the expected pandas 1.4.4).

These problems do not occur when using foss/2022b and the standard scikit-learn 1.2.1 + pandas 1.4.2 (in SciPy-bundle 2022.05).
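
Since both failures appear to be tied to library versions, it can help to check what is actually installed in an environment when comparing toolchains. Below is a minimal sketch, assuming the pins in `EXPECTED`: pandas 1.4.4 is the version mentioned above as expected, while scikit-learn 1.2.2 is only assumed here because it is the version that was tried as an extension. `find_version_mismatches` and `EXPECTED` are hypothetical names, not part of anvio or EasyBuild.

```python
from importlib import metadata


def find_version_mismatches(expected, get_version=metadata.version):
    """Return {package: (expected, installed)} for every pin that is not met.

    `installed` is None when the package is not installed at all.
    `get_version` can be swapped out, which makes the check easy to test.
    """
    mismatches = {}
    for package, wanted in expected.items():
        try:
            installed = get_version(package)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != wanted:
            mismatches[package] = (wanted, installed)
    return mismatches


# Assumed pins for illustration only, based on the versions named in this issue.
EXPECTED = {"scikit-learn": "1.2.2", "pandas": "1.4.4"}

if __name__ == "__main__":
    for package, (wanted, installed) in find_version_mismatches(EXPECTED).items():
        print(f"{package}: expected {wanted}, found {installed}")
```

Running this with the anvio module loaded prints one line per package that deviates from the assumed pins, and nothing if everything matches.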

boegel commented 6 months ago

https://github.com/easybuilders/easybuild-easyconfigs/pull/19771

boegel commented 6 months ago

PR merged, software installed, so closing...