obidam / pyxpcm

A Python implementation of Profile Classification Modelling (PCM) for xarray
https://pyxpcm.readthedocs.io
GNU General Public License v3.0

"Fit" method triggers "TypeError: unsupported operand type(s) for +: 'int' and 'str'" #37

Closed bbarcelollull closed 1 year ago

bbarcelollull commented 1 year ago

Hi @gmaze,

I want to use the pyXpcm tool to cluster glider profiles in the Mediterranean Sea. I have installed the pyXpcm module on my computer (in a virtual environment) with the required dependencies (although with Python 3.7):

    Package                       Version
    ----------------------------- ---------
    alabaster                     0.7.12
    appnope                       0.1.3
    Babel                         2.11.0
    backcall                      0.2.0
    certifi                       2022.12.7
    cftime                        1.6.2
    charset-normalizer            2.1.1
    cycler                        0.11.0
    dask                          0.16.0
    decorator                     5.1.1
    docutils                      0.19
    idna                          3.4
    imagesize                     1.4.1
    importlib-metadata            5.1.0
    ipython                       7.34.0
    jedi                          0.18.2
    Jinja2                        3.1.2
    joblib                        1.2.0
    kiwisolver                    1.4.4
    MarkupSafe                    2.1.1
    matplotlib                    3.0.0
    matplotlib-inline             0.1.6
    netCDF4                       1.6.2
    numpy                         1.21.6
    numpydoc                      1.5.0
    packaging                     22.0
    pandas                        0.24.0
    parso                         0.8.3
    pexpect                       4.8.0
    pickleshare                   0.7.5
    pip                           22.3.1
    prompt-toolkit                3.0.36
    ptyprocess                    0.7.0
    Pygments                      2.13.0
    pyparsing                     3.0.9
    python-dateutil               2.8.2
    pytz                          2022.6
    pyxpcm                        0.4.1
    requests                      2.28.1
    scikit-learn                  1.0.2
    scipy                         1.7.3
    seaborn                       0.9.0
    setuptools                    65.6.3
    six                           1.16.0
    snowballstemmer               2.2.0
    Sphinx                        5.3.0
    sphinxcontrib-applehelp       1.0.2
    sphinxcontrib-devhelp         1.0.2
    sphinxcontrib-htmlhelp        2.0.0
    sphinxcontrib-jsmath          1.0.1
    sphinxcontrib-qthelp          1.0.3
    sphinxcontrib-serializinghtml 1.1.5
    threadpoolctl                 3.1.0
    toolz                         0.12.0
    traitlets                     5.7.1
    typing_extensions             4.4.0
    urllib3                       1.26.13
    wcwidth                       0.2.5
    wheel                         0.38.4
    xarray                        0.12.1
    zipp                          3.11.0

However, when trying to run this example: https://pyxpcm.readthedocs.io/en/latest/example.html

I cannot run this line:

m.fit(ds, features=features_in_ds, dim=features_zdim)

Because I have the following error:

    TypeError                                 Traceback (most recent call last)
    ~/HOME_SCIENCE/Scripts/2023_glider_clustering/PCM_run_on_environment/PCM_example.py in <module>
         24 features_zdim='DEPTH'
         25
    ---> 26 m.fit(ds, features=features_in_ds, dim=features_zdim)

    ~/HOME_SCIENCE/Scripts/2023_glider_clustering/PCM_run_on_environment/venv_pyXpcm/lib/python3.7/site-packages/pyxpcm/models.py in fit(self, ds, features, dim)
        859         with self._context('fit', self._context_args) :
        860             # PRE-PROCESSING:
    --> 861             X, sampling_dims = self.preprocessing(ds, features=features, dim=dim, action='fit')
        862
        863             # CLASSIFICATION-MODEL TRAINING:

    ~/HOME_SCIENCE/Scripts/2023_glider_clustering/PCM_run_on_environment/venv_pyXpcm/lib/python3.7/site-packages/pyxpcm/models.py in preprocessing(self, ds, features, dim, action, mask)
        785                                                                dim=dim,
        786                                                                feature_name=feature_in_pcm,
    --> 787                                                                action=action)
        788                     xlabel = ["%s_%i"%(feature_in_pcm, i) for i in range(0, x.shape[1])]
        789                     if self._debug:

    ~/HOME_SCIENCE/Scripts/2023_glider_clustering/PCM_run_on_environment/venv_pyXpcm/lib/python3.7/site-packages/pyxpcm/models.py in preprocessing_this(self, da, dim, feature_name, action)
        637             # MAKE THE ND-ARRAY A 2D-ARRAY
        638             with self._context(this_context + '.1-ravel', self._context_args):
    --> 639                 X, z, sampling_dims = self.ravel(da, dim=dim, feature_name=feature_name)
        640                 if self._debug:
        641                     print("\t", "X RAVELED with success", str(LogDataType(X)))

    ~/HOME_SCIENCE/Scripts/2023_glider_clustering/PCM_run_on_environment/venv_pyXpcm/lib/python3.7/site-packages/pyxpcm/models.py in ravel(self, da, dim, feature_name)
        358             z = da[dim].values
        359
    --> 360         X = X.chunk(chunks={'sampling': self._props['chunk_size']})
        361         return X, z, sampling_dims
        362

    ~/HOME_SCIENCE/Scripts/2023_glider_clustering/PCM_run_on_environment/venv_pyXpcm/lib/python3.7/site-packages/xarray/core/dataarray.py in chunk(self, chunks, name_prefix, token, lock)
        812
        813         ds = self._to_temp_dataset().chunk(chunks, name_prefix=name_prefix,
    --> 814                                            token=token, lock=lock)
        815         return self._from_temp_dataset(ds)
        816

    ~/HOME_SCIENCE/Scripts/2023_glider_clustering/PCM_run_on_environment/venv_pyXpcm/lib/python3.7/site-packages/xarray/core/dataset.py in chunk(self, chunks, name_prefix, token, lock)
       1484
       1485         variables = OrderedDict([(k, maybe_chunk(k, v, chunks))
    -> 1486                                  for k, v in self.variables.items()])
       1487         return self._replace(variables)
       1488

    ~/HOME_SCIENCE/Scripts/2023_glider_clustering/PCM_run_on_environment/venv_pyXpcm/lib/python3.7/site-packages/xarray/core/dataset.py in <listcomp>(.0)
       1484
       1485         variables = OrderedDict([(k, maybe_chunk(k, v, chunks))
    -> 1486                                  for k, v in self.variables.items()])
       1487         return self._replace(variables)
       1488

    ~/HOME_SCIENCE/Scripts/2023_glider_clustering/PCM_run_on_environment/venv_pyXpcm/lib/python3.7/site-packages/xarray/core/dataset.py in maybe_chunk(name, var, chunks)
       1479                 token2 = tokenize(name, token if token else var._data)
       1480                 name2 = '%s%s-%s' % (name_prefix, name, token2)
    -> 1481                 return var.chunk(chunks, name=name2, lock=lock)
       1482             else:
       1483                 return var

    ~/HOME_SCIENCE/Scripts/2023_glider_clustering/PCM_run_on_environment/venv_pyXpcm/lib/python3.7/site-packages/xarray/core/variable.py in chunk(self, chunks, name, lock)
        893             data = indexing.ImplicitToExplicitIndexingAdapter(
        894                 data, indexing.OuterIndexer)
    --> 895             data = da.from_array(data, chunks, name=name, lock=lock)
        896
        897         return type(self)(self.dims, data, self._attrs, self._encoding,

    ~/HOME_SCIENCE/Scripts/2023_glider_clustering/PCM_run_on_environment/venv_pyXpcm/lib/python3.7/site-packages/dask/array/core.py in from_array(x, chunks, name, lock, asarray, fancy, getitem)
       1913     >>> a = da.from_array(x, chunks=(1000, 1000), lock=True)  # doctest: +SKIP
       1914     """
    -> 1915     chunks = normalize_chunks(chunks, x.shape)
       1916     if len(chunks) != len(x.shape):
       1917         raise ValueError("Input array has %d dimensions but the supplied "

    ~/HOME_SCIENCE/Scripts/2023_glider_clustering/PCM_run_on_environment/venv_pyXpcm/lib/python3.7/site-packages/dask/array/core.py in normalize_chunks(chunks, shape)
       1862         chunks = sum((blockdims_from_blockshape((s,), (c,))
       1863                       if not isinstance(c, (tuple, list)) else (c,)
    -> 1864                       for s, c in zip(shape, chunks)), ())
       1865     for c in chunks:
       1866         if not c:

    ~/HOME_SCIENCE/Scripts/2023_glider_clustering/PCM_run_on_environment/venv_pyXpcm/lib/python3.7/site-packages/dask/array/core.py in <genexpr>(.0)
       1862         chunks = sum((blockdims_from_blockshape((s,), (c,))
       1863                       if not isinstance(c, (tuple, list)) else (c,)
    -> 1864                       for s, c in zip(shape, chunks)), ())
       1865     for c in chunks:
       1866         if not c:

    ~/HOME_SCIENCE/Scripts/2023_glider_clustering/PCM_run_on_environment/venv_pyXpcm/lib/python3.7/site-packages/dask/array/core.py in blockdims_from_blockshape(shape, chunks)
        919     if shape is None:
        920         raise TypeError("Must supply shape= keyword argument")
    --> 921     if np.isnan(sum(shape)) or np.isnan(sum(chunks)):
        922         raise ValueError("Array chunk sizes are unknown. shape: %s, chunks: %s"
        923                          % (shape, chunks))

    TypeError: unsupported operand type(s) for +: 'int' and 'str'

Do you know how I can solve it?

Thanks! Bàrbara

gmaze commented 1 year ago

Hi @bbarcelollull

1- Did you try to run the code with the tutorial data, like in the example? If it runs, your install is correct and it is your data setup that needs to be fixed.

2- If the example does run, then it's hard to help without more information. Could you please post the piece of code defining features_in_ds and features_zdim, and paste a print of the ds dataset?

g

bbarcelollull commented 1 year ago

I am trying to run the tutorial example with the same data you provide.

Here is the code that I ran:

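from pyxpcm.models import pcm
import numpy as np
import pyxpcm

# Vertical axis of the PCM: levels from 0 down to -990 in steps of 10
z = np.arange(0.,-1000,-10.)
# Both PCM features are defined on the same vertical axis
pcm_features = {'temperature': z, 'salinity':z}

# Create a PCM with 8 classes
m = pcm(K=8, features=pcm_features)
print(m)

# Load the Argo tutorial dataset shipped with pyxpcm
ds = pyxpcm.tutorial.open_dataset('argo').load()
print(ds)

# Map each PCM feature name to its variable name in the dataset
features_in_ds = {'temperature': 'TEMP', 'salinity': 'PSAL'}

# Name of the vertical dimension in the dataset
features_zdim='DEPTH'

# Fit the model: this is the call that raises the TypeError
m.fit(ds, features=features_in_ds, dim=features_zdim)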

And here is what I get:

/Users/bbarcelo/HOME_SCIENCE/Scripts/2023_glider_clustering/PCM_run_on_environment/venv_pyXpcm/lib/python3.7/site-packages/pyxpcm/plot.py:30: UserWarning: pyXpcm requires matplotlib installed for plotting functionality
  warnings.warn("pyXpcm requires matplotlib installed for plotting functionality")
/Users/bbarcelo/HOME_SCIENCE/Scripts/2023_glider_clustering/PCM_run_on_environment/venv_pyXpcm/lib/python3.7/site-packages/pyxpcm/plot.py:38: UserWarning: pyXpcm requires cartopy installed for full plotting functionality
  warnings.warn("pyXpcm requires cartopy installed for full plotting functionality")
/Users/bbarcelo/HOME_SCIENCE/Scripts/2023_glider_clustering/PCM_run_on_environment/venv_pyXpcm/lib/python3.7/site-packages/matplotlib/__init__.py:886: MatplotlibDeprecationWarning: 
examples.directory is deprecated; in the future, examples will be found relative to the 'datapath' directory.
  "found relative to the 'datapath' directory.".format(key))
<pcm 'gmm' (K: 8, F: 2)>
Number of class: 8
Number of feature: 2
Feature names: odict_keys(['temperature', 'salinity'])
Fitted: False
Feature: 'temperature'
     Interpoler: <class 'pyxpcm.utils.Vertical_Interpolator'>
     Scaler: 'normal', <class 'sklearn.preprocessing._data.StandardScaler'>
     Reducer: True, <class 'sklearn.decomposition._pca.PCA'>
Feature: 'salinity'
     Interpoler: <class 'pyxpcm.utils.Vertical_Interpolator'>
     Scaler: 'normal', <class 'sklearn.preprocessing._data.StandardScaler'>
     Reducer: True, <class 'sklearn.decomposition._pca.PCA'>
Classifier: 'gmm', <class 'sklearn.mixture._gaussian_mixture.GaussianMixture'>
<xarray.Dataset>
Dimensions:    (DEPTH: 282, N_PROF: 7560)
Coordinates:
  * DEPTH      (DEPTH) float32 0.0 -5.0 -10.0 -15.0 ... -1395.0 -1400.0 -1405.0
Dimensions without coordinates: N_PROF
Data variables:
    LATITUDE   (N_PROF) float32 ...
    LONGITUDE  (N_PROF) float32 ...
    TIME       (N_PROF) datetime64[ns] ...
    DBINDEX    (N_PROF) float64 ...
    TEMP       (N_PROF, DEPTH) float32 ...
    PSAL       (N_PROF, DEPTH) float32 ...
    SIG0       (N_PROF, DEPTH) float32 ...
    BRV2       (N_PROF, DEPTH) float32 ...
Attributes:
    Sample test prepared by:  G. Maze
    Institution:              Ifremer/LOPS
    Data source DOI:          10.17882/42182
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-1-642f557d4184> in <module>
     17 features_zdim='DEPTH'
     18 
---> 19 m.fit(ds, features=features_in_ds, dim=features_zdim)

~/HOME_SCIENCE/Scripts/2023_glider_clustering/PCM_run_on_environment/venv_pyXpcm/lib/python3.7/site-packages/pyxpcm/models.py in fit(self, ds, features, dim)
    859         with self._context('fit', self._context_args) :
    860             # PRE-PROCESSING:
--> 861             X, sampling_dims = self.preprocessing(ds, features=features, dim=dim, action='fit')
    862 
    863             # CLASSIFICATION-MODEL TRAINING:

~/HOME_SCIENCE/Scripts/2023_glider_clustering/PCM_run_on_environment/venv_pyXpcm/lib/python3.7/site-packages/pyxpcm/models.py in preprocessing(self, ds, features, dim, action, mask)
    785                                                                dim=dim,
    786                                                                feature_name=feature_in_pcm,
--> 787                                                                action=action)
    788                     xlabel = ["%s_%i"%(feature_in_pcm, i) for i in range(0, x.shape[1])]
    789                     if self._debug:

~/HOME_SCIENCE/Scripts/2023_glider_clustering/PCM_run_on_environment/venv_pyXpcm/lib/python3.7/site-packages/pyxpcm/models.py in preprocessing_this(self, da, dim, feature_name, action)
    637             # MAKE THE ND-ARRAY A 2D-ARRAY
    638             with self._context(this_context + '.1-ravel', self._context_args):
--> 639                 X, z, sampling_dims = self.ravel(da, dim=dim, feature_name=feature_name)
    640                 if self._debug:
    641                     print("\t", "X RAVELED with success", str(LogDataType(X)))

~/HOME_SCIENCE/Scripts/2023_glider_clustering/PCM_run_on_environment/venv_pyXpcm/lib/python3.7/site-packages/pyxpcm/models.py in ravel(self, da, dim, feature_name)
    358             z = da[dim].values
    359 
--> 360         X = X.chunk(chunks={'sampling': self._props['chunk_size']})
    361         return X, z, sampling_dims
    362 

~/HOME_SCIENCE/Scripts/2023_glider_clustering/PCM_run_on_environment/venv_pyXpcm/lib/python3.7/site-packages/xarray/core/dataarray.py in chunk(self, chunks, name_prefix, token, lock)
    812 
    813         ds = self._to_temp_dataset().chunk(chunks, name_prefix=name_prefix,
--> 814                                            token=token, lock=lock)
    815         return self._from_temp_dataset(ds)
    816 

~/HOME_SCIENCE/Scripts/2023_glider_clustering/PCM_run_on_environment/venv_pyXpcm/lib/python3.7/site-packages/xarray/core/dataset.py in chunk(self, chunks, name_prefix, token, lock)
   1484 
   1485         variables = OrderedDict([(k, maybe_chunk(k, v, chunks))
-> 1486                                  for k, v in self.variables.items()])
   1487         return self._replace(variables)
   1488 

~/HOME_SCIENCE/Scripts/2023_glider_clustering/PCM_run_on_environment/venv_pyXpcm/lib/python3.7/site-packages/xarray/core/dataset.py in <listcomp>(.0)
   1484 
   1485         variables = OrderedDict([(k, maybe_chunk(k, v, chunks))
-> 1486                                  for k, v in self.variables.items()])
   1487         return self._replace(variables)
   1488 

~/HOME_SCIENCE/Scripts/2023_glider_clustering/PCM_run_on_environment/venv_pyXpcm/lib/python3.7/site-packages/xarray/core/dataset.py in maybe_chunk(name, var, chunks)
   1479                 token2 = tokenize(name, token if token else var._data)
   1480                 name2 = '%s%s-%s' % (name_prefix, name, token2)
-> 1481                 return var.chunk(chunks, name=name2, lock=lock)
   1482             else:
   1483                 return var

~/HOME_SCIENCE/Scripts/2023_glider_clustering/PCM_run_on_environment/venv_pyXpcm/lib/python3.7/site-packages/xarray/core/variable.py in chunk(self, chunks, name, lock)
    893             data = indexing.ImplicitToExplicitIndexingAdapter(
    894                 data, indexing.OuterIndexer)
--> 895             data = da.from_array(data, chunks, name=name, lock=lock)
    896 
    897         return type(self)(self.dims, data, self._attrs, self._encoding,

~/HOME_SCIENCE/Scripts/2023_glider_clustering/PCM_run_on_environment/venv_pyXpcm/lib/python3.7/site-packages/dask/array/core.py in from_array(x, chunks, name, lock, asarray, fancy, getitem)
   1913     >>> a = da.from_array(x, chunks=(1000, 1000), lock=True)  # doctest: +SKIP
   1914     """
-> 1915     chunks = normalize_chunks(chunks, x.shape)
   1916     if len(chunks) != len(x.shape):
   1917         raise ValueError("Input array has %d dimensions but the supplied "

~/HOME_SCIENCE/Scripts/2023_glider_clustering/PCM_run_on_environment/venv_pyXpcm/lib/python3.7/site-packages/dask/array/core.py in normalize_chunks(chunks, shape)
   1862         chunks = sum((blockdims_from_blockshape((s,), (c,))
   1863                       if not isinstance(c, (tuple, list)) else (c,)
-> 1864                       for s, c in zip(shape, chunks)), ())
   1865     for c in chunks:
   1866         if not c:

~/HOME_SCIENCE/Scripts/2023_glider_clustering/PCM_run_on_environment/venv_pyXpcm/lib/python3.7/site-packages/dask/array/core.py in <genexpr>(.0)
   1862         chunks = sum((blockdims_from_blockshape((s,), (c,))
   1863                       if not isinstance(c, (tuple, list)) else (c,)
-> 1864                       for s, c in zip(shape, chunks)), ())
   1865     for c in chunks:
   1866         if not c:

~/HOME_SCIENCE/Scripts/2023_glider_clustering/PCM_run_on_environment/venv_pyXpcm/lib/python3.7/site-packages/dask/array/core.py in blockdims_from_blockshape(shape, chunks)
    919     if shape is None:
    920         raise TypeError("Must supply shape= keyword argument")
--> 921     if np.isnan(sum(shape)) or np.isnan(sum(chunks)):
    922         raise ValueError("Array chunk sizes are unknown. shape: %s, chunks: %s"
    923                          % (shape, chunks))

TypeError: unsupported operand type(s) for +: 'int' and 'str'

gmaze commented 1 year ago

So it's the same error, hence this is surely due to your Python environment. A minimal environment like this one can be used.
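
For readers who hit the same traceback, here is a minimal sketch of why its last frame, `np.isnan(sum(chunks))`, can raise this exact message. It assumes that the chunk size reaching the old dask release is a string such as `"auto"`; the thread does not show the value of `self._props['chunk_size']`, so that value is an assumption, and `try_sum_chunks` is a hypothetical helper written only for this illustration.

```python
# Sketch of the failing expression `sum(chunks)` from the last traceback frame,
# reproduced in plain Python (no dask or xarray needed).
# ASSUMPTION: the chunk size is a string such as "auto"; the thread does not
# confirm the actual value stored in self._props['chunk_size'].

def try_sum_chunks(chunks):
    """Return the sum of `chunks`, or the TypeError message if summing fails."""
    try:
        return sum(chunks)
    except TypeError as err:
        return str(err)

# Integer chunk sizes can be summed without trouble.
int_result = try_sum_chunks((1000,))

# sum() starts from the integer 0, so a string entry triggers int + str.
str_result = try_sum_chunks(("auto",))

print(int_result)
print(str_result)
```

If this reading is right, the problem lies in the combination of package versions rather than in the data, which is consistent with the advice to rebuild the environment.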

bbarcelollull commented 1 year ago

Thanks @gmaze! Issue solved!

I downloaded this file: https://github.com/euroargodev/boundary_currents_pcm/blob/main/environment.yml

And I created the environment from the environment.yml file (changing the name of the environment that is on the first line):

conda env create -f environment.yml

Now I can run my code after activating the environment (named env_pyxpcm) in the terminal:

conda activate env_pyxpcm

Then to close the environment:

conda deactivate

Or I can open the Anaconda Navigator, select the environment in which I want to run my code (env_pyxpcm), and open and work with Spyder.
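
Once the environment is active, it can help to confirm which versions Python actually picks up. The following is a small sketch, not part of pyxpcm: `installed_versions` is a hypothetical helper, the list of package names is only an example taken from the traceback above, and `importlib.metadata` assumes Python 3.8 or later.

```python
# Sketch: report the installed versions of the packages seen in the traceback.
# ASSUMPTION: Python 3.8+ (importlib.metadata is in the standard library there).
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Example package list; adjust it to whatever the environment should contain.
report = installed_versions(["pyxpcm", "xarray", "dask", "scikit-learn"])
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```

Running this inside and outside the conda environment shows at a glance whether the interpreter in use is the one from env_pyxpcm.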

gmaze commented 1 year ago

glad you solved this !