[Bug, ML]: unhashable type: 'numpy.ndarray' on timeseries data

ilugid commented 1 year ago

Your Environment

Python version: python3.8
Operating system: Ubuntu 20.04.1 LTS
Lightwood version: lightwood==23.6.2.0
MindsDB==23.6.5.1
mindsdb-evaluator==0.0.9
mindsdb-sql==0.6.5
mindsdb-streams==0.1.1
numba==0.57.1
numpy==1.22.4
sktime==0.14.1

Describe your issue

Creating a model for time series data (with ORDER BY and WINDOW clauses) throws the error below.

TypeError: unhashable type: 'numpy.ndarray', raised at: /srv/mindsdb/lib/python3.8/site-packages/mindsdb/integrations/libs/ml_exec_base.py#135

How can we replicate it?

We used a dataset (CSV file) with about 4K entries, similar to the sample below.

col1, col2, col3, col4, col5
string 1 1, string 2 1, 2023-01-01, 246149, X
string 1 2, string 2 2, 2023-01-01, 246149, Y
string 1 1, string 2 1, 2023-01-01, 246149, Y

Create Model query

CREATE MODEL mindsdb.named_model
FROM files (
  SELECT col1, col2, col3, col4, col5 FROM named_table
  )
PREDICT col5
ORDER BY col3
GROUP BY col1, col2, col3, col4
WINDOW 3;

Stack Trace

Traceback (most recent call last):
  File "/srv/mindsdb/lib/python3.8/site-packages/mindsdb/integrations/libs/ml_exec_base.py", line 137, in learn_process
    ml_handler.create(target, df=training_data_df, args=problem_definition)
  File "/srv/mindsdb/lib/python3.8/site-packages/mindsdb/integrations/handlers/lightwood_handler/lightwood_handler/lightwood_handler.py", line 69, in create
    run_learn(
  File "/srv/mindsdb/lib/python3.8/site-packages/mindsdb/utilities/functions.py", line 60, in wrapper
    return func(*args, **kwargs)
  File "/srv/mindsdb/lib/python3.8/site-packages/mindsdb/integrations/handlers/lightwood_handler/lightwood_handler/functions.py", line 159, in run_learn
    run_fit(predictor_id, df, model_storage)
  File "/srv/mindsdb/lib/python3.8/site-packages/mindsdb/utilities/functions.py", line 60, in wrapper
    return func(*args, **kwargs)
  File "/srv/mindsdb/lib/python3.8/site-packages/mindsdb/integrations/handlers/lightwood_handler/lightwood_handler/functions.py", line 129, in run_fit
    raise e
  File "/srv/mindsdb/lib/python3.8/site-packages/mindsdb/integrations/handlers/lightwood_handler/lightwood_handler/functions.py", line 97, in run_fit
    predictor.learn(df)
  File "/srv/mindsdb/lib/python3.8/site-packages/lightwood/helpers/log.py", line 30, in wrap
    result = f(predictor, *args, **kw)
  File "/tmp/1142d8bfe875224ca7ac227122cabf08fcd8ae2d2bf4fe4716878995327929509.py", line 495, in learn
    data = self.preprocess(data)
  File "/srv/mindsdb/lib/python3.8/site-packages/lightwood/helpers/log.py", line 30, in wrap
    result = f(predictor, *args, **kw)
  File "/tmp/1142d8bfe875224ca7ac227122cabf08fcd8ae2d2bf4fe4716878995327929509.py", line 178, in preprocess
    data = transform_timeseries(
  File "/srv/mindsdb/lib/python3.8/site-packages/lightwood/data/timeseries_transform.py", line 196, in transform_timeseries
    df_gb_list = list(combined_df.groupby(tss.group_by))
  File "/srv/mindsdb/lib/python3.8/site-packages/pandas/core/groupby/ops.py", line 706, in get_iterator
    splitter = self._get_splitter(data, axis=axis)
  File "/srv/mindsdb/lib/python3.8/site-packages/pandas/core/groupby/ops.py", line 720, in _get_splitter
    ids, _, ngroups = self.group_info
  File "pandas/_libs/properties.pyx", line 37, in pandas._libs.properties.CachedProperty.__get__
  File "/srv/mindsdb/lib/python3.8/site-packages/pandas/core/groupby/ops.py", line 834, in group_info
    comp_ids, obs_group_ids = self._get_compressed_codes()
  File "/srv/mindsdb/lib/python3.8/site-packages/pandas/core/groupby/ops.py", line 858, in _get_compressed_codes
    group_index = get_group_index(self.codes, self.shape, sort=True, xnull=True)
  File "/srv/mindsdb/lib/python3.8/site-packages/pandas/core/groupby/ops.py", line 793, in codes
    return [ping.codes for ping in self.groupings]
  File "/srv/mindsdb/lib/python3.8/site-packages/pandas/core/groupby/ops.py", line 793, in <listcomp>
    return [ping.codes for ping in self.groupings]
  File "/srv/mindsdb/lib/python3.8/site-packages/pandas/core/groupby/grouper.py", line 622, in codes
    return self._codes_and_uniques[0]
  File "pandas/_libs/properties.pyx", line 37, in pandas._libs.properties.CachedProperty.__get__
  File "/srv/mindsdb/lib/python3.8/site-packages/pandas/core/groupby/grouper.py", line 690, in _codes_and_uniques
    codes, uniques = algorithms.factorize(
  File "/srv/mindsdb/lib/python3.8/site-packages/pandas/core/algorithms.py", line 763, in factorize
    codes, uniques = factorize_array(
  File "/srv/mindsdb/lib/python3.8/site-packages/pandas/core/algorithms.py", line 560, in factorize_array
    uniques, codes = table.factorize(
  File "pandas/_libs/hashtable_class_helper.pxi", line 5394, in pandas._libs.hashtable.PyObjectHashTable.factorize
  File "pandas/_libs/hashtable_class_helper.pxi", line 5310, in pandas._libs.hashtable.PyObjectHashTable._unique
TypeError: unhashable type: 'numpy.ndarray'
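
The last frames of the trace show pandas failing while it factorizes a GROUP BY key inside `combined_df.groupby(tss.group_by)`. As a minimal sketch, and not necessarily the root cause inside Lightwood, the same exception can be triggered in plain pandas when a grouping column holds `numpy.ndarray` objects in its cells. The frame, the column names and the `try_groupby` helper below are hypothetical and exist only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: the "group" column stores ndarray objects, which are not hashable.
df = pd.DataFrame({
    "group": [np.array([1, 2]), np.array([1, 2]), np.array([3, 4])],
    "value": [10, 20, 30],
})


def try_groupby(frame, keys):
    """Return the number of groups, or the error message if grouping fails."""
    try:
        # Mirrors the failing call in the trace: list(df.groupby(keys))
        return len(list(frame.groupby(keys)))
    except TypeError as exc:
        return str(exc)


# Grouping on ndarray cells fails because pandas has to hash every key.
error = try_groupby(df, ["group"])
print(error)

# Assumed workaround for this sketch: convert each array to a hashable tuple first.
fixed = df.copy()
fixed["group"] = fixed["group"].map(tuple)
n_groups = try_groupby(fixed, ["group"])
print(n_groups)
```

If a similar conversion is applied before grouping, the GROUP BY columns become hashable and the split can proceed. Whether that matches the actual fix in Lightwood is not stated in this thread.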

tomhuds commented 1 year ago

@ilugid Could you please provide an extract of the data? That would be very helpful for debugging this kind of error.

ilugid commented 1 year ago

labeled_data.csv

Please find attached a hashed out file. Please let me know if it works for you.

paxcema commented 1 year ago

@ilugid Thanks! Any chance you can share with us the output of DESCRIBE mindsdb.named_model.jsonai?

ilugid commented 1 year ago

export.csv

Please find attached the JSON output in CSV format.

paxcema commented 1 year ago

@ilugid can you double check the above belongs to the faulty model? There are no set columns for ORDER BY or GROUP BY in this JsonAI so it looks like a different model.

ilugid commented 1 year ago

export1.csv

Hi, sorry about that. Attached is the one with GROUP BY and WINDOW.

paxcema commented 1 year ago

Thanks for the data @ilugid, we were able to reproduce the issue. A fix should make its way into the release next week, PR for tracking is here.

mindsdb / lightwood