man-group / ArcticDB

ArcticDB is a high performance, serverless DataFrame database built for the Python Data Science ecosystem.
http://arcticdb.io

Incorrect error message when updating or appending with object that cannot be normalized #1293

Closed: poodlewars closed this issue 6 months ago

poodlewars commented 7 months ago

Describe the bug

When you attempt an update (or an append) with an input dataframe that we cannot normalize, you receive a misleading error message. We also fail to properly explain why the normalization failed.

There is no point continuing once normalization fails: we know that appends and updates will not work with a pickled object. We should bail out early with a helpful error message, along the lines sketched below.
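
A minimal sketch of the kind of early check meant here. This is illustrative only: check_normalizable is a hypothetical helper, not part of ArcticDB's internals; it mirrors the datetime conversion that _to_primitive attempts (see the traceback below) and fails fast with a column-specific error.

import pandas as pd

def check_normalizable(df: pd.DataFrame, operation: str) -> None:
    # Hypothetical pre-flight check: fail fast with a column-specific error
    # instead of letting normalization fail deep inside the write path.
    for col in df.columns:
        vals = df[col].values
        if vals.dtype != object:
            continue  # primitive dtypes (int, float, datetime64, ...) normalize fine
        sample = next((v for v in vals if v is not None), None)
        if isinstance(sample, pd.Timestamp):
            try:
                # The same conversion _to_primitive attempts (astype(DTN64_DTYPE)).
                vals.astype("datetime64[ns]")
            except (ValueError, TypeError) as ex:
                raise TypeError(
                    f"Cannot {operation}: column {col!r} mixes datetimes with other "
                    f"objects and cannot be normalized. Pickling is not supported "
                    f"for {operation}, so pickle_on_failure would not help."
                ) from ex

Called as check_normalizable(upd, "update") on the dataframe from the repro below, this raises immediately and names the offending column.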

Steps/Code to Reproduce

import pandas as pd
from arcticdb import Arctic

# Library setup assumed for a self-contained repro (LMDB backend, as in this report)
ac = Arctic("lmdb:///tmp/arcticdb_repro")
lib = ac.get_library("repro", create_if_missing=True)
df = pd.DataFrame(index=[pd.Timestamp("2024-01-01"), pd.Timestamp("2024-01-02")], data={"a": [pd.Timestamp("2023-01-01"), pd.Timestamp("2023-01-02")]})
lib.write("ts", df)
upd = pd.DataFrame(index=[pd.Timestamp("2024-01-01"), pd.Timestamp("2024-01-02")], data={"a": [pd.Timestamp("2023-01-01"), [1, 2, 3]]})  # second value is a list, not a datetime
lib.update("ts", upd)

This causes an ArcticException with the following logging and traceback:

In [49]: lib.update("ts", upd)
[2024-02-01 17:20:16.317] [arcticdb] [error] Could not normalize item of type: <class 'pandas.core.frame.DataFrame'> with any normalizer.You can set pickle_on_failure param to force pickling of this object instead.(Note: Pickling has worse performance and stricter memory limitations)
[2024-02-01 17:20:16.319] [arcticdb] [error] Error while normalizing symbol=ts, data=                              a
2024-01-01  2023-01-01 00:00:00
2024-01-02            [1, 2, 3], metadata=None, Could not convert object to NumPy datetime
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File ~/venvs/310/lib/python3.10/site-packages/arcticdb/version_store/_store.py:334, in NativeVersionStore._try_normalize(self, symbol, dataframe, metadata, pickle_on_failure, dynamic_strings, coerce_columns, **kwargs)
    332     else:
    333         # TODO: just for pandas dataframes for now.
--> 334         item, norm_meta = self._normalizer.normalize(
    335             dataframe,
    336             pickle_on_failure=pickle_on_failure,
    337             dynamic_strings=dynamic_strings,
    338             coerce_columns=coerce_columns,
    339             dynamic_schema=dynamic_schema,
    340             **kwargs,
    341         )
    342 except ArcticDbNotYetImplemented as ex:

File ~/venvs/310/lib/python3.10/site-packages/arcticdb/version_store/_normalization.py:1201, in CompositeNormalizer.normalize(self, item, string_max_len, pickle_on_failure, dynamic_strings, coerce_columns, **kwargs)
   1200 try:
-> 1201     return self._normalize(
   1202         item,
   1203         string_max_len=string_max_len,
   1204         dynamic_strings=dynamic_strings,
   1205         coerce_columns=coerce_columns,
   1206         **kwargs,
   1207     )
   1208 except Exception as ex:

File ~/venvs/310/lib/python3.10/site-packages/arcticdb/version_store/_normalization.py:1148, in CompositeNormalizer._normalize(self, item, string_max_len, dynamic_strings, coerce_columns, **kwargs)
   1147 log.debug("Normalizer used: {}".format(normalizer))
-> 1148 return normalizer(
   1149     item,
   1150     string_max_len=string_max_len,
   1151     dynamic_strings=dynamic_strings,
   1152     coerce_columns=coerce_columns,
   1153     **kwargs,
   1154 )

File ~/venvs/310/lib/python3.10/site-packages/arcticdb/version_store/_normalization.py:838, in DataFrameNormalizer.normalize(self, item, string_max_len, dynamic_strings, coerce_columns, **kwargs)
    837     columns_vals = [item.iloc[:, idx].values for idx in range(len(item.columns))]
--> 838 columns, column_vals = _normalize_columns(
    839     item.columns,
    840     columns_vals,
    841     norm_meta.df,
    842     coerce_columns=coerce_columns,
    843     dynamic_strings=dynamic_strings,
    844     string_max_len=string_max_len,
    845     dynamic_schema=kwargs.get("dynamic_schema", False),
    846     index_names=index_names,
    847 )
    848 if item.columns.name is not None:

File ~/venvs/310/lib/python3.10/site-packages/arcticdb/version_store/_normalization.py:467, in _normalize_columns(columns_names, columns_vals, norm_meta, coerce_columns, dynamic_strings, string_max_len, dynamic_schema, index_names)
    462     raise ArcticNativeException(
    463         "mismatch in columns_name and vals size in _normalize_columns {} != {}".format(
    464             len(columns_names_norm), len(columns_vals)
    465         )
    466     )
--> 467 column_vals = [
    468     _to_primitive(
    469         columns_vals[idx],
    470         columns_names_norm[idx],
    471         string_max_len=string_max_len,
    472         dynamic_strings=dynamic_strings,
    473         coerce_column_type=coerce_columns[str(columns_names[idx])] if coerce_columns else None,
    474         norm_meta=norm_meta,
    475     )
    476     for idx in range(len(columns_names_norm))
    477 ]
    478 return columns_names_norm, column_vals

File ~/venvs/310/lib/python3.10/site-packages/arcticdb/version_store/_normalization.py:468, in <listcomp>(.0)
    462     raise ArcticNativeException(
    463         "mismatch in columns_name and vals size in _normalize_columns {} != {}".format(
    464             len(columns_names_norm), len(columns_vals)
    465         )
    466     )
    467 column_vals = [
--> 468     _to_primitive(
    469         columns_vals[idx],
    470         columns_names_norm[idx],
    471         string_max_len=string_max_len,
    472         dynamic_strings=dynamic_strings,
    473         coerce_column_type=coerce_columns[str(columns_names[idx])] if coerce_columns else None,
    474         norm_meta=norm_meta,
    475     )
    476     for idx in range(len(columns_names_norm))
    477 ]
    478 return columns_names_norm, column_vals

File ~/venvs/310/lib/python3.10/site-packages/arcticdb/version_store/_normalization.py:234, in _to_primitive(arr, arr_name, dynamic_strings, string_max_len, coerce_column_type, norm_meta)
    233     log.debug("Removing all NaNs from column: {} of type datetime64", arr_name)
--> 234     return arr.astype(DTN64_DTYPE)
    235 elif _accept_array_string(sample):

ValueError: Could not convert object to NumPy datetime

During handling of the above exception, another exception occurred:

ArcticException                           Traceback (most recent call last)
Cell In[49], line 1
----> 1 lib.update("ts", upd)

File ~/venvs/310/lib/python3.10/site-packages/arcticdb/version_store/library.py:826, in Library.update(self, symbol, data, metadata, upsert, date_range, prune_previous_versions)
    759 def update(
    760     self,
    761     symbol: str,
   (...)
    766     prune_previous_versions=False,
    767 ) -> VersionedItem:
    768     """
    769     Overwrites existing symbol data with the contents of ``data``. The entire range between the first and last index
    770     entry in ``data`` is replaced in its entirety with the contents of ``data``, adding additional index entries if
   (...)
    824     2018-01-04       4
    825     """
--> 826     return self._nvs.update(
    827         symbol=symbol,
    828         data=data,
    829         metadata=metadata,
    830         upsert=upsert,
    831         date_range=date_range,
    832         prune_previous_version=prune_previous_versions,
    833     )

File ~/venvs/310/lib/python3.10/site-packages/arcticdb/version_store/_store.py:790, in NativeVersionStore.update(self, symbol, data, metadata, date_range, upsert, prune_previous_version, **kwargs)
    786     data = restrict_data_to_date_range_only(data, start=start, end=end)
    788 _handle_categorical_columns(symbol, data)
--> 790 udm, item, norm_meta = self._try_normalize(symbol, data, metadata, False, dynamic_strings, coerce_columns)
    792 if isinstance(item, NPDDataFrame):
    793     with _diff_long_stream_descriptor_mismatch(self):

File ~/venvs/310/lib/python3.10/site-packages/arcticdb/version_store/_store.py:347, in NativeVersionStore._try_normalize(self, symbol, dataframe, metadata, pickle_on_failure, dynamic_strings, coerce_columns, **kwargs)
    345 except Exception as ex:
    346     log.error("Error while normalizing symbol={}, data={}, metadata={}, {}", symbol, dataframe, metadata, ex)
--> 347     raise ArcticNativeException(str(ex))
    349 if norm_meta is None:
    350     raise ArcticNativeException("Cannot normalize input {}".format(symbol))

ArcticException: Could not convert object to NumPy datetime
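
The underlying failure is the arr.astype(DTN64_DTYPE) call in _to_primitive shown above: column "a" holds a pd.Timestamp next to a plain Python list, so the object array cannot be converted to datetime64. That step reproduces standalone, without ArcticDB (assuming DTN64_DTYPE is datetime64[ns]):

import numpy as np
import pandas as pd

# Object array mirroring column "a" of the update dataframe in the repro.
arr = np.array([pd.Timestamp("2023-01-01"), [1, 2, 3]], dtype=object)
arr.astype("datetime64[ns]")  # raises ValueError: Could not convert object to NumPy datetime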

Expected Results

[2024-02-01 17:20:16.317] [arcticdb] [error] Could not normalize item of type: <class 'pandas.core.frame.DataFrame'> with any normalizer.You can set pickle_on_failure param to force pickling of this object instead.(Note: Pickling has worse performance and stricter memory limitations)

is misleading for two reasons: it suggests setting pickle_on_failure, but pickled data cannot be appended to or updated (see above), and it does not match the ArcticException that is ultimately raised, which reports only "Could not convert object to NumPy datetime".

We should also explain better why the normalization failed, down to which column is at fault.
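
For example, the early bail-out could log and raise something along these lines (wording purely illustrative, not a proposed final message):

[arcticdb] [error] Cannot update symbol 'ts': column 'a' cannot be normalized because it mixes datetimes with other object types. Pickling is not supported for update/append, so pickle_on_failure will not help.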

OS, Python Version and ArcticDB Version

Python: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
OS: Linux-6.5.0-14-generic-x86_64-with-glibc2.35
ArcticDB: 4.2.1

Backend storage used

LMDB

Additional Context

cf. internal thread https://chat-man.slack.com/archives/CKD4V6N0H/p1706808901627019?thread_ts=1706791167.627659&cid=CKD4V6N0H

poodlewars commented 7 months ago

Relates to #91?