snowflakedb / snowpark-python

Snowflake Snowpark Python API

SNOW-1622029: Table.update() raises TypeError if table contains any VariantType columns #2067

Open djfletcher opened 1 month ago

djfletcher commented 1 month ago

Please answer these questions before submitting your issue. Thanks!

  1. What version of Python are you using?

Python 3.9.6 (default, Feb 3 2024, 15:58:27) [Clang 15.0.0 (clang-1500.3.9.4)]

  2. What are the Snowpark Python and pandas versions in the environment?

pandas==2.2.2 snowflake-snowpark-python==1.20.0

  3. What did you do?

I am updating a Table row in my tests. I can reproduce this with the same code as the Table.update documentation example (https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/latest/snowpark/api/snowflake.snowpark.Table.update), but with one extra variant column. Updating any column, even one that is not the VariantType column, raises a TypeError:

session = Session.builder.config("local_testing", True).create()
target_df = session.create_dataframe([(1, 1, {}),(1, 2, {}),(2, 1, {}),(2, 2, {}),(3, 1, {}),(3, 2, {})], schema=["a", "b", "c"])
target_df.write.save_as_table("my_table", mode="overwrite", table_type="temporary")
t = session.table("my_table")
t.update({"b": 0}, t["a"] == 1)

Here is the stacktrace:

venv/lib/python3.9/site-packages/snowflake/snowpark/table.py:470: in update
    result = new_df._internal_collect_with_tag(
venv/lib/python3.9/site-packages/snowflake/snowpark/_internal/telemetry.py:150: in wrap
    result = func(*args, **kwargs)
venv/lib/python3.9/site-packages/snowflake/snowpark/dataframe.py:644: in _internal_collect_with_tag_no_telemetry
    return self._session._conn.execute(
venv/lib/python3.9/site-packages/snowflake/snowpark/mock/_connection.py:559: in execute
    res = execute_mock_plan(plan, plan.expr_to_alias)
venv/lib/python3.9/site-packages/snowflake/snowpark/mock/_plan.py:1166: in execute_mock_plan
    matched_count = intermediate[target.columns].value_counts(dropna=False)[
venv/lib/python3.9/site-packages/pandas/core/frame.py:7509: in value_counts
    counts = self.groupby(subset, dropna=dropna, observed=False)._grouper.size()
venv/lib/python3.9/site-packages/pandas/core/groupby/ops.py:705: in size
    ids, _, ngroups = self.group_info
properties.pyx:36: in pandas._libs.properties.CachedProperty.__get__
    ???
venv/lib/python3.9/site-packages/pandas/core/groupby/ops.py:745: in group_info
    comp_ids, obs_group_ids = self._get_compressed_codes()
venv/lib/python3.9/site-packages/pandas/core/groupby/ops.py:764: in _get_compressed_codes
    group_index = get_group_index(self.codes, self.shape, sort=True, xnull=True)
venv/lib/python3.9/site-packages/pandas/core/groupby/ops.py:690: in codes
    return [ping.codes for ping in self.groupings]
venv/lib/python3.9/site-packages/pandas/core/groupby/ops.py:690: in <listcomp>
    return [ping.codes for ping in self.groupings]
venv/lib/python3.9/site-packages/pandas/core/groupby/grouper.py:691: in codes
    return self._codes_and_uniques[0]
properties.pyx:36: in pandas._libs.properties.CachedProperty.__get__
    ???
venv/lib/python3.9/site-packages/pandas/core/groupby/grouper.py:835: in _codes_and_uniques
    codes, uniques = algorithms.factorize(  # type: ignore[assignment]
venv/lib/python3.9/site-packages/pandas/core/algorithms.py:795: in factorize
    codes, uniques = factorize_array(
venv/lib/python3.9/site-packages/pandas/core/algorithms.py:595: in factorize_array
    uniques, codes = table.factorize(
pandas/_libs/hashtable_class_helper.pxi:7281: in pandas._libs.hashtable.PyObjectHashTable.factorize
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   ???
E   TypeError: unhashable type: 'dict'

pandas/_libs/hashtable_class_helper.pxi:7195: TypeError
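
The traceback bottoms out in pandas hashing: the local-testing plan executor calls value_counts() on a frame that still holds the raw Python dicts backing the variant column, and dicts are unhashable. A minimal pandas-only reproduction of that failure (independent of Snowpark, shown only to illustrate the mechanism):

import pandas as pd

# A frame shaped like the mock backend's intermediate result: the variant
# column "C" is materialized as Python dicts.
df = pd.DataFrame({"A": [1, 1], "B": [1, 2], "C": [{}, {}]})

# Raises TypeError: unhashable type: 'dict', matching the stack trace above.
df.value_counts(dropna=False)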

  4. What did you expect to see?

The in-memory table should have been updated without raising a TypeError.
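
For concreteness, a sketch of what a passing run would assert (assuming the documented UpdateResult fields; in the sample data two rows have a == 1):

result = t.update({"b": 0}, t["a"] == 1)

# Both rows with a == 1 should be updated, with no multi-join collisions.
assert result.rows_updated == 2
assert result.multi_joined_rows_updated == 0

# Every row with a == 1 should now carry b == 0.
assert all(row["B"] == 0 for row in t.filter(t["a"] == 1).collect())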

djfletcher commented 1 month ago

Per the documentation: https://docs.snowflake.com/en/developer-guide/snowpark/python/testing-locally#limitations

For Table.merge and Table.update, the session parameters ERROR_ON_NONDETERMINISTIC_UPDATE and ERROR_ON_NONDETERMINISTIC_MERGE must be set to False. This means that for multi-joins, one of the matched rows is updated.

Adding these params has no effect:

statement_params = {"ERROR_ON_NONDETERMINISTIC_UPDATE": False, "ERROR_ON_NONDETERMINISTIC_MERGE": False}
t.update({"b": 0}, t["a"] == 1, statement_params=statement_params)

E   TypeError: unhashable type: 'dict'
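
A possible interim workaround for local tests, until the mock backend handles variant columns on this path (a sketch under that assumption, not a verified fix), is to avoid Table.update and rewrite the table with a conditional expression instead:

from snowflake.snowpark.functions import col, lit, when

# Recompute "b" and overwrite the temporary table, sidestepping the
# Table.update() code path that trips over the dict-valued variant column.
updated = t.with_column("b", when(col("a") == 1, lit(0)).otherwise(col("b")))
updated.write.save_as_table("my_table", mode="overwrite", table_type="temporary")
t = session.table("my_table")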
sfc-gh-sghosh commented 1 month ago

Hello @djfletcher ,

Thanks for raising the issue. Yes, the issue occurs with local testing while updating the table; it works fine with a regular session. We will work on eliminating it.

session = Session.builder.config("local_testing", True).create()
target_df = session.create_dataframe([(1, 1, {}),(1, 2, {}),(2, 1, {}),(2, 2, {}),(3, 1, {}),(3, 2, {})], schema=["a", "b", "c"])
target_df.write.save_as_table("my_table", mode="overwrite", table_type="temporary")
t = session.table("my_table")
t.show()
t.update({"b": 0}, t["a"] == 1)
t.show()

Output and Error:

|"A" |"B" |"C" |

|1 |1 |{} | |1 |2 |{} | |2 |1 |{} | |2 |2 |{} | |3 |1 |{} | |3 |2 |{} |

TypeError: unhashable type: 'dict'

Regards, Sujan
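
For context, a minimal sketch of the kind of change that could resolve this in the local-testing plan executor (an assumption about the shape of a fix, not the actual patch): make variant values hashable before pandas groups them.

import json

import pandas as pd

def _as_hashable(value):
    # Hypothetical helper: dicts/lists (how the mock backend materializes
    # variant values) are unhashable, so JSON-serialize them first.
    if isinstance(value, (dict, list)):
        return json.dumps(value, sort_keys=True)
    return value

df = pd.DataFrame({"A": [1, 1], "B": [1, 2], "C": [{}, {}]})

# DataFrame.map (pandas >= 2.1) applies the conversion elementwise; with it,
# value_counts() no longer raises on the dict-valued column.
print(df.map(_as_hashable).value_counts(dropna=False))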