snowflakedb / snowpark-python

Snowflake Snowpark Python API

SNOW-1526571: Mock_iff not working if there are null values in the columns #1887

Closed. frederiksteiner closed this issue 3 weeks ago.

frederiksteiner commented 1 month ago

Please answer these questions before submitting your issue. Thanks!

  1. What version of Python are you using?

    3.11.8

  2. What operating system and processor architecture are you using?

    Linux-5.10.102.1-microsoft-standard-WSL2-x86_64-with-glibc2.35

  3. What are the component versions in the environment (pip freeze)?

    snowflake-connector-python==3.11.0
    snowflake-snowpark-python==1.19.0

  4. What did you do? Started a local testing session and created a DataFrame as follows:

    
    import pandas as pd

    from snowflake.snowpark import Session
    import snowflake.snowpark.functions as spf
    from snowflake.snowpark.window import Window

    conn_params = {
        "schema": self.schema,
        "local_testing": True,
        "timezone": gc.timezone_local,
        **kwargs,
    }
    session = Session.builder.configs(conn_params).create()

    data = [
        (1, 1, 1, None),
        (1, 1, 2, None),
        (1, 1, 3, 1),
        (1, 1, 4, None),
        (1, 1, 5, None),
        (1, 1, 6, 2),
        (1, 1, 7, None),
        (1, 1, 8, None),
    ]
    schema = ["COL1", "COL2", "COL3", "COLUMN_TO_FILL"]

    df = session.create_dataframe(data=data, schema=schema)
    window = Window.partition_by(["COL1", "COL2"]).order_by("COL3")
    lead = spf.lead(df.col("COLUMN_TO_FILL"), ignore_nulls=True).over(window)
    lag = spf.lag(df.col("COLUMN_TO_FILL"), ignore_nulls=True).over(window)

    max_lead_lag = spf.iff(lead > lag, lead, lag)
    df = df.with_column("MAX_LEAD_LAG", max_lead_lag)
    result = df.to_pandas()

    expected = pd.DataFrame(
        [
            (1, 1, 1, None, None),
            (1, 1, 2, None, None),
            (1, 1, 3, 1, None),
            (1, 1, 4, None, 2.0),
            (1, 1, 5, None, 2.0),
            (1, 1, 6, 2, 1.0),
            (1, 1, 7, None, 2.0),
            (1, 1, 8, None, 2.0),
        ],
        columns=schema + ["MAX_LEAD_LAG"],
    )
    pd.testing.assert_frame_equal(result, expected, check_dtype=False)


  5. What did you expect to see?

    That the assertion passes (evaluates to True), not fails.

The column "MAX_LEAD_LAG" is all Nones. The reason is that the lead and lag columns are NullType columns after the window functions are applied, and hence max_lead_lag is full of Nones. So it is not really a problem of mock_iff, but of the local-testing implementations of lead and lag returning NullType columns.
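A minimal way to observe this (a sketch reusing the df, lead, and lag objects from the reproduction above; LEAD_FILL and LAG_FILL are illustrative column names, not part of the original report):

    # Sketch: inspect the types the local testing framework infers for the
    # lead/lag window expressions.
    df_check = df.with_column("LEAD_FILL", lead).with_column("LAG_FILL", lag)
    for field in df_check.schema.fields:
        print(field.name, field.datatype)
    # Under local testing, LEAD_FILL / LAG_FILL reportedly come back as NullType,
    # which then makes get_coerce_result_type return None in mock_iff (quoted below).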

    @patch("iff")
    def mock_iff(condition: ColumnEmulator, expr1: ColumnEmulator, expr2: ColumnEmulator):
        assert isinstance(condition.sf_type.datatype, BooleanType)

        coerce_result = get_coerce_result_type(expr1.sf_type, expr2.sf_type)  #### This is None, since both lead and lag are NullType columns
        if all(condition) or all(~condition) or coerce_result is not None:
            res = ColumnEmulator(data=[None] * len(condition), dtype=object)
            expr1 = cast_column_to(expr1, coerce_result)  ##### This returns empty
            expr2 = cast_column_to(expr2, coerce_result)  ##### This returns empty
            res.where(condition, other=expr2, inplace=True)
            res.where([not x for x in condition], other=expr1, inplace=True)
            res.sf_type = coerce_result
            return res  ##### Now this is full of Nones too
        else:
            raise SnowparkLocalTestingException(
                f"[Local Testing] expr1 and expr2 have conflicting datatypes that cannot be coerced: {expr1.sf_type} <-> {expr2.sf_type}"
            )
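One conceivable direction, written only as a hedged sketch of the idea and not the change that was eventually merged: when get_coerce_result_type returns None because one operand is NullType, fall back to the other operand's type. The helper is passed in as a parameter here so the sketch stays self-contained; the sf_type objects are assumed to have the .datatype attribute used in the snippet above.

    # Sketch only, not the actual patch.
    from snowflake.snowpark.types import NullType

    def coerce_with_null_fallback(sf_type1, sf_type2, get_coerce_result_type):
        result = get_coerce_result_type(sf_type1, sf_type2)
        if result is None:
            # If exactly one side is NullType, fall back to the other side's type
            # instead of giving up, which is what ends up producing the all-None
            # iff result for the lead/lag columns.
            if isinstance(sf_type1.datatype, NullType):
                return sf_type2
            if isinstance(sf_type2.datatype, NullType):
                return sf_type1
        return result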
sfc-gh-sghosh commented 1 month ago

Hello @frederiksteiner ,

Thanks for raising the issue; we are checking and will update.

Regards, Sujan

sfc-gh-sghosh commented 1 month ago

Hello @frederiksteiner ,

I checked the code snippet. There is no issue with a regular session, but I am not able to reproduce the exact issue with local testing.

Using snowpark 3.19.0 and Python 3.11, the issue comes up when converting the Snowflake DataFrame to a pandas DataFrame and comparing it:

    FutureWarning: Mismatched null-like values None and nan found. In a future version,
    pandas equality-testing functions (e.g. assert_frame_equal) will consider these
    not-matching and raise.

    pd.testing.assert_frame_equal(result, expected, check_dtype=False, check_like=True)

    Result
       COL1  COL2  COL3  COLUMN_TO_FILL MAX_LEAD_LAG
    0     1     1     1             NaN         None
    1     1     1     2             NaN         None
    2     1     1     3             1.0         None
    3     1     1     4             NaN         None
    4     1     1     5             NaN         None
    5     1     1     6             2.0         None
    6     1     1     7             NaN         None
    7     1     1     8             NaN         None

    Expected
       COL1  COL2  COL3  COLUMN_TO_FILL  MAX_LEAD_LAG
    0     1     1     1             NaN           NaN
    1     1     1     2             NaN           NaN
    2     1     1     3             1.0           NaN
    3     1     1     4             NaN           2.0
    4     1     1     5             NaN           2.0
    5     1     1     6             2.0           1.0
    6     1     1     7             NaN           2.0
    7     1     1     8             NaN           2.0

    AssertionError: DataFrame.iloc[:, 4] (column name="MAX_LEAD_LAG") are different

    DataFrame.iloc[:, 4] (column name="MAX_LEAD_LAG") values are different (62.5 %)
    [index]: [0, 1, 2, 3, 4, 5, 6, 7]
    [left]:  [None, None, None, None, None, None, None, None]
    [right]: [nan, nan, nan, 2.0, 2.0, 1.0, 2.0, 2.0]
    At positional index 3, first diff: None != 2.0
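As a side note on the FutureWarning (independent of the underlying local-testing question): one way to make the comparison insensitive to the None-vs-NaN representational difference is to normalize nulls on both frames before comparing. normalize_nulls is just an illustrative helper, not part of the original report:

    import numpy as np
    import pandas as pd

    def normalize_nulls(frame: pd.DataFrame) -> pd.DataFrame:
        # Replace None (object-dtype nulls) with np.nan so that
        # assert_frame_equal compares like with like.
        return frame.where(frame.notna(), np.nan)

    pd.testing.assert_frame_equal(
        normalize_nulls(result),
        normalize_nulls(expected),
        check_dtype=False,
        check_like=True,
    )

Note that this only removes the null-representation mismatch; the assertion would still fail here because the local-testing MAX_LEAD_LAG column is genuinely all null.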

Could you check whether any other configuration was done with local_testing?

Regards, Sujan

sfc-gh-sghosh commented 4 weeks ago

Hello @frederiksteiner ,

Did you get a chance to check the previous update?

Regards, Sujan

frederiksteiner commented 4 weeks ago

Hey @sfc-gh-sghosh

I have not yet, since I do not have a laptop with me this week, but I will check on Monday and reply again.

Sorry for the inconvenience. Kind regards, Frederik

frederiksteiner commented 3 weeks ago

Hey @sfc-gh-sghosh You are right; the result and expected look as you indicated, and I didn't copy them correctly. They should look like yours:

    Result
       COL1  COL2  COL3  COLUMN_TO_FILL MAX_LEAD_LAG
    0     1     1     1             NaN         None
    1     1     1     2             NaN         None
    2     1     1     3             1.0         None
    3     1     1     4             NaN         None
    4     1     1     5             NaN         None
    5     1     1     6             2.0         None
    6     1     1     7             NaN         None
    7     1     1     8             NaN         None

    Expected
       COL1  COL2  COL3  COLUMN_TO_FILL  MAX_LEAD_LAG
    0     1     1     1             NaN           NaN
    1     1     1     2             NaN           NaN
    2     1     1     3             1.0           NaN
    3     1     1     4             NaN           2.0
    4     1     1     5             NaN           2.0
    5     1     1     6             2.0           1.0
    6     1     1     7             NaN           2.0
    7     1     1     8             NaN           2.0

But the issue still stands

sfc-gh-aling commented 3 weeks ago

Thanks for the PR, @frederiksteiner. I have merged my PR https://github.com/snowflakedb/snowpark-python/pull/1959, which is a duplicate of yours with some additional updates.

Sorry I couldn't merge yours due to a limitation of our CI/CD: tests cannot run on external PRs.

frederiksteiner commented 3 weeks ago

Alright, thanks for the new PR fixing this issue. I will close this issue then and also close my PR :)