snowflakedb / snowpark-python

Snowflake Snowpark Python API

SNOW-1544694: `create_dataframe` does not use `schema` parameter when passing pandas dataframe #1936

Open fwalsh-pl opened 1 month ago

fwalsh-pl commented 1 month ago

What is the current behavior?

It appears that when calling `create_dataframe` with a pandas DataFrame, the `schema` parameter that is passed is not actually used.

https://github.com/snowflakedb/snowpark-python/blob/c62df8121093e49d3b244add4826b90cb295f843/src/snowflake/snowpark/session.py#L2532

I'm not sure whether this was intended, so I didn't label this as a bug.

What is the desired behavior?

It would be nice to be able to set the schema when converting a pandas DataFrame to a Snowpark DataFrame.
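For example (hypothetical call, since `schema` is currently ignored for pandas input; `session` stands in for an existing snowflake.snowpark.Session):

import pandas as pd
from snowflake.snowpark.types import StructType, StructField, DecimalType

pdf = pd.DataFrame({"FLOAT_COL": [24.50, 12.75]})
schema = StructType([StructField("FLOAT_COL", DecimalType(20, 2), True)])

# Desired: the resulting Snowpark DataFrame should use DecimalType(20, 2)
# from `schema` instead of the DoubleType inferred from the pandas dtypes.
snow_df = session.create_dataframe(pdf, schema=schema)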

How would this improve snowflake-snowpark-python?

This would give API users more fine-grained control over their schemas, instead of always having the schema inferred from the pandas DataFrame.

References, Other Background

sfc-gh-sghosh commented 1 month ago

Hello @fwalsh-pl ,

Thanks for raising the issue. Could you share the code snippet where it's not honoring the schema?

Regards, Sujan

fwalsh-pl commented 1 month ago

Sure thing, thanks for the help @sfc-gh-sghosh !

You can see below that specifying the `schema` in `create_dataframe` has no effect on the Snowpark DataFrame's schema when passing in a pandas DataFrame. However, when passing in a list of tuples, the `schema` parameter does affect the Snowpark DataFrame's schema.

import pandas as pd
from datetime import datetime
from snowflake.snowpark.types import (
    StructType,
    StructField,
    TimestampType,
    TimestampTimeZone,
    IntegerType,
    DecimalType,
)
import pytz

# `sf_session` is an existing snowflake.snowpark.Session.
pdf = pd.DataFrame(
    {
        'DATE_COL': [
            datetime(2024, 1, 1, 8, 0, 0, tzinfo=pytz.timezone('UTC')),
            datetime(2024, 1, 2, 8, 0, 0, tzinfo=pytz.timezone('UTC')),
        ],
        'FLOAT_COL': [24.50, 12.75],
        'INT_COL': [1, 2],
    }
)

schema = StructType(
    [
        StructField("DATE_COL", TimestampType(TimestampTimeZone.NTZ), True),
        StructField("FLOAT_COL", DecimalType(20, 2), True),
        StructField("INT_COL", IntegerType(), True),
    ]
)

# Passing a pandas DataFrame: the schema parameter is silently ignored.
snow_df_w_schema = sf_session.create_dataframe(pdf, schema=schema)
snow_df_w_schema.printSchema()
# root
#  |-- "DATE_COL": TimestampType(tz=ltz) (nullable = True)
#  |-- "FLOAT_COL": DoubleType() (nullable = True)
#  |-- "INT_COL": LongType() (nullable = True)

# Same output without a schema, i.e. the schema parameter had no effect.
snow_df_no_schema = sf_session.create_dataframe(pdf)
snow_df_no_schema.printSchema()
# root
#  |-- "DATE_COL": TimestampType(tz=ltz) (nullable = True)
#  |-- "FLOAT_COL": DoubleType() (nullable = True)
#  |-- "INT_COL": LongType() (nullable = True)

# Passing a list of tuples: the schema parameter is honored.
list_tupes = list(pdf.itertuples(index=False))
snow_df_list_tupes = sf_session.create_dataframe(list_tupes, schema=schema)
snow_df_list_tupes.printSchema()
# root
#  |-- "DATE_COL": TimestampType(tz=ntz) (nullable = True)
#  |-- "FLOAT_COL": DecimalType(20, 2) (nullable = True)
#  |-- "INT_COL": LongType() (nullable = True)

Also, in the source code itself, it looks like the `schema` parameter is not used in the pandas branch: https://github.com/snowflakedb/snowpark-python/blob/c62df8121093e49d3b244add4826b90cb295f843/src/snowflake/snowpark/session.py#L2532

sfc-gh-sghosh commented 1 month ago

Hello @fwalsh-pl ,

Thanks for the code snippet. Checking, will update.

Pandas DataFrame, with and without schema (identical output):

root
 |-- "DATE_COL": TimestampType(tz=ltz) (nullable = True)
 |-- "FLOAT_COL": DoubleType() (nullable = True)
 |-- "INT_COL": LongType() (nullable = True)

Tuples with schema:

root
 |-- "DATE_COL": TimestampType(tz=ntz) (nullable = True)
 |-- "FLOAT_COL": DecimalType(20, 2) (nullable = True)
 |-- "INT_COL": LongType() (nullable = True)

Regards, Sujan

sfc-gh-sghosh commented 3 weeks ago

Hello @fwalsh-pl ,

We checked further; this is working as per the current design. The workaround is to create the DataFrame first and then convert the column data types. We haven't changed this yet, as it would be a breaking change for many customers.
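For illustration, a minimal untested sketch of that workaround, reusing sf_session and pdf from your snippet above:

from snowflake.snowpark.functions import col
from snowflake.snowpark.types import (
    DecimalType,
    IntegerType,
    TimestampType,
    TimestampTimeZone,
)

# Untested sketch: create the DataFrame with the inferred schema, then
# cast each column to the desired type afterwards.
snow_df = sf_session.create_dataframe(pdf)
snow_df = (
    snow_df.with_column(
        "DATE_COL", col("DATE_COL").cast(TimestampType(TimestampTimeZone.NTZ))
    )
    .with_column("FLOAT_COL", col("FLOAT_COL").cast(DecimalType(20, 2)))
    .with_column("INT_COL", col("INT_COL").cast(IntegerType()))
)
snow_df.printSchema()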

Regards, Sujan