snowflakedb / snowpark-python

Snowflake Snowpark Python API
Apache License 2.0

SNOW-886649: write_pandas inserts datetime64[ns] to Snowflake as an Invalid Date #991

Open dwelden opened 1 year ago

dwelden commented 1 year ago

Please answer these questions before submitting your issue. Thanks!

  1. What version of Python are you using?

    Python 3.10.5

  2. What operating system and processor architecture are you using?

    Windows-10-10.0.19044-SP0

  3. What are the component versions in the environment (pip freeze)? pipfreeze.txt

  4. What did you do?

    
    import pandas as pd
    from snowflake.snowpark import Session

    df = pd.read_sas(source_file, format='sas7bdat', encoding='cp1252')
    session.write_pandas(df, target_table, overwrite=overwrite)

    >>> df.info()
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 152130 entries, 0 to 152129
    Data columns (total 5 columns):
     #   Column              Non-Null Count   Dtype
    ---  ------              --------------   -----
     0   ProductKey          152130 non-null  object
     1   DistributionCenter  152130 non-null  object
     2   DATE                152130 non-null  datetime64[ns]
     3   Discount            152130 non-null  float64
     4   Revenue             151830 non-null  float64
    dtypes: datetime64[ns](1), float64(2), object(2)
    memory usage: 5.8+ MB

    >>> df
                                               ProductKey DistributionCenter       DATE  Discount  Revenue
    0       Javier's Deep Cleansing Hair Shampoo 2.718 dl               Cary 2013-02-10       0.0      0.0
    1       Javier's Deep Cleansing Hair Shampoo 2.718 dl               Cary 2013-02-17       0.0      0.0
    2       Javier's Deep Cleansing Hair Shampoo 2.718 dl               Cary 2013-02-24       0.0      0.0
    3       Javier's Deep Cleansing Hair Shampoo 2.718 dl               Cary 2013-03-03       0.0      0.0
    4       Javier's Deep Cleansing Hair Shampoo 2.718 dl               Cary 2013-03-10       0.0      0.0
    ...                                               ...                ...        ...       ...      ...
    152125            Sujatha's Sensitve skin Lotion 8 dl            Toronto 2015-02-15       0.0  27027.0
    152126            Sujatha's Sensitve skin Lotion 8 dl            Toronto 2015-02-22       0.0  28124.0
    152127            Sujatha's Sensitve skin Lotion 8 dl            Toronto 2015-03-01       0.0  25574.0
    152128            Sujatha's Sensitve skin Lotion 8 dl            Toronto 2015-03-08       0.0  25812.0
    152129            Sujatha's Sensitve skin Lotion 8 dl            Toronto 2015-03-15       0.0  25532.0

    [152130 rows x 5 columns]



5. What did you expect to see?

Expected the pandas datetime column to be written to Snowflake as a TIMESTAMP_NTZ. Instead, the dates are not loading properly:

PRODUCTKEY | DISTRIBUTIONCENTER | DATE | DISCOUNT | REVENUE
-- | -- | -- | -- | --
Cleanser | Cary | Invalid Date | 0 | 0
Shampoo | Cary | Invalid Date | 0 | 0
... | ... | ... | ... | ...

6. Can you set logging to DEBUG and collect the logs?

[sas2snow2.log](https://github.com/snowflakedb/snowpark-python/files/12261832/sas2snow2.log)
dwelden commented 1 year ago

The data set used is DCSKINPRODUCT.sas7bdat from SAS-Visual-Forecasting---sample-data-sets

DCSKINPRODUCT.zip

-- Query Snowflake to examine the invalid dates
select to_varchar("DATE") from dcskinproduct limit 20;

43113053-09-24 17:00:00.000
43132219-01-30 16:00:00.000
43151384-06-05 17:00:00.000
43170549-10-11 17:00:00.000
43189715-02-16 16:00:00.000
43208880-06-22 17:00:00.000
43228045-10-28 17:00:00.000
43247211-03-05 16:00:00.000
43266376-07-10 17:00:00.000
43285541-11-15 16:00:00.000
43304707-03-23 17:00:00.000
43323872-07-27 17:00:00.000
43343037-12-02 16:00:00.000
43362203-04-09 17:00:00.000
43381368-08-13 17:00:00.000
43400533-12-19 16:00:00.000
43419699-04-25 17:00:00.000
43438864-08-30 17:00:00.000
43458030-01-05 16:00:00.000
43477195-05-13 17:00:00.000

The data type datetime64[ns] is stored internally as an int64 count of nanoseconds since 1970-01-01. Given the year values Snowflake shows here, my suspicion is that write_pandas is interpreting that count as if it were milliseconds.

>>> df['DATE'][0], df['DATE'].astype('int64')[0]
(Timestamp('2013-02-10 00:00:00'), 1360454400000000000)
select to_varchar(timestampadd('milliseconds', 1360454400000000000, '1970-01-01'::timestamp));
43113053-09-25 00:00:00.000

select to_varchar(timestampadd('nanoseconds', 1360454400000000000, '1970-01-01'::timestamp));
2013-02-10 00:00:00.000
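The milliseconds hypothesis can also be checked without Snowflake. Here is a small sketch using plain arithmetic for the buggy reading, since a value this large overflows pandas' own Timedelta; it shows that treating the nanosecond count as milliseconds lands in roughly year 43 million, consistent with the invalid dates above.

```python
import pandas as pd

# A datetime64[ns] value is stored as int64 nanoseconds since 1970-01-01.
ns = pd.Timestamp('2013-02-10').value
print(ns)  # 1360454400000000000

# Correct reading: nanoseconds since the epoch round-trips to the date.
assert pd.Timestamp(ns, unit='ns') == pd.Timestamp('2013-02-10')

# Buggy reading: the same integer treated as milliseconds. A Timedelta
# this large overflows pandas, so approximate the resulting year by hand.
seconds_if_ms = ns / 1_000
years_if_ms = seconds_if_ms / (365.25 * 24 * 3600)
print(round(1970 + years_if_ms))  # roughly 43 million, like the query output
```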
dwelden commented 1 year ago

For now, I have the following workaround in place, but this should not be required.

    # Map datetime columns to string (needed because of bug in write_pandas)
    from pandas.api.types import is_datetime64_dtype

    for column in df.columns.to_list():
        if is_datetime64_dtype(df[column]):
            df[column] = df[column].dt.strftime('%Y-%m-%d %H:%M:%S.%f')
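As a sanity check, the workaround's effect can be seen on a toy frame (the column name and values here are illustrative, not from the original SAS data):

```python
import pandas as pd
from pandas.api.types import is_datetime64_dtype

# Illustrative frame with one datetime64[ns] column, as in the SAS data.
df = pd.DataFrame({'DATE': pd.to_datetime(['2013-02-10', '2015-03-15'])})
assert is_datetime64_dtype(df['DATE'])

# After the workaround, the column holds plain strings, which Snowflake
# parses into timestamps instead of misreading the raw int64 values.
df['DATE'] = df['DATE'].dt.strftime('%Y-%m-%d %H:%M:%S.%f')
print(df['DATE'][0])  # 2013-02-10 00:00:00.000000
```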
ericpettengill commented 12 months ago

Linking snowflake-connector-python#600 and comment. The new release of snowflake-connector-python adds a use_logical_type parameter.

sfc-gh-aalam commented 12 months ago

Closing this issue, as use_logical_type is working as expected. Please open a new issue if you encounter further problems.

dwelden commented 11 months ago

When will this parameter be available in Snowpark? It is not available in snowflake.snowpark.Session.write_pandas in Snowpark for Python version 1.9.0 or 1.10.0.

>>> session.write_pandas(df, target_table, use_logical_type=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: Session.write_pandas() got an unexpected keyword argument 'use_logical_type'
>>> from snowflake.snowpark.version import VERSION as snowpark_version
>>> print(snowpark_version)
(1, 10, 0)
sfc-gh-aalam commented 11 months ago

It is coming in the next release.

sfc-gh-aalam commented 11 months ago

My bad. I thought this was a snowflake-connector-python issue.

ameryisafreeelf commented 3 months ago

BUMP - I'm currently running into related issues; happy to open a new issue, but I assume this one is still open for a good reason. Thanks in advance for any eyes here.

Edit: Python deps

I'm successfully passing use_logical_type to write_pandas(), but at the Snowflake level this seems to allow the COPY from staged Parquet into Snowflake to write junk into tables. Most notably, I have the following case:

As a user, this is hard for me to triage because the staged Parquet files are difficult to peek into. In my opinion, the write_pandas command should not be able to write something like Invalid Date to Snowflake in a TIMESTAMP column. It'd be nice if Snowflake couldn't store that to begin with, but I assume there's a good reason for that, and it's a layer deeper than the issue at hand.

Is there a good workaround for this? My current fix is just to avoid using use_logical_type altogether.
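Until the root cause is settled, a defensive pre-write check is one alternative to dropping use_logical_type entirely. This is a sketch, not part of Snowpark: the function name and the date bounds are assumptions, chosen so that implausible timestamps fail fast on the client side instead of landing in Snowflake as Invalid Date.

```python
import pandas as pd

def assert_sane_dates(df: pd.DataFrame,
                      lo: str = '1900-01-01',
                      hi: str = '2100-01-01') -> None:
    """Raise if any datetime64[ns] column has values outside [lo, hi].

    Hypothetical guardrail: bounds are an assumed plausible business range.
    NaT values compare False on both sides and therefore pass.
    """
    lo_ts, hi_ts = pd.Timestamp(lo), pd.Timestamp(hi)
    for col in df.select_dtypes(include='datetime64[ns]').columns:
        bad = df[(df[col] < lo_ts) | (df[col] > hi_ts)]
        if not bad.empty:
            raise ValueError(f'{col}: {len(bad)} rows outside [{lo}, {hi}]')

# Usage sketch: validate, then write.
df = pd.DataFrame({'DATE': pd.to_datetime(['2013-02-10'])})
assert_sane_dates(df)  # passes silently for in-range dates
# session.write_pandas(df, target_table, use_logical_type=True)
```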