Closed dwelden closed 1 year ago
Forgot to include the sample data from the dataframe showing the date values expected:
>>> df
ProductKey DistributionCenter DATE Discount Revenue
0 Javier's Deep Cleansing Hair Shampoo 2.718 dl Cary 2013-02-10 0.0 0.0
1 Javier's Deep Cleansing Hair Shampoo 2.718 dl Cary 2013-02-17 0.0 0.0
2 Javier's Deep Cleansing Hair Shampoo 2.718 dl Cary 2013-02-24 0.0 0.0
3 Javier's Deep Cleansing Hair Shampoo 2.718 dl Cary 2013-03-03 0.0 0.0
4 Javier's Deep Cleansing Hair Shampoo 2.718 dl Cary 2013-03-10 0.0 0.0
... ... ... ... ... ...
152125 Sujatha's Sensitve skin Lotion 8 dl Toronto 2015-02-15 0.0 27027.0
152126 Sujatha's Sensitve skin Lotion 8 dl Toronto 2015-02-22 0.0 28124.0
152127 Sujatha's Sensitve skin Lotion 8 dl Toronto 2015-03-01 0.0 25574.0
152128 Sujatha's Sensitve skin Lotion 8 dl Toronto 2015-03-08 0.0 25812.0
152129 Sujatha's Sensitve skin Lotion 8 dl Toronto 2015-03-15 0.0 25532.0
[152130 rows x 5 columns]
Potentially a duplicate of https://github.com/snowflakedb/snowflake-connector-python/issues/600?
It may be related to #600, but I did not get the error message described there. It simply wrote invalid values to Snowflake.
Are all the data written as invalid dates? Can I get a small dataset to reproduce?
The data set used is DCSKINPRODUCT.sas7bdat from SAS-Visual-Forecasting---sample-data-sets
-- Query Snowflake to examine the invalid dates
select to_varchar("DATE") from dcskinproduct limit 20;
43113053-09-24 17:00:00.000
43132219-01-30 16:00:00.000
43151384-06-05 17:00:00.000
43170549-10-11 17:00:00.000
43189715-02-16 16:00:00.000
43208880-06-22 17:00:00.000
43228045-10-28 17:00:00.000
43247211-03-05 16:00:00.000
43266376-07-10 17:00:00.000
43285541-11-15 16:00:00.000
43304707-03-23 17:00:00.000
43323872-07-27 17:00:00.000
43343037-12-02 16:00:00.000
43362203-04-09 17:00:00.000
43381368-08-13 17:00:00.000
43400533-12-19 16:00:00.000
43419699-04-25 17:00:00.000
43438864-08-30 17:00:00.000
43458030-01-05 16:00:00.000
43477195-05-13 17:00:00.000
The data type datetime64[ns] is stored internally as an int64 count of nanoseconds since 1970-01-01. Given the year values Snowflake shows here, my suspicion is that the Snowflake connector is evaluating the value as if it were in milliseconds.
>>> df['DATE'][0], df['DATE'].astype('int64')[0]
(Timestamp('2013-02-10 00:00:00'), 1360454400000000000)
select to_varchar(timestampadd('milliseconds', 1360454400000000000, '1970-01-01'::timestamp));
43113053-09-25 00:00:00.000
select to_varchar(timestampadd('nanoseconds', 1360454400000000000, '1970-01-01'::timestamp));
2013-02-10 00:00:00.000
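The same unit mismatch can be reproduced in pandas alone, without Snowflake. A minimal sketch, using the int64 value 1360454400000000000 shown above for the first DATE row: read as nanoseconds it recovers 2013-02-10, while read as milliseconds it lands tens of millions of years in the future, far beyond what pandas' Timestamp can even represent.

```python
import pandas as pd

# int64 underlying Timestamp('2013-02-10'), in nanoseconds since the epoch
raw = 1360454400000000000

# Correct interpretation: nanoseconds since 1970-01-01
print(pd.to_datetime(raw, unit='ns'))  # 2013-02-10 00:00:00

# Wrong interpretation: the same number read as milliseconds overflows
# pandas' Timestamp range (year ~43 million, matching Snowflake's output)
try:
    pd.to_datetime(raw, unit='ms')
except pd.errors.OutOfBoundsDatetime as exc:
    print('OutOfBoundsDatetime:', exc)
```

Snowflake has no such range limit, so instead of raising it stores the absurd far-future timestamps seen in the query above.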
For now, I have the following workaround in place, but this should not be required.
from pandas.api.types import is_datetime64_dtype

# Map datetime columns to string (needed because of bug in write_pandas)
for column in df.columns.to_list():
    if is_datetime64_dtype(df[column]):
        df[column] = df[column].dt.strftime('%Y-%m-%d %H:%M:%S.%f')
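As a sanity check, the workaround above turns each datetime64 column into plain strings, which sidesteps the Parquet timestamp handling entirely. A minimal sketch on a small illustrative frame (the column name is just an example):

```python
import pandas as pd
from pandas.api.types import is_datetime64_dtype

# Tiny stand-in for the real dataframe
df = pd.DataFrame({'DATE': pd.to_datetime(['2013-02-10', '2013-02-17'])})

# Same workaround: render datetime columns as formatted strings
for column in df.columns.to_list():
    if is_datetime64_dtype(df[column]):
        df[column] = df[column].dt.strftime('%Y-%m-%d %H:%M:%S.%f')

print(df['DATE'].tolist())
# ['2013-02-10 00:00:00.000000', '2013-02-17 00:00:00.000000']
```

Snowflake can then parse these strings into TIMESTAMP_NTZ on load, at the cost of losing the native datetime dtype in transit.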
I see. IIRC there is an ongoing rollout of a feature that could affect how timestamps in Parquet files are interpreted, and write_pandas uses a Parquet file as an intermediate step. Let me check with @sfc-gh-yuliu .
I confirmed with @sfc-gh-yuliu that it is related to the new logical type (nanosecond). There is an ongoing effort to roll out this change, and the connector will need to support it with a new parameter (use_logical_type). Please stay tuned.
This has been fixed - please see linked ticket
Python version
Python 3.10.5
Operating system and processor architecture
Windows-10-10.0.19044-SP0
Installed packages
What did you do?
What did you expect to see?
Expected Pandas datetime column to be written to Snowflake as a TIMESTAMP_NTZ. Dates are not loading properly.
Can you set logging to DEBUG and collect the logs?