dwelden opened 1 year ago
The data set used is DCSKINPRODUCT.sas7bdat from SAS-Visual-Forecasting---sample-data-sets.
-- Query Snowflake to examine the invalid dates
select to_varchar("DATE") from dcskinproduct limit 20;
43113053-09-24 17:00:00.000
43132219-01-30 16:00:00.000
43151384-06-05 17:00:00.000
43170549-10-11 17:00:00.000
43189715-02-16 16:00:00.000
43208880-06-22 17:00:00.000
43228045-10-28 17:00:00.000
43247211-03-05 16:00:00.000
43266376-07-10 17:00:00.000
43285541-11-15 16:00:00.000
43304707-03-23 17:00:00.000
43323872-07-27 17:00:00.000
43343037-12-02 16:00:00.000
43362203-04-09 17:00:00.000
43381368-08-13 17:00:00.000
43400533-12-19 16:00:00.000
43419699-04-25 17:00:00.000
43438864-08-30 17:00:00.000
43458030-01-05 16:00:00.000
43477195-05-13 17:00:00.000
The data type datetime64[ns] is stored internally as an int64 count of nanoseconds since 1970-01-01. Given the year values Snowflake shows here, my suspicion is that `write_pandas` is evaluating the value as if it were a count of milliseconds.
>>> df['DATE'][0], df['DATE'].astype('int64')[0]
(Timestamp('2013-02-10 00:00:00'), 1360454400000000000)
select to_varchar(timestampadd('milliseconds', 1360454400000000000, '1970-01-01'::timestamp));
43113053-09-25 00:00:00.000
select to_varchar(timestampadd('nanoseconds', 1360454400000000000, '1970-01-01'::timestamp));
2013-02-10 00:00:00.000
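The same units mismatch can be reproduced in plain Python, without Snowflake. The int64 value below is the one printed above; the year formula is only a rough approximation using the average Gregorian year length, but it lands exactly on the first bad row:

```python
from datetime import datetime, timedelta

ns = 1360454400000000000  # int64 behind Timestamp('2013-02-10 00:00:00')

# Interpreted correctly as nanoseconds since the Unix epoch:
print(datetime(1970, 1, 1) + timedelta(microseconds=ns // 1000))
# 2013-02-10 00:00:00

# Interpreted as milliseconds, the offset is a million times too large;
# the implied year matches Snowflake's corrupted values above:
seconds_if_ms = ns / 1000
print(int(1970 + seconds_if_ms / 31556952))  # 31556952 s = avg Gregorian year
# 43113053
```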
For now, I have the following workaround in place, but this should not be required.
from pandas.api.types import is_datetime64_dtype

# Map datetime columns to string (needed because of bug in write_pandas)
for column in df.columns.to_list():
    if is_datetime64_dtype(df[column]):
        df[column] = df[column].dt.strftime('%Y-%m-%d %H:%M:%S.%f')
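As a self-contained sketch of that workaround on a toy frame (the real `df` comes from `read_sas`; this just shows the resulting string format that Snowflake can parse unambiguously):

```python
import pandas as pd
from pandas.api.types import is_datetime64_dtype

df = pd.DataFrame({
    "DATE": pd.to_datetime(["2013-02-10", "2013-02-17"]),
    "Revenue": [1.0, 2.0],
})

# Map datetime columns to string before handing the frame to write_pandas
for column in df.columns.to_list():
    if is_datetime64_dtype(df[column]):
        df[column] = df[column].dt.strftime('%Y-%m-%d %H:%M:%S.%f')

print(df["DATE"].tolist())
# ['2013-02-10 00:00:00.000000', '2013-02-17 00:00:00.000000']
```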
Linking snowflake-connector-python#600 and comment. The new release of snowflake-connector-python adds a `use_logical_type` param.
Closing this issue as `use_logical_type` is working as expected. Please open a new issue if you see more issues.
When will this parameter be available in Snowpark? The new parameter is not available in Snowpark for Python version 1.9.0 or 1.10.0 (snowflake.snowpark.Session.write_pandas).
>>> session.write_pandas(df, target_table, use_logical_type=True)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: Session.write_pandas() got an unexpected keyword argument 'use_logical_type'
>>> from snowflake.snowpark.version import VERSION as snowpark_version
>>> print(snowpark_version)
(1, 10, 0)
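Until the parameter lands in Snowpark, one defensive pattern is a small wrapper (hypothetical helper, not part of either library) that only forwards `use_logical_type` when the installed `Session.write_pandas` declares it, so older versions don't raise the `TypeError` above:

```python
import inspect

def write_pandas_compat(session, df, table_name, **kwargs):
    # Drop use_logical_type when this Snowpark build's write_pandas
    # does not declare the parameter (older versions raise TypeError)
    if "use_logical_type" not in inspect.signature(session.write_pandas).parameters:
        kwargs.pop("use_logical_type", None)
    return session.write_pandas(df, table_name, **kwargs)
```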
It is coming in the next release.
My bad. I thought this was a snowflake-connector-python issue.
BUMP - I'm currently running into related issues; happy to open a new issue, but I assume this one is still open for a good reason. Thanks in advance for any eyes here.
Edit: Python deps
I'm successfully passing `use_logical_type` to `write_pandas()`, but at the Snowflake level this seems to be allowing the COPY from staged Parquet -> Snowflake to write junk into tables. Most notably, I have the following case: a `datetime.datetime` that gets written to Parquet with `use_logical_type`. I believe this allows for Parquet-specific encoding (under the hood, storing the timestamp as an integer offset from a Unix epoch). `Invalid date` is being written to a column of type `TIMESTAMP_NTZ(9)`. As a user, this is hard for me to triage because the staged Parquet files are difficult for me to peek into. In my opinion, the write_pandas command should not be able to write something like `Invalid Date` to Snowflake in a TIMESTAMP column. It'd be nice if Snowflake couldn't store that to begin with, but I assume there's a good reason for that and it's a layer deeper than the issue at hand.
Is there a good workaround for this? My current fix is just to avoid using `use_logical_type` altogether.
Please answer these questions before submitting your issue. Thanks!
What version of Python are you using?
Python 3.10.5
What operating system and processor architecture are you using?
Windows-10-10.0.19044-SP0
What are the component versions in the environment (`pip freeze`)?
pipfreeze.txt
What did you do?
df = pd.read_sas(source_file, format='sas7bdat', encoding='cp1252')
session.write_pandas(df, target_table, overwrite=overwrite)
[152130 rows x 5 columns]