Closed frederiksteiner closed 6 days ago
Here in this Code Snippet the schema is inferred, which results in the following:
column_type_mapping = {'DT': 'TIMESTAMP_NTZ'}
since there are no timezones in a parquet file. Maybe one should update the dictionary for every timezoned timestamp to 'TIMESTAMP_TZ'
Hello @frederiksteiner ,
Thanks for raising the issue. Yes, we are able to reproduce the issue,
DataFrame created with timestamp: DT 0 2024-05-29 06:58:32.825147+02:00
2024-05-29 06:58:32.825147+02:00 -> Datetime generated in Python 2024-05-28 21:58:32.825147-07:00 -> Timestamp after writing/fetching to temporary table 2024-05-28 23:08:55.890000-07:00 -> Current Timestamp on Snowflake
The PR https://github.com/snowflakedb/snowflake-connector-python/pull/1954 is under review.
Regards, Sujan
Any updates on this?
@frederiksteiner thank you so much for putting a PR to fix this issue. I apologize this PR wasn't prioritized earlier. The change looks good to me. Just need elaboration for one line. If you could provide that, we will merge it ASAP
I checked with internal team and the conclusion is that parquet file do not allow represent TIMESTAMP_TZ
which we use as an intermediate representation between pandas and snowflake. We write the column into TIMESTAMP_LTZ
so as long as timezone for each row in pandas df is same and session TIMEZONE
is same as the timezone the data shown will be same.
with connector.connect(**snowflake_config) as con:
cnx.cursor().execute("alter session set TIMEZONE = 'Europe/Amsterdam'")
# create temporary table
r = write_pandas(
conn=cnx,
df=df,
table_name="tmp1".upper(),
auto_create_table=True,
create_temp_table=True,
use_logical_type=True,
)
r1 = cnx.cursor().execute("select * from tmp1").fetchall()
# fetch current timestamp on snowflake server
r2 = cnx.cursor().execute("select current_timestamp()").fetchall()
# display results
print(f"{str(df.DT[0])} -> Datetime generated in python")
print(f"{r1[0][0]} -> Incorrect timestamp after writing/fetching to temporary table")
print(f"{r2[0][0]} -> Current Timestamp on Snowflake")
this gives the result
2024-06-28 19:30:08.458500+02:00 -> Datetime generated in python
2024-06-28 19:30:08.458500+02:00 -> Incorrect timestamp after writing/fetching to temporary table
2024-06-29 02:30:10.423000+02:00 -> Current Timestamp on Snowflake
Ok, I see. I could maybe adapt the PR by converting the timezone in the pandas dataframe to the time zone of the session and then uploading it. That would probably work if the timezones are not identical right?
No. write_pandas
should not have this side-effect. It will cause a lot of confusion. A better suggestion is to set timezone explicitly for your use-case.
I tried the same again by setting the correct timezone, but that does not seem to work
# create pandas dataframe with current timestamp
df = pd.DataFrame({"DT": [datetime.now(tz=pytz.timezone("Europe/Amsterdam"))]})
with connector.connect(**snowflake_config) as con:
con.cursor().execute("alter session set TIMEZONE = 'Europe/Amsterdam'")
tzs = con.cursor().execute("SHOW PARAMETERS LIKE 'TIMEZONE' IN SESSION").fetchall()
# create temporary table
r = write_pandas(
conn=con,
df=df,
table_name="tmp1".upper(),
auto_create_table=True,
create_temp_table=True,
use_logical_type=True,
)
r1 = con.cursor().execute("select * from tmp1").fetchall()
# fetch current timestamp on snowflake server
r2 = con.cursor().execute("select current_timestamp()").fetchall()
# display results
print(f"{tzs[0][1]} -> Timezone of current session")
print(f"{str(df.DT[0])} -> Datetime generated in python")
print(f"{r1[0][0]} -> Incorrect timestamp after writing/fetching to temporary table")
print(f"{r2[0][0]} -> Current Timestamp on Snowflake")
And this returns the following:
Europe/Amsterdam -> Timezone of current session
2024-07-02 11:31:41.599519+02:00 -> Datetime generated in python
2024-07-02 09:31:41.599519 -> Incorrect timestamp after writing/fetching to temporary table
2024-07-02 11:31:47.592000+02:00 -> Current Timestamp on Snowflake
2024-06-28 19:30:08.458500+02:00 -> Datetime generated in python 2024-06-28 19:30:08.458500+02:00 -> Incorrect timestamp after writing/fetching to temporary table 2024-06-29 02:30:10.423000+02:00 -> Current Timestamp on Snowflake
This is still not correct though since all the three timestamps should be roughly the same
2024-06-28 19:30:08.458500+02:00 -> Datetime generated in python 2024-06-28 19:30:08.458500+02:00 -> Incorrect timestamp after writing/fetching to temporary table 2024-06-29 02:30:10.423000+02:00 -> Current Timestamp on Snowflake
This is still not correct though since all the three timestamps should be roughly the same
I generated my dataframes using datetime.now()
in a PST local timezone and then converted them into Europe/Amsterdam
. Probably not the best way but I should have included it earlier.
df = pd.DataFrame({"DT": [
datetime.now(),
datetime.now(),
]})
df['DT'] = pd.to_datetime(df['DT'], utc=True)
df['DT'] = df['DT'].dt.tz_convert('Europe/Amsterdam')
Now when I create dataframe with timezone included in it, I do see all timezones very close to each other.
df = pd.DataFrame({"DT": [datetime.now(tz=pytz.timezone("Europe/Amsterdam"))]})
cursor.execute("alter session set TIMEZONE = 'Europe/Amsterdam'")
# create temporary table
r = write_pandas(
conn=conn,
df=df,
table_name="tmp1".upper(),
auto_create_table=True,
table_type="temp",
use_logical_type=True,
)
r1 = cursor.execute("select * from tmp1").fetchall()
# fetch current timestamp on snowflake server
r2 = cursor.execute("select current_timestamp()").fetchall()
# display results
print(f"{str(df.DT[0])} -> Datetime generated in python")
print(f"{r1[0][0]} -> Incorrect timestamp after writing/fetching to temporary table")
print(f"{r2[0][0]} -> Current Timestamp on Snowflake")
2024-07-02 18:54:38.888995+02:00 -> Datetime generated in python
2024-07-02 18:54:38.888995+02:00 -> Incorrect timestamp after writing/fetching to temporary table
2024-07-02 18:54:40.904000+02:00 -> Current Timestamp on Snowflake
Thanks again for your answer, that looks a lot better now :)
I now did the exact same thing as you did, but I still get the wrong time (without timezone). I am not sure why this is only happening to me and do you might have an idea why this is happening?
with connector.connect(**snowflake_config) as con:
df = pd.DataFrame({"DT": [datetime.now(tz=pytz.timezone("Europe/Amsterdam"))]})
cursor = con.cursor()
cursor.execute("alter session set TIMEZONE = 'Europe/Amsterdam'")
# create temporary table
r = write_pandas(
conn=con,
df=df,
table_name="tmp1".upper(),
auto_create_table=True,
table_type="temp",
use_logical_type=True,
)
r1 = cursor.execute("select * from tmp1").fetchall()
# fetch current timestamp on snowflake server
r2 = cursor.execute("select current_timestamp()").fetchall()
# display results
print(f"{str(df.DT[0])} -> Datetime generated in python")
print(f"{r1[0][0]} -> Incorrect timestamp after writing/fetching to temporary table")
print(f"{r2[0][0]} -> Current Timestamp on Snowflake")
gives
2024-07-03 11:15:45.907769+02:00 -> Datetime generated in python
2024-07-03 09:15:45.907769 -> Incorrect timestamp after writing/fetching to temporary table
2024-07-03 11:15:50.897000+02:00 -> Current Timestamp on Snowflake
And when I create the table before writing it, the timestamp is also incorrect:
with connector.connect(**snowflake_config) as con:
table_name = "tmp1"
col_name = "DT"
create_sql = f"CREATE OR REPLACE TABLE {table_name} ({col_name} TIMESTAMP_TZ)"
drop_sql = f"DROP TABLE IF EXISTS {table_name}"
df = pd.DataFrame({"DT": [datetime.now(tz=pytz.timezone("Europe/Amsterdam"))]})
cursor = con.cursor()
cursor.execute("alter session set TIMEZONE = 'Europe/Amsterdam'")
cursor.execute(create_sql).fetchall()
# create temporary table
try:
r = write_pandas(
conn=con,
df=df,
table_name=table_name.upper(),
auto_create_table=False,
table_type="temp",
use_logical_type=True,
)
r1 = cursor.execute(f"select * from {table_name}").fetchall()
# fetch current timestamp on snowflake server
r2 = cursor.execute("select current_timestamp()").fetchall()
finally:
con.execute_string(drop_sql)
# display results
print(f"{str(df.DT[0])} -> Datetime generated in python")
print(f"{r1[0][0]} -> Incorrect timestamp after writing/fetching to temporary table")
print(f"{r2[0][0]} -> Current Timestamp on Snowflake")
This returns:
2024-07-03 12:20:22.137864+02:00 -> Datetime generated in python
2024-07-03 10:20:22.137864+02:00 -> Incorrect timestamp after writing/fetching to temporary table
2024-07-03 12:20:25.306000+02:00 -> Current Timestamp on Snowflake
Now the timestamp has a timezone, but it is not the correct one but 2 hours behind.
@frederiksteiner what is the version of python connector you are using? Can you upgrade to latest version and see if you see the same behavior?
With version 3.11.0 and the newest main branch.
First, I upgraded the version of my repo where I was using the snowflake connector. Since that didn't work, I also tried it on the newest commit on the main branch. What I did was I created new branch of the main branch here and added a main file with the above code.
Python version
3.11.8
Operating system and processor architecture
Linux-5.10.102.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Installed packages
What did you do?
What did you expect to see?
Similar to this issue here: 1687 Got this output:
And I expect the time stamp of the temporary table to have a timezone:
2024-05-23 15:13:42.961893+00:00
Can you set logging to DEBUG and collect the logs?