snowflakedb / snowflake-connector-python

Snowflake Connector for Python
https://pypi.python.org/pypi/snowflake-connector-python/
Apache License 2.0
568 stars 456 forks source link

SNOW-1444940: Write to pandas with datetime with timezone has no timezone in the resulting dataframe #1952

Closed frederiksteiner closed 6 days ago

frederiksteiner commented 1 month ago

Python version

3.11.8

Operating system and processor architecture

Linux-5.10.102.1-microsoft-standard-WSL2-x86_64-with-glibc2.35

Installed packages

pandas
snowflake
pytz

What did you do?

from snowflake.connector.pandas_tools import write_pandas
from snowflake import connector
import sys
import os
import pandas as pd
from datetime import datetime
import pytz

# snowflake connect configurations
snowflake_config = {
    "account": os.getenv("SF_ACCOUNT"),
    "user": os.getenv("SF_USER"),
    "role": os.getenv("SF_ROLE"),
    "password": os.getenv("SF_PASSWORD"),
    "warehouse": os.getenv("SF_WAREHOUSE"),
    "database": os.getenv("SF_DATABASE"),
    "schema": schema,
}

# create pandas dataframe with current timestamp
df = pd.DataFrame({"DT": [datetime.now(tz=pytz.timezone("Europe/Amsterdam"))]})

with connector.connect(**snowflake_config) as con:
    # create temporary table
    r = write_pandas(
        conn=cnx,
        df=df,
        table_name="tmp1".upper(),
        auto_create_table=True,
        create_temp_table=True,
        use_logical_type=True,
    )
    r1 = cnx.cursor().execute("select * from tmp1").fetchall()

    # fetch current timestamp on snowflake server
    r2 = cnx.cursor().execute("select current_timestamp()").fetchall()

# display results
print(f"{str(df.DT[0])}  -> Datetime generated in python")
print(f"{r1[0][0]}  -> Incorrect timestamp after writing/fetching to temporary table")
print(f"{r2[0][0]}  -> Current Timestamp on Snowflake")

What did you expect to see?

Similar to this issue here: 1687 Got this output:

2024-05-23 17:13:42.961893+02:00  -> Datetime generated in python
2024-05-23 15:13:42.961893 -> Incorrect timestamp after writing/fetching to temporary table
2024-05-23 15:13:46.099000+00:00  -> Current Timestamp on Snowflake

And I expect the time stamp of the temporary table to have a timezone: 2024-05-23 15:13:42.961893+00:00

Can you set logging to DEBUG and collect the logs?

import logging
import os

for logger_name in ('snowflake.connector',):
    logger = logging.getLogger(logger_name)
    logger.setLevel(logging.DEBUG)
    ch = logging.StreamHandler()
    ch.setLevel(logging.DEBUG)
    ch.setFormatter(logging.Formatter('%(asctime)s - %(threadName)s %(filename)s:%(lineno)d - %(funcName)s() - %(levelname)s - %(message)s'))
    logger.addHandler(ch)
frederiksteiner commented 1 month ago

Here in this Code Snippet the schema is inferred, which results in the following: column_type_mapping = {'DT': 'TIMESTAMP_NTZ'} since there are no timezones in a parquet file. Maybe one should update the dictionary for every timezoned timestamp to 'TIMESTAMP_TZ'

sfc-gh-sghosh commented 1 month ago

Hello @frederiksteiner ,

Thanks for raising the issue. Yes, we are able to reproduce the issue,

DataFrame created with timestamp: DT 0 2024-05-29 06:58:32.825147+02:00

2024-05-29 06:58:32.825147+02:00 -> Datetime generated in Python 2024-05-28 21:58:32.825147-07:00 -> Timestamp after writing/fetching to temporary table 2024-05-28 23:08:55.890000-07:00 -> Current Timestamp on Snowflake

The PR https://github.com/snowflakedb/snowflake-connector-python/pull/1954 is under review.

Regards, Sujan

frederiksteiner commented 3 weeks ago

Any updates on this?

sfc-gh-aalam commented 1 week ago

@frederiksteiner thank you so much for putting a PR to fix this issue. I apologize this PR wasn't prioritized earlier. The change looks good to me. Just need elaboration for one line. If you could provide that, we will merge it ASAP

sfc-gh-aalam commented 1 week ago

I checked with internal team and the conclusion is that parquet file do not allow represent TIMESTAMP_TZ which we use as an intermediate representation between pandas and snowflake. We write the column into TIMESTAMP_LTZ so as long as timezone for each row in pandas df is same and session TIMEZONE is same as the timezone the data shown will be same.

with connector.connect(**snowflake_config) as con:
    cnx.cursor().execute("alter session set TIMEZONE = 'Europe/Amsterdam'")

    # create temporary table
    r = write_pandas(
        conn=cnx,
        df=df,
        table_name="tmp1".upper(),
        auto_create_table=True,
        create_temp_table=True,
        use_logical_type=True,
    )
    r1 = cnx.cursor().execute("select * from tmp1").fetchall()

    # fetch current timestamp on snowflake server
    r2 = cnx.cursor().execute("select current_timestamp()").fetchall()

# display results
print(f"{str(df.DT[0])}  -> Datetime generated in python")
print(f"{r1[0][0]}  -> Incorrect timestamp after writing/fetching to temporary table")
print(f"{r2[0][0]}  -> Current Timestamp on Snowflake")

this gives the result

2024-06-28 19:30:08.458500+02:00  -> Datetime generated in python
2024-06-28 19:30:08.458500+02:00  -> Incorrect timestamp after writing/fetching to temporary table
2024-06-29 02:30:10.423000+02:00  -> Current Timestamp on Snowflake
frederiksteiner commented 1 week ago

Ok, I see. I could maybe adapt the PR by converting the timezone in the pandas dataframe to the time zone of the session and then uploading it. That would probably work if the timezones are not identical right?

sfc-gh-aalam commented 6 days ago

No. write_pandas should not have this side-effect. It will cause a lot of confusion. A better suggestion is to set timezone explicitly for your use-case.

frederiksteiner commented 6 days ago

I tried the same again by setting the correct timezone, but that does not seem to work

# create pandas dataframe with current timestamp
df = pd.DataFrame({"DT": [datetime.now(tz=pytz.timezone("Europe/Amsterdam"))]})

with connector.connect(**snowflake_config) as con:
    con.cursor().execute("alter session set TIMEZONE = 'Europe/Amsterdam'")
    tzs = con.cursor().execute("SHOW PARAMETERS LIKE 'TIMEZONE' IN SESSION").fetchall()
    # create temporary table
    r = write_pandas(
        conn=con,
        df=df,
        table_name="tmp1".upper(),
        auto_create_table=True,
        create_temp_table=True,
        use_logical_type=True,
    )
    r1 = con.cursor().execute("select * from tmp1").fetchall()

    # fetch current timestamp on snowflake server
    r2 = con.cursor().execute("select current_timestamp()").fetchall()

# display results
print(f"{tzs[0][1]}  -> Timezone of current session")
print(f"{str(df.DT[0])}  -> Datetime generated in python")
print(f"{r1[0][0]}  -> Incorrect timestamp after writing/fetching to temporary table")
print(f"{r2[0][0]}  -> Current Timestamp on Snowflake")

And this returns the following:

Europe/Amsterdam  -> Timezone of current session
2024-07-02 11:31:41.599519+02:00  -> Datetime generated in python
2024-07-02 09:31:41.599519  -> Incorrect timestamp after writing/fetching to temporary table
2024-07-02 11:31:47.592000+02:00  -> Current Timestamp on Snowflake
frederiksteiner commented 6 days ago
2024-06-28 19:30:08.458500+02:00  -> Datetime generated in python
2024-06-28 19:30:08.458500+02:00  -> Incorrect timestamp after writing/fetching to temporary table
2024-06-29 02:30:10.423000+02:00  -> Current Timestamp on Snowflake

This is still not correct though since all the three timestamps should be roughly the same

sfc-gh-aalam commented 5 days ago
2024-06-28 19:30:08.458500+02:00  -> Datetime generated in python
2024-06-28 19:30:08.458500+02:00  -> Incorrect timestamp after writing/fetching to temporary table
2024-06-29 02:30:10.423000+02:00  -> Current Timestamp on Snowflake

This is still not correct though since all the three timestamps should be roughly the same

I generated my dataframes using datetime.now() in a PST local timezone and then converted them into Europe/Amsterdam. Probably not the best way but I should have included it earlier.

df = pd.DataFrame({"DT": [
        datetime.now(),
        datetime.now(),
]})
df['DT'] = pd.to_datetime(df['DT'], utc=True)
df['DT'] = df['DT'].dt.tz_convert('Europe/Amsterdam')

Now when I create dataframe with timezone included in it, I do see all timezones very close to each other.

df = pd.DataFrame({"DT": [datetime.now(tz=pytz.timezone("Europe/Amsterdam"))]})
cursor.execute("alter session set TIMEZONE = 'Europe/Amsterdam'")

# create temporary table
r = write_pandas(
    conn=conn,
    df=df,
    table_name="tmp1".upper(),
    auto_create_table=True,
    table_type="temp",
    use_logical_type=True,
)
r1 = cursor.execute("select * from tmp1").fetchall()

# fetch current timestamp on snowflake server
r2 = cursor.execute("select current_timestamp()").fetchall()

# display results
print(f"{str(df.DT[0])}  -> Datetime generated in python")
print(f"{r1[0][0]}  -> Incorrect timestamp after writing/fetching to temporary table")
print(f"{r2[0][0]}  -> Current Timestamp on Snowflake")
2024-07-02 18:54:38.888995+02:00  -> Datetime generated in python
2024-07-02 18:54:38.888995+02:00  -> Incorrect timestamp after writing/fetching to temporary table
2024-07-02 18:54:40.904000+02:00  -> Current Timestamp on Snowflake
frederiksteiner commented 5 days ago

Thanks again for your answer, that looks a lot better now :)

I now did the exact same thing as you did, but I still get the wrong time (without timezone). I am not sure why this is only happening to me and do you might have an idea why this is happening?

with connector.connect(**snowflake_config) as con:
    df = pd.DataFrame({"DT": [datetime.now(tz=pytz.timezone("Europe/Amsterdam"))]})

    cursor = con.cursor()
    cursor.execute("alter session set TIMEZONE = 'Europe/Amsterdam'")
    # create temporary table
    r = write_pandas(
        conn=con,
        df=df,
        table_name="tmp1".upper(),
        auto_create_table=True,
        table_type="temp",
        use_logical_type=True,
    )
    r1 = cursor.execute("select * from tmp1").fetchall()

    # fetch current timestamp on snowflake server
    r2 = cursor.execute("select current_timestamp()").fetchall()

# display results
print(f"{str(df.DT[0])}  -> Datetime generated in python")
print(f"{r1[0][0]}  -> Incorrect timestamp after writing/fetching to temporary table")
print(f"{r2[0][0]}  -> Current Timestamp on Snowflake")

gives

2024-07-03 11:15:45.907769+02:00  -> Datetime generated in python
2024-07-03 09:15:45.907769  -> Incorrect timestamp after writing/fetching to temporary table
2024-07-03 11:15:50.897000+02:00  -> Current Timestamp on Snowflake

And when I create the table before writing it, the timestamp is also incorrect:

with connector.connect(**snowflake_config) as con:
    table_name = "tmp1"
    col_name = "DT"
    create_sql = f"CREATE OR REPLACE TABLE {table_name} ({col_name} TIMESTAMP_TZ)"
    drop_sql = f"DROP TABLE IF EXISTS {table_name}"
    df = pd.DataFrame({"DT": [datetime.now(tz=pytz.timezone("Europe/Amsterdam"))]})

    cursor = con.cursor()
    cursor.execute("alter session set TIMEZONE = 'Europe/Amsterdam'")
    cursor.execute(create_sql).fetchall()
    # create temporary table
    try:
        r = write_pandas(
            conn=con,
            df=df,
            table_name=table_name.upper(),
            auto_create_table=False,
            table_type="temp",
            use_logical_type=True,
        )
        r1 = cursor.execute(f"select * from {table_name}").fetchall()

        # fetch current timestamp on snowflake server
        r2 = cursor.execute("select current_timestamp()").fetchall()
    finally:
        con.execute_string(drop_sql)

# display results
print(f"{str(df.DT[0])}  -> Datetime generated in python")
print(f"{r1[0][0]}  -> Incorrect timestamp after writing/fetching to temporary table")
print(f"{r2[0][0]}  -> Current Timestamp on Snowflake")

This returns:

2024-07-03 12:20:22.137864+02:00  -> Datetime generated in python
2024-07-03 10:20:22.137864+02:00  -> Incorrect timestamp after writing/fetching to temporary table
2024-07-03 12:20:25.306000+02:00  -> Current Timestamp on Snowflake

Now the timestamp has a timezone, but it is not the correct one but 2 hours behind.

sfc-gh-aalam commented 4 days ago

@frederiksteiner what is the version of python connector you are using? Can you upgrade to latest version and see if you see the same behavior?

frederiksteiner commented 4 days ago

With version 3.11.0 and the newest main branch.

First, I upgraded the version of my repo where I was using the snowflake connector. Since that didn't work, I also tried it on the newest commit on the main branch. What I did was I created new branch of the main branch here and added a main file with the above code.