snowflakedb / snowflake-connector-python

Snowflake Connector for Python
https://pypi.python.org/pypi/snowflake-connector-python/
Apache License 2.0
585 stars 468 forks source link

SNOW-1362666: Multiple calls to `write_pandas` with the same global random seed can result in stage name collisions #1937

Closed scottfleming closed 4 months ago

scottfleming commented 5 months ago

Python version

Python 3.11.6 (main, Oct 8 2023, 05:06:43) [GCC 13.2.0]

Operating system and processor architecture

Linux-6.6.16-linuxkit-aarch64-with-glibc2.38

Installed packages

snowflake-connector-python==3.5.0

What did you do?

I call write_pandas in a loop, but there's another function from a different package in that same loop that resets the global seed to a specific state (I recognize this is poor form and am working on a fix for that separately).

The problem is that snowflake-connector-python's _create_temp_stage function calls random_string to generate temporary stage names, over which the user does not have control:

https://github.com/snowflakedb/snowflake-connector-python/blob/8cdca67faf3fa2a6d31a545d78326b76c31c4c64/src/snowflake/connector/pandas_tools.py#L95

random_string in turn relies on the random module:

https://github.com/snowflakedb/snowflake-connector-python/blob/8cdca67faf3fa2a6d31a545d78326b76c31c4c64/test/randomize.py#L30

so if the user sets the seed deterministically outside the scope of calling write_pandas then this will result in stage name collisions throwing an opaque error such as the following:

snowflake.connector.errors.ProgrammingError: 002002 (42710): 01b41633-0305-1f43-0030-3596065ff54f: SQL compilation error:
Object 'USERDB.USERSCHEMA."ziwewnsovq"' already exists.

which triggers on line 81 in _do_create_temp_stage: https://github.com/snowflakedb/snowflake-connector-python/blob/8cdca67faf3fa2a6d31a545d78326b76c31c4c64/src/snowflake/connector/pandas_tools.py#L81

What did you expect to see?

What I wish would have happened instead is that there would be one of the following:

  1. An approach for random generation of stage names that does not rely on the global random seed, such as uuid
  2. A more informative error message
  3. Automatic handling of collisions by eg deleting or overwriting an existing stage with the same name, with optional user control over which action to take (could be that the user actually wants to error out in these cases for whatever reason)
  4. A way for the user to directly override the randomly generated stage_name

Can you set logging to DEBUG and collect the logs?

No response

sfc-gh-sghosh commented 5 months ago

Hello @scottfleming ,

Thanks for raising the issue, we are looking into it, will update.

Regards, Sujan

sfc-gh-sghosh commented 4 months ago

Hello @scottfleming ,

I tried to reproduce the issue with Python connector 3.6.0 and Python 3.11, but it did not throw any collision with stage names.

Could you share the code snippet to reproduce the issue.

`import random import snowflake.connector import logging

for logger_name in ['snowflake.connector', 'botocore', 'boto3']: logger = logging.getLogger(logger_name) logger.setLevel(logging.DEBUG) ch = logging.FileHandler('/Users/sghosh/IdeaProjects/Python_prj/tmp/seed.log') ch.setLevel(logging.DEBUG) ch.setFormatter(logging.Formatter('%(asctime)s - %(threadName)s %(filename)s:%(lineno)d - %(funcName)s() - %(levelname)s - %(message)s')) logger.addHandler(ch)

def reset_seed(): random.seed(42)

def another_function(): reset_seed() # Reset seed before calling write_pandas

def simulate_data_processing(connection): cursor = connection.cursor() try: cursor.execute("CREATE or replace table my_table (column1 INT, column2 STRING)") data = [(1, 'a'), (2, 'b'), (3, 'c')]

    for row in data:
        cursor.execute("INSERT INTO my_table  VALUES (%s, %s)", row)

    connection.commit()  # Commit the transaction

    # Fetch values from the table
    cursor.execute("SELECT * FROM my_table")
    rows = cursor.fetchall()
    print("Values from the table:")
    for row in rows:
        print(row)

finally:
    cursor.close()

Main function

def main():

random.seed(123)
# Connect to Snowflake
conn = snowflake.connector.connect(
    user='xxx',
    password='xxxx',
    account='xxxx',
    warehouse='SUJAN_WH',
    database='SAMPLEDATABASE',
    schema='TEST'
)

for i in range(5):  # Repeat the process multiple times
    another_function()  # Call another function that resets the seed
    simulate_data_processing(conn)  # Simulate data processing

conn.close()  # Close the Snowflake connection

if name == "main": main() `

Regards, Sujan

sfc-gh-sghosh commented 4 months ago

Hello @scottfleming ,

Closing the issue as we did not receive any response, and the issue is not getting reproduced. You can reopen when able to reproduce and provide the required artifacts.

Regards, Sujan