Closed: bmwilly closed this issue 3 years ago.
Yes, it is because of that line you linked. I would like to make things case sensitive as the opposite would mean that some data could be inaccessible by the Python connector.
@sfc-gh-mkeller I think it would make sense to convert the names for the user (similar to what you do for the table name here). At the very least, there should be some kind of warning, as returning success from write_pandas while writing empty rows is very confusing.
Oh. My. God! I cannot believe something as silly as this (1) actually exists and (2) is not documented ANYWHERE. I've been banging my head against a wall for a week over this. Turns out, all it needs is a single character to be upper-case and it'll work. Of course, @bmwilly's workaround is preferred, but it just seems really arcane for this to even be a thing. At the very least, if this is not going to be fixed, it should be documented. Right now, there's neither.
Huge thanks to @bmwilly though! :)
> I would like to make things case sensitive as the opposite would mean that some data could be inaccessible by the Python connector.

Can this be done in the background? The end-user shouldn't have to suffer.
> I would like to make things case sensitive as the opposite would mean that some data could be inaccessible by the Python connector.
>
> Can this be done in the background? The end-user shouldn't have to suffer.

I don't see how that would be possible.

However, you are right. I'll document that this API is case sensitive.
@sfc-gh-mkeller - Honestly, I don't fully understand the issue. The column name "age" (for example) fails, but "AGE", "Age", "aGe", or any combination where at least one letter is capitalized, works. In Snowflake, at least in the UI, all column names (by default, and unless explicitly quoted by some process during table creation) appear in upper case anyway. So any column name like "Age" or "aGe" will work, and it only "fails" when it is entirely lower-case, like "age". So I think a safer implementation would be to simply handle it on the fly behind the scenes, right?
Besides, the default implementation of to_sql() doesn't really care, and the only reason we'd use pd_writer is to load larger dataframes into Snowflake, because the 16k chunksize limit is just not going to cut it if the dataframe has over a million records. Given the default behavior, this quirk can be seen as an inconsistency. Thoughts?
@akshayi1 from https://docs.snowflake.com/en/sql-reference/identifiers-syntax.html#unquoted-identifiers you can see the following:
> If an identifier is not enclosed in double quotes, it must begin with a letter or underscore (_) and cannot contain extended characters or blank spaces.
In general, always quoting identifiers is therefore safer. It would also mean that developers are aware of what their identifiers really are (because their identifiers wouldn't be silently upper-cased).
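The quoted identifier rule is what drives the whole thread. As a rough illustration, here is my own simplification of how Snowflake resolves names (the regex and helper are not connector code, just a sketch of the rule quoted above):

```python
import re

# Sketch of the unquoted-identifier rule: an unquoted identifier must start
# with a letter or underscore, may contain only letters, digits, '_' and '$',
# and is resolved upper-cased. A double-quoted identifier is resolved exactly
# as written. (Assumption: this mirrors the identifiers-syntax doc.)
UNQUOTED = re.compile(r"^[A-Za-z_][A-Za-z0-9_$]*$")

def resolved_identifier(name: str) -> str:
    """Return the identifier Snowflake would resolve this name to."""
    if len(name) >= 2 and name.startswith('"') and name.endswith('"'):
        return name[1:-1].replace('""', '"')  # quoted: case preserved
    if UNQUOTED.match(name):
        return name.upper()                   # unquoted: silently upper-cased
    raise ValueError(f"identifier {name!r} must be double-quoted")

print(resolved_identifier("age"))    # AGE -- why an all-lower-case pandas column misses
print(resolved_identifier('"age"'))  # age -- quoting keeps the exact case
```

This is why a DataFrame column named `age` ends up targeting a column that doesn't exist when the generated SQL quotes it as `"age"` against a table whose column is really `AGE`.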
> So, I think a safer implementation would be to simply handle it on the fly behind the scenes right?
We cannot always guess correctly what a developer means, so why don't we just not guess in the first place and ask developers to be specific?
My intention was only to give users more power, by allowing them to access all tables and not overwriting the names they provided to write_pandas. I'm sorry that this didn't work!
I see that my approach to slowly change the connector to be case sensitive is flawed (if it was to be done then it should be a sudden change across the whole connector). I will be reviewing this with my peers and making changes accordingly.
Thank you for the detailed feedback @sfc-gh-mkeller :) As things stand, even the example given in Snowflake's documentation - https://docs.snowflake.com/en/user-guide/python-connector-api.html#pd_writer - does not work. I did see your PR here - https://github.com/snowflakedb/snowflake-connector-python/pull/358 - but this only covers the code. Most people go to the documentation pages first, and that itself is a bit out of sync with the underlying issue. Would you or someone in your team be able to please make suitable amends to the doc page? Thank you!
@sfc-gh-mkeller - One more thing to consider, maybe, is this: when devs use to_sql, they don't care about the column naming convention. It just... works. However, the idea with pd_writer is that it can re-use the existing SQLAlchemy engine and merely invoke write_pandas, without having to change any code. But since this comes at the cost of changing the column names' casing, it can (and does :)) throw people off.
I know you've acknowledged as much, and I understand that the idea was to give users flexibility. To that end, if it is difficult to, as you said, guess correctly what a developer means, would a good workaround then be to simply double-quote all column names under the hood? That way, you'd never have to coerce casing of any kind, and the tables end up being whatever the developers want them to be.
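For what it's worth, the "quote everything under the hood" idea could be a one-line helper; `quote_ident` here is hypothetical, not an existing connector function:

```python
def quote_ident(name: str) -> str:
    # Hypothetical helper: wrap an identifier in double quotes (escaping any
    # embedded quotes by doubling them) so Snowflake preserves its exact case
    # instead of upper-casing it.
    return '"' + name.replace('"', '""') + '"'

# Building a quoted column list for a generated statement:
columns = ['name', 'balance']
select_list = ', '.join(quote_ident(c) for c in columns)
print(select_list)  # "name", "balance"
```

The trade-off, as discussed above, is that quoting makes the identifiers case sensitive on the Snowflake side, so it only works if table creation and data loading quote consistently.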
Just to illustrate what I mean, this is the example from the documentation page on Snowflake:

```python
import pandas
from snowflake.connector.pandas_tools import pd_writer

# Create a DataFrame containing data about customers
df = pandas.DataFrame([('Mark', 10), ('Luke', 20)], columns=['name', 'balance'])

# Specify that the to_sql method should use the pd_writer function
# to write the data from the DataFrame to the table named "customers"
# in the Snowflake database.
df.to_sql('customers', engine, index=False, method=pd_writer)
```
which won't work unless you change `df = pandas.DataFrame([('Mark', 10), ('Luke', 20)], columns=['name', 'balance'])` to, at the very least, `df = pandas.DataFrame([('Mark', 10), ('Luke', 20)], columns=['Name', 'Balance'])`. This is what I mean when I say that having even a single upper-case letter in the column name, regardless of quoting or of where in the string that letter is, allows the data to be written out as normal, and the NULL issue "goes away".
An interim solution for those of us who use Snowflake's default all-uppercase column names and don't want to type out each column name in caps (e.g. for 100+ columns like I have):

```python
# Create a DataFrame containing data about customers
df = pd.DataFrame([('Mark', 10), ('Luke', 20)], columns=['name', 'balance']).rename(lambda x: x.upper(), axis=1)
df
```

```
   NAME  BALANCE
0  Mark       10
1  Luke       20
```
Additionally, the case sensitivity issues impact the table name as well, e.g.:
```python
from snowflake.connector.pandas_tools import write_pandas

# Create the connection to the Snowflake database.
# cnx = snowflake.connector.connect(...)

cnx.execute_string("""create or replace table customers (
    row_id varchar(36) default uuid_string(),
    ts timestamp default current_timestamp,
    name text,
    balance int);""")

# Create a DataFrame containing data about customers
df = pd.DataFrame([('Mark', 10), ('Luke', 20)], columns=['name', 'balance']).rename(lambda x: x.upper(), axis=1)

# Write the data from the DataFrame to the table named "customers".
success, nchunks, nrows, _ = write_pandas(cnx, df, 'customers')
```
```
---------------------------------------------------------------------------
ProgrammingError                          Traceback (most recent call last)
<ipython-input> in <module>
     13
     14 # Write the data from the DataFrame to the table named "customers".
---> 15 success, nchunks, nrows, _ = write_pandas(cnx, df, 'customers')

/data/athena/.venv/lib/python3.8/site-packages/snowflake/connector/pandas_tools.py in write_pandas(conn, df, table_name, database, schema, chunk_size, compression, on_error, parallel)
    133 )
    134 logger.debug("copying into with '{}'".format(copy_into_sql))
--> 135 copy_results = cursor.execute(copy_into_sql, _is_internal=True).fetchall()
    136 cursor.close()
    137 return (all((e[1] == 'LOADED' for e in copy_results)),

/data/athena/.venv/lib/python3.8/site-packages/snowflake/connector/cursor.py in execute(self, command, params, timeout, _do_reset, _put_callback, _put_azure_callback, _put_callback_output_stream, _get_callback, _get_azure_callback, _get_callback_output_stream, _show_progress_bar, _statement_params, _is_internal, _no_results, _use_ijson, _is_put_get, _raise_put_get_error, _force_put_overwrite)
    592 'sfqid': self._sfqid
    593 }
--> 594 Error.errorhandler_wrapper(self.connection, self,
    595     ProgrammingError,
    596     errvalue)

/data/athena/.venv/lib/python3.8/site-packages/snowflake/connector/errors.py in errorhandler_wrapper(connection, cursor, error_class, error_value)
    121 if cursor is not None:
    122     cursor.messages.append((error_class, error_value))
--> 123 cursor.errorhandler(connection, cursor, error_class, error_value)
    124 return
    125 elif connection is not None:

/data/athena/.venv/lib/python3.8/site-packages/snowflake/connector/errors.py in default_errorhandler(connection, cursor, error_class, error_value)
     81     A Snowflake error.
     82     """
---> 83 raise error_class(
     84     msg=error_value.get('msg'),
     85     errno=error_value.get('errno'),

ProgrammingError: 001757 (42601): SQL compilation error:
Table '"customers"' does not exist
```
This (kind of)* works:

```python
from snowflake.connector.pandas_tools import write_pandas

# Create the connection to the Snowflake database.
# cnx = snowflake.connector.connect(...)

cnx.execute_string("""create or replace table customers (
    row_id varchar(36) default uuid_string(),
    ts timestamp default current_timestamp,
    name text,
    balance int);""")

# Create a DataFrame containing data about customers
df = pd.DataFrame([('Mark', 10), ('Luke', 20)], columns=['name', 'balance']).rename(lambda x: x.upper(), axis=1)

# Write the data from the DataFrame to the table named "customers".
success, nchunks, nrows, _ = write_pandas(cnx, df, 'CUSTOMERS')
```
*This does not insert default values for the row_id and ts columns. That issue is here: https://github.com/snowflakedb/snowflake-connector-python/issues/361
> Would you or someone in your team be able to please make suitable amends to the doc page?

I can do this, once we decide on exactly what the fix will be.
> I know you've acknowledged as much, and I understand that the idea was to give users flexibility. To that end, if it is difficult to, as you said, guess correctly what a developer means, would a good workaround then be to simply double-quote all column names under the hood? That way, you'd never have to coerce casing of any kind, and the tables end up being whatever the developers want them to be.
I apologize @akshayi1, I thought we were talking about table, schema, and database names until now.

Yes, so the differences are between write_pandas and snowflake-sqlalchemy.
In this case I'm proposing changing https://github.com/snowflakedb/snowflake-connector-python/blob/0215c0adadbbff6b33dfa32e52a1d0d887f557cf/pandas_tools.py#L128 from CASE_SENSITIVE to CASE_INSENSITIVE (https://docs.snowflake.com/en/sql-reference/sql/copy-into-table.html#copy-options-copyoptions), but I'm going to need to do some testing first.

The alternative would be to change how snowflake-sqlalchemy works, which would potentially break a lot more customers than changing write_pandas to be case-insensitive.
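To make the proposal concrete, here is a rough sketch of where that option sits in the COPY INTO statement that write_pandas builds. The statement below is heavily simplified (the real one also references the internal stage and the uploaded Parquet files), so treat it as illustrative only:

```python
def build_copy_into(table_name: str, case_insensitive: bool = True) -> str:
    # Simplified sketch of write_pandas's generated COPY INTO statement;
    # the only point here is the MATCH_BY_COLUMN_NAME copy option, which
    # controls whether Parquet column names must match table columns exactly.
    match_mode = "CASE_INSENSITIVE" if case_insensitive else "CASE_SENSITIVE"
    return (
        f'COPY INTO "{table_name}" '
        f"FILE_FORMAT=(TYPE=PARQUET COMPRESSION=auto) "
        f"MATCH_BY_COLUMN_NAME={match_mode} PURGE=TRUE"
    )

print(build_copy_into("CUSTOMERS"))
```

With CASE_INSENSITIVE, a Parquet column named `"name"` would load into a table column `NAME`, which is exactly the failure mode described in this thread.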
Making it "CASE_INSENSITIVE" would probably be for the best :) Hope that works out okay.
Please answer these questions before submitting your issue. Thanks!

1. What version of Python are you using (python --version)?
2. What operating system and processor architecture are you using (python -c 'import platform; print(platform.platform())')?
3. What are the component versions in the environment (pip list)?

Using write_pandas with a DataFrame with lower-cased column names causes empty rows to be written to the table. There are no warnings or exceptions. It is due to https://github.com/snowflakedb/snowflake-connector-python/blob/master/pandas_tools.py#L122

My workaround right now is to upper-case the column names (e.g. df.rename(lambda x: x.upper(), axis=1)) before calling write_pandas, but this is just a hack.
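Until the connector changes, that workaround can be wrapped in a tiny helper. `uppercased` is my own name for it, and the write_pandas call is shown commented out since it needs a live Snowflake connection:

```python
import pandas as pd

def uppercased(df: pd.DataFrame) -> pd.DataFrame:
    # Return a copy of the frame with column names upper-cased so they match
    # Snowflake's default unquoted (upper-case) identifiers, which makes
    # write_pandas actually fill the rows instead of writing NULLs.
    return df.rename(columns=str.upper)

df = pd.DataFrame([('Mark', 10), ('Luke', 20)], columns=['name', 'balance'])
print(list(uppercased(df).columns))  # ['NAME', 'BALANCE']

# With a real connection (hypothetical cnx), the call would look like:
# success, nchunks, nrows, _ = write_pandas(cnx, uppercased(df), 'CUSTOMERS')
```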