snowflakedb / snowflake-connector-python

Snowflake Connector for Python
https://pypi.python.org/pypi/snowflake-connector-python/
Apache License 2.0

SNOW-170139: CASE_SENSITIVE in write_pandas causes silent failures #329

Closed (bmwilly closed this issue 3 years ago)

bmwilly commented 4 years ago

Please answer these questions before submitting your issue. Thanks!

  1. What version of Python are you using (python --version)?
Python 3.8.2
  2. What operating system and processor architecture are you using (python -c 'import platform; print(platform.platform())')?
macOS-10.15.5-x86_64-i386-64bit
  3. What are the component versions in the environment (pip list)?
Package                      Version             Location
---------------------------- ------------------- -------------------------------------------------------
agate                        1.6.1
alembic                      1.4.2
amazon-kclpy                 2.0.1
ansiwrap                     0.8.4
appdirs                      1.4.3
appnope                      0.1.0
arrow                        0.15.6
arviz                        0.8.3
asn1crypto                   1.3.0
async-generator              1.10
atpublic                     1.0
attrs                        19.3.0
autograd                     1.3
autograd-gamma               0.4.2
autopep8                     1.5.1
aws-mfa                      0.0.12
awscli                       1.18.75
azure-common                 1.1.25
azure-core                   1.6.0
azure-nspkg                  3.0.2
azure-storage-blob           12.3.2
azure-storage-common         0.37.1
azure-storage-nspkg          3.1.0
b2                           1.4.2
b2sdk                        0.1.8
Babel                        2.8.0
backcall                     0.2.0
binaryornot                  0.4.4
black                        19.10b0
bleach                       3.1.5
boto                         2.49.0
boto3                        1.11.17
botocore                     1.14.17
Brotli                       1.0.7
brotlipy                     0.7.0
cached-property              1.5.1
cachetools                   4.1.0
catboost                     0.22
certifi                      2020.4.5.2
cffi                         1.13.2
cftime                       1.1.3
chardet                      3.0.4
click                        7.1.1
cloudpickle                  1.4.1
cmdstanpy                    0.4.0
colorama                     0.4.3
conda                        4.8.3
conda-package-handling       1.6.0
configobj                    5.0.6
configparser                 5.0.0
convertdate                  2.2.1
convoys                      0.1.7
cookiecutter                 1.7.0
cryptography                 2.9.2
cycler                       0.10.0
Cython                       0.29.20
cytoolz                      0.10.1
dash                         1.12.0
dash-bootstrap-components    0.10.1
dash-core-components         1.10.0
dash-daq                     0.5.0
dash-html-components         1.0.3
dash-renderer                1.4.1
dash-table                   4.7.0
dask                         2.18.1
databricks-cli               0.9.1
decorator                    4.4.2
defusedxml                   0.6.0
distro                       1.4.0
docker                       4.2.1
docker-pycreds               0.4.0
docutils                     0.15.2
dpath                        2.0.1
emcee                        3.0.2
entrypoints                  0.3
ephem                        3.7.7.1
escapism                     1.0.1
fbprophet                    0.6
flake8                       3.7.9
flaky                        3.6.1
Flask                        1.1.2
Flask-Compress               1.5.0
flatten-json                 0.1.7
flufl.lock                   3.2
fsspec                       0.7.4
funcsigs                     1.0.2
funcy                        1.14
future                       0.18.2
futures                      3.1.1
gitdb                        4.0.5
GitPython                    3.1.3
google-api-core              1.17.0
google-api-python-client     1.8.3
google-auth                  1.14.3
google-auth-httplib2         0.0.3
google-auth-oauthlib         0.4.1
googleads                    24.0.0
googleapis-common-protos     1.51.0
gorilla                      0.3.0
grandalf                     0.6
graphviz                     0.13.2
gspread                      3.6.0
gspread-pandas               2.2.3
gunicorn                     20.0.4
holidays                     0.10.2
hologram                     0.0.7
httplib2                     0.17.3
humanize                     2.2.0
idna                         2.9
ijson                        2.6.1
imagecodecs                  2020.5.30
imageio                      2.8.0
importlib-metadata           1.6.1
inflect                      3.0.2
ipdb                         0.12.3
ipykernel                    5.3.0
ipython                      7.15.0
ipython-genutils             0.2.0
ipywidgets                   7.5.1
isodate                      0.6.0
itsdangerous                 1.1.0
jedi                         0.17.0
Jinja2                       2.11.2
jinja2-time                  0.2.0
jmespath                     0.10.0
joblib                       0.15.1
json-rpc                     1.13.0
json5                        0.9.4
jsonpath-ng                  1.5.1
jsonschema                   3.2.0
jupyter-client               6.1.3
jupyter-console              6.1.0
jupyter-core                 4.6.3
jupyter-repo2docker          0.11.0
jupyterlab                   2.1.0
jupyterlab-code-formatter    1.3.1
jupyterlab-server            1.1.5
kafka-python                 1.4.7
keyring                      21.1.1
kiwisolver                   1.2.0
korean-lunar-calendar        0.2.1
leather                      0.3.3
lifelines                    0.24.4
lime                         0.2.0.0
Logbook                      1.5.3
logfury                      0.1.2
LunarCalendar                0.0.9
lxml                         4.5.1
Mako                         1.1.1
MarkupSafe                   1.1.1
matplotlib                   3.2.1
mccabe                       0.6.1
minimal-snowplow-tracker     0.0.2
mistune                      0.8.4
mlflow                       1.7.2
more-itertools               8.4.0
msrest                       0.6.16
mypy                         0.770
mypy-extensions              0.4.3
nanotime                     0.5.2
nb-conda-kernels             2.2.3
nbclient                     0.4.0
nbconvert                    5.6.1
nbdime                       2.0.0
nbformat                     5.0.6
nest-asyncio                 1.3.3
netCDF4                      1.5.3
networkx                     2.4
notebook                     6.0.3
numpy                        1.18.1
oauthlib                     3.1.0
olefile                      0.46
oscrypto                     1.2.0
packaging                    20.4
pandas                       1.0.5
pandocfilters                1.4.2
papermill                    2.1.0
parsedatetime                2.6
parso                        0.7.0
pathspec                     0.8.0
patsy                        0.5.1
pexpect                      4.8.0
pickleshare                  0.7.5
Pillow                       7.1.2
pip                          20.1.1
pipdeptree                   0.13.2
pkginfo                      1.5.0.1
plotly                       4.6.0
pluggy                       0.13.1
ply                          3.11
poyo                         0.5.0
progressbar2                 3.47.0
prometheus-client            0.8.0
prometheus-flask-exporter    0.13.0
prompt-toolkit               3.0.5
protobuf                     3.12.3
psutil                       5.7.0
psycopg2                     2.8.5
psycopg2-binary              2.8.5
ptyprocess                   0.6.0
py                           1.8.2
pyarrow                      0.17.1
pyasn1                       0.4.8
pyasn1-modules               0.2.8
pycodestyle                  2.5.0
pycosat                      0.6.3
pycparser                    2.20
pycryptodomex                3.9.7
pydist-cli                   0.1.1
pydot                        1.4.1
pyflakes                     2.1.1
Pygments                     2.6.1
pygtrie                      2.3.2
pyjavaproperties             0.7
PyJWT                        1.7.1
PyMeeus                      0.3.7
pyOpenSSL                    19.1.0
pyparsing                    2.4.7
pyrsistent                   0.16.0
PySocks                      1.7.1
pystan                       2.19.1.1
pytest                       5.4.1
python-crontab               2.5.1
python-dateutil              2.8.1
python-dotenv                0.13.0
python-editor                1.0.4
python-json-logger           0.1.11
python-slugify               4.0.0
python-utils                 2.3.0
pytimeparse                  1.1.8
pytz                         2020.1
PyWavelets                   1.1.1
PyYAML                       5.3.1
pyzmq                        19.0.1
querystring-parser           1.2.4
readme-renderer              24.0
regex                        2020.6.8
requests                     2.23.0
requests-oauthlib            1.3.0
requests-toolbelt            0.9.1
retrying                     1.3.3
rsa                          3.4.2
ruamel-yaml                  0.15.80
ruamel.yaml                  0.16.10
ruamel.yaml.clib             0.2.0
s3fs                         0.4.2
s3transfer                   0.3.3
scikit-image                 0.17.2
scikit-learn                 0.22.2.post1
scipy                        1.4.1
seaborn                      0.10.0
semver                       2.9.1
Send2Trash                   1.5.0
setuptools                   47.3.1.post20200616
setuptools-git               1.2
shap                         0.35.0
shortuuid                    1.0.1
simplejson                   3.17.0
six                          1.15.0
smmap                        3.0.4
snowflake-connector-python   2.2.7
snowflake-sqlalchemy         1.2.1
SQLAlchemy                   1.3.17
sqlparse                     0.3.1
statsmodels                  0.11.1
tabulate                     0.8.7
tenacity                     6.2.0
terminado                    0.8.3
testpath                     0.4.4
text-unidecode               1.3
texttable                    1.6.2
textwrap3                    0.9.2
tifffile                     2020.6.3
toml                         0.10.1
toolz                        0.10.0
torch                        1.4.0
torchvision                  0.5.0
tornado                      6.0.4
tqdm                         4.46.1
traitlets                    4.3.3
treelib                      1.6.1
twine                        3.1.1
typed-ast                    1.4.1
typing-extensions            3.7.4.2
uritemplate                  3.0.1
urllib3                      1.25.9
voluptuous                   0.11.7
wcwidth                      0.2.4
webencodings                 0.5.1
websocket-client             0.57.0
Werkzeug                     1.0.1
wheel                        0.34.2
whichcraft                   0.6.1
widgetsnbextension           3.5.1
xarray                       0.15.1
xmltodict                    0.12.0
zc.lockfile                  2.0
zeep                         3.4.0
zipp                         3.1.0
  4. What did you do?

Using write_pandas with a DataFrame whose column names are lower-cased causes empty rows to be written to the table. There are no warnings or exceptions. It is caused by https://github.com/snowflakedb/snowflake-connector-python/blob/master/pandas_tools.py#L122


My workaround right now is to

df = df.copy()
df.columns = [c.upper() for c in df.columns]

before calling write_pandas, but this is just a hack.
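
If it helps, the workaround can be wrapped in a small helper so it doesn't have to be repeated at every call site (just a sketch; write_pandas_upper is a made-up name and write_pandas itself is unchanged):

import pandas as pd
from snowflake.connector.pandas_tools import write_pandas

def write_pandas_upper(conn, df: pd.DataFrame, table_name: str, **kwargs):
    # Upper-case the column names so they match Snowflake's default
    # (unquoted, upper-cased) column identifiers before delegating.
    df = df.copy()
    df.columns = [c.upper() for c in df.columns]
    return write_pandas(conn, df, table_name, **kwargs)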

sfc-gh-mkeller commented 4 years ago

Yes, it is because of that line you linked. I would like to make things case sensitive as the opposite would mean that some data could be inaccessible by the Python connector.

bmwilly commented 4 years ago

@sfc-gh-mkeller I think it would make sense to convert the names for the user (similar to what you do for the table name here). At the very least, there should be some kind of warning, since returning success from write_pandas while writing empty rows is very confusing.
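
For example, a pre-flight check along these lines could surface the mismatch instead of silently loading NULLs (purely a hypothetical sketch, not connector code; it relies only on the standard DB-API cursor.description):

import warnings

def warn_on_column_mismatch(conn, df, table_name):
    # Fetch zero rows just to learn the target table's column names.
    cur = conn.cursor()
    cur.execute(f'SELECT * FROM {table_name} LIMIT 0')
    table_columns = {col[0] for col in cur.description}
    cur.close()
    unmatched = set(df.columns) - table_columns
    if unmatched:
        warnings.warn(
            f"Columns {sorted(unmatched)} do not exactly match any column of {table_name}; "
            "with MATCH_BY_COLUMN_NAME=CASE_SENSITIVE they may be silently loaded as NULL."
        )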

akshayi1 commented 3 years ago

Oh. My. God! I cannot believe something as silly as this (1) actually exists and (2) is not documented ANYWHERE. I've been banging my head against a wall for a week over this. Turns out, all it needs is a single upper-case character in the column name and it'll work. Of course, @bmwilly's workaround is preferred, but it just seems really arcane for this to even be a thing. At the very least, if this is not going to be fixed, it should be documented. Right now, there's neither.

Huge thanks to @bmwilly though! :)

akshayi1 commented 3 years ago

I would like to make things case sensitive as the opposite would mean that some data could be inaccessible by the Python connector.

Can this be done in the background? The end-user shouldn't have to suffer.

sfc-gh-mkeller commented 3 years ago

I would like to make things case sensitive as the opposite would mean that some data could be inaccessible by the Python connector.

Can this be done in the background? The end-user shouldn't have to suffer.

I don't see how that would be possible.

However, you are right. I'll document that this API is case sensitive.

akshayi1 commented 3 years ago

@sfc-gh-mkeller - Honestly, I don't fully understand the issue. A column name like "age" (for example) fails, but "AGE", "Age", "aGe", or any combination where at least one letter is capitalized works. In Snowflake, at least in the UI, all column names (by default, and unless explicitly quoted by some process during table creation) appear in upper case anyway. So any column name like "Age" or "aGe" will work, and it only "fails" when it is all lower case, like "age". So I think a safer implementation would be to simply handle it on the fly behind the scenes, right?

Besides, the default implementation of to_sql() doesn't really care, and the only reason we'd use pd_writer is to load larger dataframes into Snowflake, because the 16k chunksize limit is just not going to cut it if the dataframe has over a million records. Given the default behavior, this quirk can be seen as an inconsistency. Thoughts?

sfc-gh-mkeller commented 3 years ago

@akshayi1 from https://docs.snowflake.com/en/sql-reference/identifiers-syntax.html#unquoted-identifiers you can see the following:

If an identifier is not enclosed in double quotes, it must begin with a letter or underscore (_) and cannot contain extended characters or blank spaces.

In general, always quoting identifiers is therefore safer. It would also mean that developers are aware of what their identifiers really are (because their identifiers wouldn't be silently upper-cased).
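
To see the difference in isolation (a small sketch; cnx is assumed to be an open snowflake.connector connection):

cur = cnx.cursor()

# Unquoted identifiers are upper-cased by Snowflake, so any casing resolves to AGE.
cur.execute('create or replace temporary table unquoted_demo (age int)')
cur.execute('select age, AGE, "AGE" from unquoted_demo')   # all three resolve to AGE

# Quoted identifiers are stored exactly as written and must be matched exactly.
cur.execute('create or replace temporary table quoted_demo ("age" int)')
cur.execute('select "age" from quoted_demo')               # works
# cur.execute('select age from quoted_demo')               # fails: unquoted age resolves to AGE, which does not exist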

So, I think a safer implementation would be to simply handle it on the fly behind the scenes right?

We cannot always guess correctly what a developer means, so why don't we just not guess in the first place and ask developers to be specific.

My intention was only to give users more power, by allowing them to access all tables and not to overwrite the names they provided to write_pandas. I'm sorry that this didn't work!

I see that my approach to slowly change the connector to be case sensitive is flawed (if it was to be done then it should be a sudden change across the whole connector). I will be reviewing this with my peers and making changes accordingly.

akshayi1 commented 3 years ago

Thank you for the detailed feedback @sfc-gh-mkeller :) As things stand, even the example given in Snowflake's documentation - https://docs.snowflake.com/en/user-guide/python-connector-api.html#pd_writer - does not work. I did see your PR here - https://github.com/snowflakedb/snowflake-connector-python/pull/358 - but this only covers the code. Most people go to the documentation pages first, and that itself is a bit out of sync with the underlying issue. Would you or someone in your team be able to please make suitable amendments to the doc page? Thank you!

akshayi1 commented 3 years ago

@sfc-gh-mkeller - One more thing to consider, maybe, is this - When devs use to_sql, they don't care about the column naming convention. It just... works. However, the idea with pd_writer is that it can re-use the existing SQLAlchemy engine and merely invoke write_pandas, without having to change any code. But since this comes at the cost of changing column names' casing, it can (and does :)) throw people off.

I know you've acknowledged as much, and I understand that the idea was to give users flexibility. To that end, since it is difficult to, as you said, guess correctly what a developer means, would a good workaround be to simply double-quote all column names under the hood? That way, you'd never have to coerce casing of any kind, and the tables end up being whatever the developers want them to be.
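
A rough sketch of what that could look like when write_pandas builds its column list (hypothetical; the real SQL generation lives in pandas_tools.py and is more involved):

def quoted_column_list(df) -> str:
    # Double-quote every column name (escaping embedded quotes) so Snowflake
    # matches them exactly as given, with no implicit upper-casing.
    return ','.join('"' + str(c).replace('"', '""') + '"' for c in df.columns)

# e.g. columns ['name', 'balance'] -> '"name","balance"'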

akshayi1 commented 3 years ago

Just to illustrate what I mean, this is the example from the documentation page on Snowflake -

import pandas
from snowflake.connector.pandas_tools import pd_writer

# Create a DataFrame containing data about customers
df = pandas.DataFrame([('Mark', 10), ('Luke', 20)], columns=['name', 'balance'])

# Specify that the to_sql method should use the pd_writer function
# to write the data from the DataFrame to the table named "customers"
# in the Snowflake database.
df.to_sql('customers', engine, index=False, method=pd_writer)

which won't work unless you change df = pandas.DataFrame([('Mark', 10), ('Luke', 20)], columns=['name', 'balance']) to, at the very least, df = pandas.DataFrame([('Mark', 10), ('Luke', 20)], columns=['Name', 'Balance']). This is what I mean when I say that having even a single upper-case letter anywhere in the column name, regardless of double quoting, allows the data to be written out as normal, and the NULL issue "goes away".

kyleflan commented 3 years ago

An interim solution for those of us who use Snowflake's default all-uppercase column names and don't want to type out each column name in caps (e.g. for 100+ columns, like I have):

import pandas as pd

# Create a DataFrame containing data about customers
df = pd.DataFrame([('Mark', 10), ('Luke', 20)], columns=['name', 'balance']).rename(lambda x: x.upper(), axis=1)
df
   NAME  BALANCE
0  Mark       10
1  Luke       20

kyleflan commented 3 years ago

Additionally, the case sensitivity issues impact the table name as well, e.g.:

import pandas as pd
import snowflake.connector
from snowflake.connector.pandas_tools import write_pandas

# Create the connection to the Snowflake database.
# cnx = snowflake.connector.connect(...)

cnx.execute_string("""create or replace table customers (
    row_id varchar(36) default uuid_string(),
    ts timestamp default current_timestamp,
    name text,
    balance int);""")

# Create a DataFrame containing data about customers
df = pd.DataFrame([('Mark', 10), ('Luke', 20)], columns=['name', 'balance']).rename(lambda x: x.upper(), axis=1)

# Write the data from the DataFrame to the table named "customers".
success, nchunks, nrows, _ = write_pandas(cnx, df, 'customers')
---------------------------------------------------------------------------
ProgrammingError                          Traceback (most recent call last)
 in 
     13 
     14 # Write the data from the DataFrame to the table named "customers".
---> 15 success, nchunks, nrows, _ = write_pandas(cnx, df, 'customers')

/data/athena/.venv/lib/python3.8/site-packages/snowflake/connector/pandas_tools.py in write_pandas(conn, df, table_name, database, schema, chunk_size, compression, on_error, parallel)
    133     )
    134     logger.debug("copying into with '{}'".format(copy_into_sql))
--> 135     copy_results = cursor.execute(copy_into_sql, _is_internal=True).fetchall()
    136     cursor.close()
    137     return (all((e[1] == 'LOADED' for e in copy_results)),

/data/athena/.venv/lib/python3.8/site-packages/snowflake/connector/cursor.py in execute(self, command, params, timeout, _do_reset, _put_callback, _put_azure_callback, _put_callback_output_stream, _get_callback, _get_azure_callback, _get_callback_output_stream, _show_progress_bar, _statement_params, _is_internal, _no_results, _use_ijson, _is_put_get, _raise_put_get_error, _force_put_overwrite)
    592                 'sfqid': self._sfqid
    593             }
--> 594             Error.errorhandler_wrapper(self.connection, self,
    595                                        ProgrammingError,
    596                                        errvalue)

/data/athena/.venv/lib/python3.8/site-packages/snowflake/connector/errors.py in errorhandler_wrapper(connection, cursor, error_class, error_value)
    121         if cursor is not None:
    122             cursor.messages.append((error_class, error_value))
--> 123             cursor.errorhandler(connection, cursor, error_class, error_value)
    124             return
    125         elif connection is not None:

/data/athena/.venv/lib/python3.8/site-packages/snowflake/connector/errors.py in default_errorhandler(connection, cursor, error_class, error_value)
     81             A Snowflake error.
     82         """
---> 83         raise error_class(
     84             msg=error_value.get('msg'),
     85             errno=error_value.get('errno'),

ProgrammingError: 001757 (42601): SQL compilation error:
Table '"customers"' does not exist

This (kind of)* works:

import pandas as pd
import snowflake.connector
from snowflake.connector.pandas_tools import write_pandas

# Create the connection to the Snowflake database.
# cnx = snowflake.connector.connect(...)

cnx.execute_string("""create or replace table customers (
    row_id varchar(36) default uuid_string(),
    ts timestamp default current_timestamp,
    name text,
    balance int);""")

# Create a DataFrame containing data about customers
df = pd.DataFrame([('Mark', 10), ('Luke', 20)], columns=['name', 'balance']).rename(lambda x: x.upper(), axis=1)

# Write the data from the DataFrame to the table named "CUSTOMERS".
success, nchunks, nrows, _ = write_pandas(cnx, df, 'CUSTOMERS')

*This does not insert default values for the row_id and ts columns. That issue is here: https://github.com/snowflakedb/snowflake-connector-python/issues/361

sfc-gh-mkeller commented 3 years ago

Would you or someone in your team be able to please make suitable amendments to the doc page?

I can do this, once we decide on exactly what the fix will be.

I know you've acknowledged as much, and I understand that the idea was to give users flexibility. To that end, since it is difficult to, as you said, guess correctly what a developer means, would a good workaround be to simply double-quote all column names under the hood? That way, you'd never have to coerce casing of any kind, and the tables end up being whatever the developers want them to be.

I apologize @akshayi1, I thought we were talking about table, schema, and database names until now.

Yes, so the differences are between write_pandas and snowflake-sqlalchemy.

In this case I'm proposing changing https://github.com/snowflakedb/snowflake-connector-python/blob/0215c0adadbbff6b33dfa32e52a1d0d887f557cf/pandas_tools.py#L128 from CASE_SENSITIVE to CASE_INSENSITIVE (https://docs.snowflake.com/en/sql-reference/sql/copy-into-table.html#copy-options-copyoptions), but I'm going to need to do some testing first.

The alternative would be to change how snowflake-sqlalchemy works, which would potentially break a lot more customers than changing write_pandas to be case-insensitive.
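
For context, a simplified sketch of the COPY INTO statement that write_pandas generates and the proposed change (table and stage names are placeholders; the real statement built in pandas_tools.py also sets compression and other options):

# Today: exact, case-sensitive matching of Parquet column names to table columns.
copy_into_sql = (
    'COPY INTO "CUSTOMERS" FROM @"temp_stage" '
    'FILE_FORMAT=(TYPE=PARQUET) '
    'MATCH_BY_COLUMN_NAME=CASE_SENSITIVE PURGE=TRUE ON_ERROR=abort_statement'
)

# Proposal: ignore case when matching, so lower-cased DataFrame columns still load.
copy_into_sql = copy_into_sql.replace('CASE_SENSITIVE', 'CASE_INSENSITIVE')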

akshayi1 commented 3 years ago

Making it "CASE_INSENSITIVE" would probably be for the best :) Hope that works out okay.