snowflakedb / snowflake-connector-python

Snowflake Connector for Python
https://pypi.python.org/pypi/snowflake-connector-python/
Apache License 2.0

SNOW-542421: Parallel Fetch with fetch_pandas_all results in duplicate index values. #1061

Closed jacksonrnewhouse closed 2 years ago

jacksonrnewhouse commented 2 years ago

Please answer these questions before submitting your issue. Thanks!

  1. What version of Python are you using?

Python 3.7.5 (default, Dec 9 2021, 17:04:37) [GCC 8.4.0]

  2. What operating system and processor architecture are you using?

Linux-5.4.117-58.216.amzn2.x86_64-x86_64-with-Ubuntu-18.04-bionic

  3. What are the component versions in the environment (pip freeze)?

Package Version


absl-py 1.0.0 aerospike 4.0.0 aiobotocore 1.2.1 aiohttp 3.8.1 aioitertools 0.8.0 aiosignal 1.2.0 alembic 1.7.5 ansiwrap 0.8.4 anyio 3.5.0 argon2-cffi 21.3.0 argon2-cffi-bindings 21.2.0 asn1crypto 1.4.0 astor 0.8.1 astroid 2.5 async-generator 1.10 async-timeout 4.0.2 asynctest 0.13.0 attrs 20.3.0 Authlib 0.15.5 autopep8 1.5.5 awscli 1.18.212 azure-common 1.1.26 azure-core 1.21.1 azure-storage-blob 12.7.1 backcall 0.2.0 banal 1.0.6 beautifulsoup4 4.9.3 behave 1.2.6 behave-pandas 0.3.0 bitarray 1.6.3 black 21.12b0 bleach 4.1.0 blis 0.4.1 bokeh 2.2.2 boto3 1.16.52 botocore 1.19.52 bqplot 0.12.18 branca 0.3.1 brandprotobuf 3.6.2 Brotli 1.0.9 bs4 0.0.1 cached-property 1.5.2 cachetools 4.2.4 certifi 2021.10.8 certipy 0.1.3 cffi 1.15.0 chardet 3.0.4 charset-normalizer 2.0.10 chart-studio 1.0.0 click 7.1.2 cloudpickle 2.0.0 cmdstanpy 0.9.5 colorama 0.4.3 colorlover 0.3.0 convertdate 2.3.2 creative-lifecycle-manager 1.119.0 cron-descriptor 1.2.24 cryptography 35.0.0 cufflinks 0.17.3 cycler 0.11.0 cymem 2.0.6 Cython 0.29.26 dash 2.0.0 dash-bootstrap-components 0.10.7 dash-core-components 2.0.0 dash-html-components 2.0.0 dash-qc-components 1.0.0 dash-table 5.0.0 dask 2022.1.0 dask-kubernetes 0.11.0 datadog 0.26.0 dataset 1.5.2 dateparser 0.7.0 debugpy 1.5.1 decorator 5.1.1 defusedxml 0.7.1 Deprecated 1.2.13 df2gspread 1.0.4 distributed 2022.1.0 distro 1.6.0 docutils 0.15.2 elasticsearch 7.9.1 entrypoints 0.3 ephem 4.1.3 experimentfr-analytics 0.1.81 experimentfr-grpc 0.1.81 falcon 3.0.1 fbprophet 0.7.1 findspark 1.4.2 flake8 3.8.3 Flask 1.1.4 Flask-Caching 1.9.0 Flask-Compress 1.10.1 freezegun 1.1.0 frozenlist 1.3.0 fsspec 0.8.7 future 0.18.2 futures 3.1.1 gast 0.2.2 gitdb 4.0.9 GitPython 3.1.26 Glances 3.1.5 google-api-python-client 1.6.7 google-auth 1.35.0 google-auth-oauthlib 0.4.6 google-pasta 0.2.0 googleapis-common-protos 1.6.0 grpc-graphql-gateway-proto 0.46.0 grpcio 1.30.0 grpcio-health-checking 1.30.0 grpcio-status 1.30.0 grpcio-tools 1.30.0 gspread 5.1.1 gunicorn 
20.1.0 h5py 3.6.0 HeapDict 1.0.1 hijri-converter 2.2.2 holidays 0.12 httplib2 0.20.2 hvac 0.9.6 idna 3.3 importlib-metadata 4.10.1 importlib-resources 5.4.0 iniconfig 1.1.1 invoke 1.6.0 ipydatawidgets 4.2.0 ipykernel 6.7.0 ipyleaflet 0.11.2 ipython 7.31.0 ipython-genutils 0.2.0 ipywidgets 7.6.5 isodate 0.6.1 isort 5.10.1 itsdangerous 1.1.0 jedi 0.17.2 Jinja2 2.11.3 jmespath 0.10.0 joblib 1.1.0 json5 0.9.6 jsonschema 4.4.0 jupyter-client 7.1.1 jupyter-console 6.2.0 jupyter-contrib-core 0.3.3 jupyter-contrib-nbextensions 0.5.1 jupyter-core 4.6.3 jupyter-highlight-selected-word 0.2.0 jupyter-kernel-gateway 2.4.3 jupyter-latex-envs 1.4.6 jupyter-lsp 0.9.2 jupyter-nbextensions-configurator 0.4.1 jupyter-server 1.13.3 jupyter-server-proxy 3.2.0 jupyter-telemetry 0.1.0 jupyterhub 2.0.2 jupyterlab 2.2.8 jupyterlab-code-formatter 1.3.6 jupyterlab-dash 0.1.0a3 jupyterlab-git 0.22.1 jupyterlab-iframe 0.2.3 jupyterlab-launcher 0.13.1 jupyterlab-pygments 0.1.2 jupyterlab-server 1.2.0 jupyterlab-widgets 1.0.2 jupytext 1.6.0 kazoo 2.8.0 kazurator 0.2.0 Keras 2.3.1 Keras-Applications 1.0.8 Keras-Preprocessing 1.1.2 kgrpc 1.0.13 kiwisolver 1.3.2 koalas 1.3.0 korean-lunar-calendar 0.2.1 kubernetes 17.17.0 kubernetes-asyncio 19.15.0 lazy-object-proxy 1.7.1 ldap3 2.8.1 llvmlite 0.32.1 locket 0.2.1 LunarCalendar 0.0.9 lxml 4.7.1 Mako 1.1.6 Markdown 3.3.6 markdown-it-py 0.5.8 MarkupSafe 2.0.1 matplotlib 3.2.2 matplotlib-inline 0.1.3 mccabe 0.6.1 metadata-parser 0.9.23 mistune 0.8.4 mmh3 2.5.1 mpmath 1.2.1 msgpack 1.0.3 msrest 0.6.21 multidict 5.2.0 murmurhash 1.0.6 mypy-extensions 0.4.3 nbclient 0.5.10 nbconvert 6.4.0 nbdime 2.1.1 nbformat 5.1.3 nbresuse 0.3.6 nest-asyncio 1.5.4 nltk 3.4.1 notebook 6.4.7 notification-center 0.12.0 numba 0.49.1 numpy 1.18.5 oauth2client 4.1.3 oauthenticator 0.8.0 oauthlib 3.1.1 opt-einsum 3.3.0 oscrypto 1.2.1 packaging 21.3 pamela 1.0.0 pandas 1.1.5 pandocfilters 1.5.0 papermill 2.3.0 parse 1.19.0 parse-type 0.5.2 parso 0.7.1 partd 0.3.10 pathspec 0.9.0 
patsy 0.5.2 perspective-dash-component 0.0.7 perspective-python 1.1.0 pexpect 4.8.0 pickleshare 0.7.5 Pillow 9.0.0 pip 21.3.1 pipdeptree 2.0.0 plac 1.1.3 platformdirs 2.4.1 plotly 5.5.0 plotly-express 0.4.1 pluggy 1.0.0 preshed 3.0.6 prometheus-client 0.12.0 prompt-toolkit 3.0.24 protobuf 3.13.0 psutil 5.9.0 psycopg2-binary 2.8.6 ptyprocess 0.7.0 py 1.11.0 py4j 0.10.9 pyarrow 6.0.1 pyasn1 0.4.8 pyasn1-modules 0.2.8 pybuilder 0.12.6 pycodestyle 2.6.0 pycparser 2.21 pycryptodomex 3.10.1 pycurl 7.44.1 pydocstyle 5.1.1 pyflakes 2.2.0 PyGithub 1.55 Pygments 2.11.2 PyGObject 3.26.1 PyHamcrest 1.9.0 PyJWT 2.3.0 pylint 2.7.1 pymc3 3.5 PyMeeus 0.5.11 PyNaCl 1.5.0 pyOpenSSL 21.0.0 pyparsing 3.0.6 pyrsistent 0.18.1 PySocks 1.6.8 pyspark 3.1.2 pystan 2.19.0.0 pytest 6.2.5 python-apt 1.6.5+ubuntu0.7 python-dateutil 2.8.2 python-json-logger 2.0.2 python-jsonrpc-server 0.4.0 python-language-server 0.35.1 python-oauth2 1.1.0 python-pptx 0.6.18 python-snappy 0.6.0 pythreejs 2.2.1 pytz 2021.3 PyYAML 5.3.1 pyzmq 22.3.0 regex 2020.11.13 requests 2.27.1 requests-futures 1.0.0 requests-oauthlib 1.3.0 requests-toolbelt 0.9.1 retry 0.9.2 retrying 1.3.3 rope 0.18.0 rsa 4.5 rtb-deployer-client 1.0b254 rtb-deployer-grpc 1.0b129 ruamel.yaml 0.17.20 ruamel.yaml.clib 0.2.6 s3fs 0.4.2 s3transfer 0.3.7 scikit-learn 1.0.2 scipy 1.4.1 seaborn 0.11.0 Send2Trash 1.8.0 setuptools 60.8.1 setuptools-git 1.2 simpervisor 0.4 simple-salesforce 1.10.1 simplegeneric 0.8.1 six 1.16.0 slacker 0.14.0 smart-open 5.0.0 smmap 5.0.0 sniffio 1.2.0 snowballstemmer 2.2.0 snowflake-connector-python 2.7.2 snowflake-sqlalchemy 1.2.4 sortedcontainers 2.4.0 soupsieve 2.3.1 spacy 2.2.2 sparkhub-client 1.0.235 sparkmeasure 0.14.0 sparkmonitor 1.1.0 SQLAlchemy 1.3.23 sqlparse 0.2.4 srsly 1.0.5 ssh-import-id 5.11 statsmodels 0.13.1 survey-configuration-service 0.28.0 sympy 1.3 tabulate 0.8.9 tblib 1.7.0 tenacity 8.0.1 tensorboard 2.1.1 tensorflow 2.1.0 tensorflow-estimator 2.1.0 tensorflow-gpu 2.1.0 termcolor 1.1.0 terminado 
0.12.1 testpath 0.5.0 textwrap3 0.9.2 Theano 1.0.5 thinc 7.3.1 thirdparty-protoc-gen-validate 0.3.0 threadpoolctl 3.0.0 timeago 1.0.15 tinys3 0.1.12 toml 0.10.2 tomli 1.2.3 toolz 0.11.2 toposort 1.6 torch 1.6.0 torchtext 0.2.3 tornado 6.1 tornado-proxy-handlers 0.0.5 tqdm 4.62.3 traitlets 5.1.1 traittypes 0.2.1 typed-ast 1.4.3 typing_extensions 4.0.1 tzlocal 2.1 ua-parser 0.8.0 ujson 5.1.0 unattended-upgrades 0.1 uritemplate 3.0.1 urllib3 1.26.8 wasabi 0.9.0 wcwidth 0.2.5 webencodings 0.5.1 websocket-client 0.53.0 Werkzeug 1.0.1 wheel 0.37.1 widgetsnbextension 3.5.2 wrapt 1.12.1 xarray 0.20.2 xeus-python 0.11.2 xgboost 1.2.1 XlsxWriter 3.0.2 xlwt 1.3.0 yapf 0.30.0 yarl 1.7.2 zict 2.0.0 zipp 3.7.0 zope.interface 4.7.2

  4. What did you do?

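I ran:

import snowflake.connector
from snowflake.connector import DictCursor

# CONNECTION is a dict of connection parameters, defined elsewhere
conn = snowflake.connector.connect(**CONNECTION)
cursor = conn.cursor(DictCursor)
result = cursor.execute("select * FROM big_table")
df = result.fetch_pandas_all()

# Total number of rows
print(len(df.index))
# 298505
# Number of distinct index values
print(len(set(df.index)))
# 16628
# Number of rows sharing index value 1
print(len(df[df.index == 1]))
# 44
# Number of rows sharing index value 2
print(len(df[df.index == 2]))
# 44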

  5. What did you expect to see?

I expect the resulting pandas DataFrame to have a unique index, because downstream processing relies on that. The change was almost certainly introduced in #787, which shipped in 2.6; it surfaced for me when we upgraded from 2.4.0 to 2.7.2. Calling reset_index(drop=True) on the resulting DataFrame mitigates it, but it was an unexpected deviation from past behavior.
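
The symptom and the mitigation can be sketched without a Snowflake connection. This is a minimal illustration, assuming the duplicates arise because per-batch frames, each with its own 0-based index, are concatenated as-is. That cause is an assumption, not something confirmed in this thread, and the `batches` list is a made-up stand-in for fetched result batches.

```python
import pandas as pd

# Hypothetical stand-ins for result batches: each frame carries its own
# default 0..n-1 index, as a freshly built DataFrame does.
batches = [
    pd.DataFrame({"value": [10, 11, 12]}),
    pd.DataFrame({"value": [20, 21, 22]}),
    pd.DataFrame({"value": [30, 31]}),
]

# Concatenating without ignore_index keeps every batch's labels,
# so the same index values repeat across batches.
df = pd.concat(batches)
print(len(df.index))           # 8
print(df.index.is_unique)      # False
print(len(df[df.index == 1]))  # 3 (one row per batch)

# Mitigation from the report: discard the old labels and renumber 0..n-1.
fixed = df.reset_index(drop=True)
print(fixed.index.is_unique)   # True
```

Checking `df.index.is_unique` right after `fetch_pandas_all()` is a cheap guard for code that depends on a unique index, whichever connector version is installed.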

sfc-gh-mkeller commented 2 years ago

This should have been fixed by #1068. Please reopen if that is not the case.