snowflakedb / snowpark-python

Snowflake Snowpark Python API
Apache License 2.0
267 stars 109 forks source link

SNOW-992587: Unable to select metadata columns #1185

Closed gnilrets closed 10 months ago

gnilrets commented 10 months ago

Please answer these questions before submitting your issue. Thanks!

  1. What version of Python are you using?

3.9.12

  1. What operating system and processor architecture are you using?

Linux-5.10.197-186.748.amzn2.x86_64-x86_64-with-glibc2.31

  1. What are the component versions in the environment (pip freeze)?

absl-py==1.1.0 acryl-datahub==0.12.1.0 adagio==0.2.4 adjustText==0.8 aenum==3.1.11 aeppl==0.0.34 aesara==2.7.9 affine==2.3.0 agate==1.7.1 aiobotocore==2.5.4 aiohttp==3.8.5 aioitertools==0.11.0 aiosignal==1.2.0 alembic==1.11.3 altair==5.1.1 altair-transform==0.2.0 angel-cd==1.0.3 ansi2html==1.8.0 antlr4-python3-runtime==4.11.1 anyio==3.6.2 apify-client==1.6.1 apify-shared==1.1.0 appdirs==1.4.4 argon2-cffi==21.1.0 argparse==1.4.0 arrow==1.2.2 arviz==0.12.1 asn1crypto==1.4.0 astor==0.8.1 astunparse==1.6.3 async-timeout==4.0.2 atomicwrites==1.4.1 atpublic==4.0 attrs==23.1.0 Authlib==1.2.1 autograd==1.5 autograd-gamma==0.5.0 autopep8==1.5.7 avro==1.11.3 avro-gen3==0.7.11 awswrangler==3.3.0 Babel==2.12.1 backcall==0.2.0 backoff==2.1.2 bayesian-optimization==1.2.0 bcrypt==3.2.0 beautifulsoup4==4.10.0 bertopic==0.15.0 bidict==0.22.1 bigframes==0.12.0 bimlpa==0.1.2 biopython==1.80 bitarray==2.5.1 black==22.12.0 bleach==4.1.0 blis==0.7.10 blosc2==2.0.0 bokeh==2.4.3 boto==2.49.0 boto3==1.28.17 botocore==1.31.17 boxsdk==3.9.2 branca==0.4.2 build==0.10.0 cached-property==1.5.2 cachetools==5.3.0 catalogue==2.0.6 catboost==1.2 cattrs==1.10.0 cdlib==0.2.6 certifi==2021.10.8 cffi==1.14.6 cftime==1.5.1 chainladder==0.8.18 charset-normalizer==2.0.6 chinese-whispers==0.8.0 chroma-hnswlib==0.7.1 chromadb==0.4.1 click==8.0.4 click-default-group==1.2.4 click-plugins==1.1.1 click-spinner==0.1.10 clickhouse-driver==0.2.3 cligj==0.7.2 clikit==0.6.2 cloudpickle==2.0.0 cma==3.2.2 cmdstanpy==1.0.4 colorama==0.4.6 colorcet==3.0.0 coloredlogs==15.0.1 colour==0.1.5 comm==0.1.4 confection==0.1.1 cons==0.4.5 contextily==1.1.0 convertdate==2.3.2 convoys==0.2.1 crashtest==0.3.1 cryptography==40.0.2 cvxpy==1.3.1 cycler==0.10.0 cymem==2.0.5 Cython==0.29.24 cytoolz==0.11.2 dash==2.9.2 dash-core-components==2.0.0 dash-html-components==2.0.0 dash-table==5.0.0 dask==2022.4.0 databricks-cli==0.17.7 dataclasses-json==0.5.9 datadog==0.35.0 datatable==1.0.0 db-dtypes==1.1.1 dbt-core==1.7.3 dbt-extractor==0.5.1 dbt-semantic-interfaces==0.4.2 dbt-snowflake==1.7.1 debugpy==1.6.7.post1 decorator==5.1.0 deepdish==0.3.7 defusedxml==0.7.1 demon==2.0.6 Deprecated==1.2.13 descartes==1.1.0 df2gspread==1.0.4 dill==0.3.4 distro==1.8.0 dm-tree==0.1.6 dnspython==2.2.0 docker==6.1.3 docutils==0.16 dropbox==11.36.2 duckdb==0.8.1 dulwich==0.21.7 dynetx==0.3.1 easychart==0.1.16 easypost==5.1.3 easytree==0.1.12 ecos==2.0.12 emcee==3.1.2 emoji==2.4.0 entrypoints==0.3 ephem==4.1 et-xmlfile==1.1.0 eth-abi==4.1.0 eth-account==0.9.0 eth-hash==0.5.2 eth-keyfile==0.6.1 eth-keys==0.4.0 eth-rlp==0.3.0 eth-typing==3.4.0 eth-utils==2.2.0 etils==0.9.0 etuples==0.3.5 eva-lcd==0.1.1 expandvars==0.12.0 fastapi==0.99.1 fastdiff==0.3.0 fastprogress==1.0.0 fbprophet @ git+https://github.com/hex-inc/prophet.git@f7cd9b9fa71f5a941421f540cb10b751e59ae2ba filelock==3.8.0 Fiona==1.8.20 flake8==4.0.1 Flask==2.1.3 flatbuffers==2.0.7 folium==0.12.1.post1 fonttools==4.34.4 formulaic==0.5.2 fredapi==0.5.0 frozenlist==1.3.0 fs==2.4.16 fsspec==2023.9.0 fugue==0.8.3 fugue-sql-antlr==0.1.6 funcy==1.16 future==0.18.2 fuzzywuzzy==0.18.0 gast==0.4.0 gcsfs==2023.9.0 gensim==4.3.2 geographiclib==1.52 geopandas==0.12.2 geopy==2.2.0 gitdb==4.0.9 GitPython==3.1.27 google-api-core==2.11.0 google-api-python-client==1.6.7 google-api-support==0.1.3 google-auth==2.17.1 google-auth-httplib2==0.1.0 google-auth-oauthlib==1.0.0 google-cloud-aiplatform==1.36.0 google-cloud-appengine-logging==1.1.1 google-cloud-audit-log==0.2.0 google-cloud-bigquery==3.13.0 google-cloud-bigquery-connection==1.13.2 google-cloud-bigquery-storage==2.19.1 google-cloud-billing==1.5.1 google-cloud-core==2.1.0 google-cloud-functions==1.13.3 google-cloud-iam==2.12.2 google-cloud-logging==3.0.0 google-cloud-resource-manager==1.10.4 google-cloud-storage==2.1.0 google-crc32c==1.3.0 google-pasta==0.2.0 google-resumable-media==2.0.3 googleapis-common-protos==1.59.0 googlemaps==4.5.3 gql==3.4.0 graphql-core==3.2.0 graphviz==0.20.1 greenlet==1.1.2 gremlinpython==3.6.4 grpc-google-iam-v1==0.12.6 grpcio==1.50.0 grpcio-status==1.41.0 gspread==5.7.1 gspread-dataframe==3.3.0 gspread-pandas==3.2.2 gunicorn==21.2.0 h11==0.14.0 h3==3.7.6 h5py==3.1.0 hdbscan==0.8.33

Editable install with no version control (hex-api==1.0.0)

-e /python-sdk

Editable install with no version control (hex-data-service==0.1.0)

-e /data-service-python

Editable install with no version control (hex-lazy-installer==0.1.0)

-e /hex-lazy-installer

Editable install with no version control (hex-packages==0.1.0)

-e /python-kernel-packages

Editable install with no version control (hex-shared==0.1.0)

-e /python-shared hexbytes==0.2.2

Editable install with no version control (hextoolkit==0.1.0)

-e /python-api hijri-converter==2.2.2 holidays==0.14.2 holoviews==1.15.0 httpcore==1.0.2 httplib2==0.20.1 httpstan==4.8.1 httptools==0.6.0 httpx==0.25.2 hubspot-api-client==8.1.1 huggingface-hub==0.11.1 humanfriendly==10.0 humanize==4.8.0 hyperopt==0.2.7 ibis-framework==6.2.0 idna==3.2 igraph==0.9.11 ijson==3.2.3 imageio==2.9.0 imbalanced-learn==0.9.1 importlib-metadata==6.8.0 importlib-resources==5.10.0 interface-meta==1.3.0 ipykernel==6.25.1 ipython==7.32.0 ipython-genutils==0.2.0 ipywidgets==7.8.1 isodate==0.6.1 itsdangerous==2.1.2 jaraco.classes==3.3.0 jax==0.3.23 jaxlib==0.3.22 jedi==0.17.2 jeepney==0.8.0 Jinja2==3.1.2 jinjasql==0.1.8 jmespath==0.10.0 joblib==1.2.0 jsonpatch==1.33 jsonpath-ng==1.5.3 jsonpointer==2.4 jsonref==1.1.0 jsonschema==4.0.1 jupyter-client==7.0.6 jupyter-dash==0.4.2 jupyter_core==5.3.1 jupyterlab-pygments==0.1.2 jupyterlab-widgets==1.0.2 kaleido==0.2.1 keplergl==0.1.2 keras==2.12.0 Keras-Applications==1.0.8 keyring==24.3.0 kiwisolver==1.3.2 korean-lunar-calendar==0.2.1 langchain==0.0.347 langchain-core==0.0.11 langcodes==3.3.0 langsmith==0.0.69 leather==0.3.4 libclang==14.0.1 lifelines==0.27.4 lightfm==1.17 lightgbm==3.3.5 littleutils==0.2.2 llvmlite==0.40.1 locket==0.2.1 Logbook==1.5.3 logical-unification==0.4.5 loguru==0.6.0 looker-sdk==22.2.1 lru-dict==1.1.7 LunarCalendar==0.0.9 lxml==4.8.0 Mako==1.2.4 Markdown==3.3.4 markdown-it-py==2.2.0 markov-clustering==0.0.6.dev0 MarkupSafe==2.0.1 marshmallow==3.17.0 marshmallow-enum==1.5.1 mashumaro==3.11 matplotlib==3.5.2 matplotlib-inline==0.1.3 matplotlib-venn==0.11.7 mccabe==0.6.1 mdurl==0.1.2 mercantile==1.2.1 miniKanren==1.0.3 minimal-snowplow-tracker==0.0.2 mistune==0.8.4 mixpanel==4.10.0 mizani==0.7.4 mlflow==2.6.0 modelbit==0.29.0 monotonic==1.6 more-itertools==8.10.0 mpmath==1.3.0 msgpack==1.0.5 multidict==6.0.2 multipledispatch==0.6.0 multiprocess==0.70.12.2 multitasking==0.0.11 munch==2.5.0 murmurhash==1.0.5 mypy==0.961 mypy-extensions==0.4.3 natsort==8.4.0 nbclient==0.5.4 nbconvert==6.2.0 nbformat==5.1.3 nest-asyncio==1.5.1 netCDF4==1.5.7 networkx==2.6.3 nevergrad==0.5.0 nf1==0.0.4 nltk==3.7 notebook==6.4.12 numba==0.57.1 numexpr==2.8.5 numpy==1.23.4 numpy-financial==1.0.0 numpyro==0.10.1 oauth2client==4.1.3 oauthlib==3.1.1 onnxruntime==1.15.1 openai==1.3.7 openapi-schema-pydantic==1.2.4 opencv-python==4.8.0.74 openpyxl==3.0.9 opensearch-py==1.1.0 opt-einsum==3.3.0 optbinning==0.17.3 orbit-ml==1.1.4.2 orjson==3.8.9 ortools==9.4.1874 oscrypto==1.2.1 osqp==0.6.2.post9 outdated==0.2.2 overrides==7.4.0 packaging==21.3 palettable==3.3.0 pandas==1.5.3 pandas-flavor==0.6.0 pandas-gbq==0.19.1 pandasql==0.7.3 pandocfilters==1.5.0 panel==0.13.1 param==1.12.2 paramiko==2.10.2 parsedatetime==2.6 parsimonious==0.9.0 parso==0.7.1 parsy==2.1 partd==1.2.0 pastel==0.2.1 pathlib-mate==1.0.1 pathspec==0.9.0 pathy==0.10.2 patsy==0.5.2 pexpect==4.8.0 pg8000==1.29.2 pickleshare==0.7.5 Pillow==9.1.1 pinecone-client==2.2.4 pingouin==0.5.3 pip==23.3.1 pkginfo==1.9.6 platformdirs==3.0.0 plotly==5.13.0 plotly-resampler==0.8.3.2 plotnine==0.9.0 pluggy==0.13.1 ply==3.11 polars==0.19.2 pooch==1.6.0 posthog==3.0.2 preshed==3.0.5 presto-python-client==0.7.0 progressbar2==4.0.0 prometheus-client==0.11.0 prompt-toolkit==3.0.20 prophet==1.1 proto-plus==1.22.1 protobuf==4.24.1 psutil==5.8.0 psycopg2==2.9.1 ptyprocess==0.7.0 PuLP==2.6.0 pulsar-client==3.2.0 py==1.10.0 py-cpuinfo==9.0.0 py4j==0.10.9.7 pyarrow==10.0.1 pyarrow-hotfix==0.6 pyasn1==0.4.8 pyasn1-modules==0.2.8 pyclustering==0.10.1.2 pycodestyle==2.8.0 pycparser==2.20 pycryptodome==3.14.1 pycryptodomex==3.11.0 pyct==0.4.8 pydantic==1.10.12 pydata-google-auth==1.8.2 pydeck==0.7.1 pydot==1.4.2 pyerf==1.0.1 pyflakes==2.4.0 Pygments==2.10.0 pygraphviz==1.10 pygsheets==2.0.5 pyjanitor==0.25.0 PyJWT==2.4.0 pyLDAvis==3.2.2 pylev==1.4.0 pymc==4.1.5 pymc3==3.11.4 PyMeeus==0.5.11 pymer4==0.8.0 pymongo==4.0.1 pymsteams==0.2.2 PyMySQL==1.0.2 PyNaCl==1.5.0 pynmeagps==1.0.20 pynndescent==0.5.4 pyodbc==4.0.32 pyOpenSSL==23.2.0 pyparsing==2.4.7 PyPika==0.48.9 pyproj==3.6.1 pyproject_hooks==1.0.0 pyro-api==0.1.2 pyro-ppl==1.8.4 pyrsistent==0.18.0 pysftp==0.2.9 pyshp==2.1.3 pysimdjson==3.2.0 pystan==3.5.0 pytest==5.4.3 python-box==7.0.1 python-dateutil==2.8.2 python-dotenv==1.0.0 python-igraph==0.9.11 python-json-logger==2.0.7

Editable install with no version control (python-kernel-startup==0.1.0)

-e /python-kernel-startup python-Levenshtein==0.12.2 python-louvain==0.16 python-slugify==8.0.1

Editable install with no version control (python-universal-dataframe==0.1.0)

-e /python-universal-dataframe python-utils==3.3.3 pytimeparse==1.1.8 pytorch-ignite==0.4.6 pytrends==4.7.3 pytz==2023.3.post1 pyunormalize==15.0.0 pyviz-comms==2.2.0 PyWavelets==1.1.1 PyYAML==6.0.1 pyzmq==22.3.0 qdldl==0.1.7 qpd==0.4.0 querystring-parser==1.2.4 rasterio==1.2.9 redmail==0.4.0 redshift-connector==2.0.915 regex==2023.5.5 requests==2.28.0 requests-aws4auth==1.2.3 requests-file==1.5.1 requests-oauthlib==1.3.0 requests-toolbelt==0.9.1 retrying==1.3.4 rich==13.2.0 rlp==3.0.0 ropwr==1.0.0 rpy2==3.5.13 rsa==4.7.2 Rtree==0.9.7 ruamel.yaml==0.17.26 ruamel.yaml.clib==0.2.7 s3fs==2023.9.0 s3transfer==0.6.0 SALib==1.4.7 scikit-image==0.18.3 scikit-learn==1.2.2 scikits.bootstrap==1.1.0 scipy==1.10.1 scramp==1.4.1 scs==3.2.3 seaborn==0.12.1 SecretStorage==3.3.3 selenium==3.141.0 semver==2.13.0 Send2Trash==1.8.0 sentence-transformers==2.2.2 sentencepiece==0.1.97 sentry-sdk==1.39.0 setuptools==68.2.2 setuptools-git==1.2 shap==0.41.0 shapely==2.0.1 sidetable==0.9.0 simple-salesforce==1.11.4 simplejson==3.18.1 six==1.15.0 sklearn-pandas==1.8.0 sktime==0.16.1 slack-sdk==3.18.3 slicer==0.0.7 smart-open==5.2.1 smmap==5.0.0 snapshottest==0.6.0 sniffio==1.3.0 snowflake-connector-python==3.1.0 snowflake-ml-python==1.0.7 snowflake-snowpark-python==1.6.1 snowflake-sqlalchemy==1.4.7 snuggs==1.4.7 sortedcontainers==2.4.0 soupsieve==2.2.1 spacy==3.6.0 spacy-legacy==3.0.12 spacy-loggers==1.0.4 sparse==0.14.0 splunk-sdk==1.6.18 sql-metadata==2.5.0 SQLAlchemy==1.4.25 sqlalchemy-redshift==0.7.9 sqlalchemy2-stubs==0.0.2a32 sqlglot==11.4.5 sqlmodel==0.0.8 sqlparse==0.4.2 srsly==2.4.7 starlette==0.27.0 statsforecast==1.5.0 statsmodels==0.14.0 stone==3.2.1 stripe==2.60.0 styleframe==4.1 sympy==1.12 tableauserverclient==0.24 tables==3.8.0 tabulate==0.8.9 tenacity==8.2.2 tensorboard==2.12.1 tensorboard-data-server==0.7.0 tensorboard-plugin-wit==1.8.0 tensorflow==2.12.0 tensorflow-decision-forests==1.3.0 tensorflow-estimator==2.12.0 tensorflow-hub==0.14.0 tensorflow-io-gcs-filesystem==0.26.0 tensorflow-probability==0.19.0 termcolor==1.1.0 terminado==0.12.1 testpath==0.5.0 text-unidecode==1.3 textblob==0.15.3 texttable==1.6.4 tfcausalimpact==0.0.13 Theano-PyMC==1.1.2 thinc==8.1.10 threadpoolctl==3.0.0 thresholdclustering==1.1 tifffile==2021.8.30 tiktoken==0.5.2 tokenizers==0.13.2 toml==0.10.2 tomli==2.0.1 tomlkit==0.12.1 toolz==0.11.1 torch==1.12.1 torchvision==0.13.1 tornado==6.3.3 tqdm==4.65.0 trace-updater==0.0.9 traitlets==5.9.0 traittypes==0.2.1 transformers==4.25.1 triad==0.8.4 typed-ast==1.5.1 typer==0.4.0 types-cachetools==5.3.0.5 types-pkg-resources==0.1.3 types-PyYAML==6.0.12 types-requests==2.28.2 types-urllib3==1.26.15 typing-inspect==0.9.0 typing_extensions==4.7.1 tzlocal==3.0 ua-parser==0.10.0 ujson==5.4.0 umap-learn==0.5.1 UpSetPlot==0.6.1 uritemplate==3.0.1 urllib3==1.26.16 user-agents==2.2.0 uszipcode==0.2.6 uvicorn==0.23.2 uvloop==0.17.0 validators==0.20.0 vegafusion==1.5.0 vegafusion-python-embed==1.5.0 vl-convert-python==0.13.1 wasabi==1.1.2 wasmer==1.1.0 wasmer_compiler_cranelift==1.1.0 watchfiles==0.19.0 wcwidth==0.2.5 web3==6.8.0 webargs==8.2.0 webencodings==0.5.1 websocket-client==1.6.1 websockets==11.0.3 Werkzeug==2.0.2 wheel==0.38.4 widgetsnbextension==3.6.6 wordcloud==1.8.1 wrapt==1.12.1 wurlitzer==3.0.2 xarray==0.19.0 xarray-einstats==0.3.0 xgboost==1.7.3 xlrd==2.0.1 xxhash==3.4.1 yarl==1.7.2 yfinance==0.1.87 youtube-data-api==0.0.21 zipp==3.10.0 zstandard==0.21.0

  1. What did you do?

I have a stage with a few CSV files in it. I wanted to create a Snowpark dataframe that read the CSV files and placed the filename as a column in the dataframe. I was then trying to use the filename column downstream but I got a lot of strange errors.

Here's what I started with:

import snowflake.snowpark.functions as F  # noqa: N812
import snowflake.snowpark.types as SPT  # noqa: N812

csv_spdf = (
    session.read
    .options({
        "skip_header": 21
    })
    .with_metadata(F.col("metadata$filename").alias("FILENAME"))
    .schema(
        SPT.StructType([
            SPT.StructField("TIME", SPT.FloatType()),
            SPT.StructField("CH1", SPT.FloatType()),
            SPT.StructField("CH2", SPT.FloatType()),
        ])
    )
    .csv("@DEV_STERLINGP_RAW.PUBLIC.POC")
)

When I convert it to a pandas dataframe and look at the columns, I see everything I expect:

csv_spdf.limit(100).to_pandas()
# Index(['FILENAME', 'TIME', 'CH1', 'CH2'], dtype='object')

However, if I look at the columns in the Snowpark dataframe, the "FILENAME" column doesn't exist:

csv_spdf.columns
# ['TIME', 'CH1', 'CH2']

If I include the FILENAME in the schema section (SPT.StructField("FILENAME", SPT.StringType())), then it does show up in the csv_spdf.columns. However, when I try to convert it to a pandas dataframe I get an error that there are duplicate columns: SnowparkFetchDataException: (1406): Failed to fetch a Pandas Dataframe. The error is: Found non-unique column index

Some other strange things that happen here is that if I try to group by the FILENAME column, it can execute the query and return the results:

# This works and I get the expected pandas dataframe:
(
    csv_spdf
    .group_by(F.col("FILENAME"))
    .count()
).to_pandas()

However, if I try to ask for the columns that are in the above dataframe expression, I get an error:

(
    csv_spdf
    .group_by(F.col("FILENAME"))
    .count()
).columns
# Returns SQL compilation error: error line 1 at position 7 invalid identifier 'FILENAME'
sfc-gh-aalam commented 10 months ago

@gnilrets can you retry using snowpark-python version 1.11.1. A bug similar to this was addressed in the latest release

gnilrets commented 10 months ago

Yep, that did it. Thanks!