I have a stage with a few CSV files in it. I wanted to create a Snowpark dataframe that reads the CSV files and adds the source filename as a column. When I then tried to use the filename column downstream, I got a number of strange errors.
When I convert the dataframe to a pandas dataframe and look at the columns, I see everything I expect. However, if I look at the columns of the Snowpark dataframe, the "FILENAME" column doesn't exist:
csv_spdf.columns
# ['TIME', 'CH1', 'CH2']
If I include FILENAME in the schema (SPT.StructField("FILENAME", SPT.StringType())), then it does show up in csv_spdf.columns. However, when I then try to convert the dataframe to pandas, I get an error indicating that there are duplicate columns: SnowparkFetchDataException: (1406): Failed to fetch a Pandas Dataframe. The error is: Found non-unique column index
Another strange thing is that if I group by the FILENAME column, the query executes and returns the results:
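The "non-unique column index" message suggests that the fetched result contains two columns with the same name, which would be consistent with FILENAME being both declared in the schema and added a second time by the reader. The following is a minimal sketch of that situation in plain pandas, outside Snowflake. The column list is hypothetical (it was not captured from the failing query), and `duplicated_names` is an illustrative helper, not part of Snowpark.

```python
from collections import Counter

import pandas as pd


def duplicated_names(names):
    """Return the column names that occur more than once, in first-seen order."""
    counts = Counter(names)
    return [name for name in dict.fromkeys(names) if counts[name] > 1]


# Hypothetical result-set columns: the three data columns, the FILENAME
# declared in the schema, and a second FILENAME added on top of it.
result_columns = ["TIME", "CH1", "CH2", "FILENAME", "FILENAME"]
print(duplicated_names(result_columns))

# pandas can hold such a frame, but its column index is no longer unique.
pdf = pd.DataFrame([[0.0, 1.0, 2.0, "a.csv", "a.csv"]], columns=result_columns)
print(pdf.columns.is_unique)

# Keeping only the first occurrence of each name restores a unique index.
deduped = pdf.loc[:, ~pdf.columns.duplicated()]
print(list(deduped.columns))
```

Running the same kind of check on the column names of the query that Snowpark actually sends might help confirm whether FILENAME is really selected twice.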
# This works and I get the expected pandas dataframe:
(
    csv_spdf
    .group_by(F.col("FILENAME"))
    .count()
).to_pandas()
However, if I ask for the columns of that same dataframe expression instead of executing it, I get an error:
(
    csv_spdf
    .group_by(F.col("FILENAME"))
    .count()
).columns
# Raises: SQL compilation error: error line 1 at position 7 invalid identifier 'FILENAME'
Python version: 3.9.12
Platform: Linux-5.10.197-186.748.amzn2.x86_64-x86_64-with-glibc2.31
Component versions in the environment (pip freeze):
absl-py==1.1.0 acryl-datahub==0.12.1.0 adagio==0.2.4 adjustText==0.8 aenum==3.1.11 aeppl==0.0.34 aesara==2.7.9 affine==2.3.0 agate==1.7.1 aiobotocore==2.5.4 aiohttp==3.8.5 aioitertools==0.11.0 aiosignal==1.2.0 alembic==1.11.3 altair==5.1.1 altair-transform==0.2.0 angel-cd==1.0.3 ansi2html==1.8.0 antlr4-python3-runtime==4.11.1 anyio==3.6.2 apify-client==1.6.1 apify-shared==1.1.0 appdirs==1.4.4 argon2-cffi==21.1.0 argparse==1.4.0 arrow==1.2.2 arviz==0.12.1 asn1crypto==1.4.0 astor==0.8.1 astunparse==1.6.3 async-timeout==4.0.2 atomicwrites==1.4.1 atpublic==4.0 attrs==23.1.0 Authlib==1.2.1 autograd==1.5 autograd-gamma==0.5.0 autopep8==1.5.7 avro==1.11.3 avro-gen3==0.7.11 awswrangler==3.3.0 Babel==2.12.1 backcall==0.2.0 backoff==2.1.2 bayesian-optimization==1.2.0 bcrypt==3.2.0 beautifulsoup4==4.10.0 bertopic==0.15.0 bidict==0.22.1 bigframes==0.12.0 bimlpa==0.1.2 biopython==1.80 bitarray==2.5.1 black==22.12.0 bleach==4.1.0 blis==0.7.10 blosc2==2.0.0 bokeh==2.4.3 boto==2.49.0 boto3==1.28.17 botocore==1.31.17 boxsdk==3.9.2 branca==0.4.2 build==0.10.0 cached-property==1.5.2 cachetools==5.3.0 catalogue==2.0.6 catboost==1.2 cattrs==1.10.0 cdlib==0.2.6 certifi==2021.10.8 cffi==1.14.6 cftime==1.5.1 chainladder==0.8.18 charset-normalizer==2.0.6 chinese-whispers==0.8.0 chroma-hnswlib==0.7.1 chromadb==0.4.1 click==8.0.4 click-default-group==1.2.4 click-plugins==1.1.1 click-spinner==0.1.10 clickhouse-driver==0.2.3 cligj==0.7.2 clikit==0.6.2 cloudpickle==2.0.0 cma==3.2.2 cmdstanpy==1.0.4 colorama==0.4.6 colorcet==3.0.0 coloredlogs==15.0.1 colour==0.1.5 comm==0.1.4 confection==0.1.1 cons==0.4.5 contextily==1.1.0 convertdate==2.3.2 convoys==0.2.1 crashtest==0.3.1 cryptography==40.0.2 cvxpy==1.3.1 cycler==0.10.0 cymem==2.0.5 Cython==0.29.24 cytoolz==0.11.2 dash==2.9.2 dash-core-components==2.0.0 dash-html-components==2.0.0 dash-table==5.0.0 dask==2022.4.0 databricks-cli==0.17.7 dataclasses-json==0.5.9 datadog==0.35.0 datatable==1.0.0 db-dtypes==1.1.1 dbt-core==1.7.3 
dbt-extractor==0.5.1 dbt-semantic-interfaces==0.4.2 dbt-snowflake==1.7.1 debugpy==1.6.7.post1 decorator==5.1.0 deepdish==0.3.7 defusedxml==0.7.1 demon==2.0.6 Deprecated==1.2.13 descartes==1.1.0 df2gspread==1.0.4 dill==0.3.4 distro==1.8.0 dm-tree==0.1.6 dnspython==2.2.0 docker==6.1.3 docutils==0.16 dropbox==11.36.2 duckdb==0.8.1 dulwich==0.21.7 dynetx==0.3.1 easychart==0.1.16 easypost==5.1.3 easytree==0.1.12 ecos==2.0.12 emcee==3.1.2 emoji==2.4.0 entrypoints==0.3 ephem==4.1 et-xmlfile==1.1.0 eth-abi==4.1.0 eth-account==0.9.0 eth-hash==0.5.2 eth-keyfile==0.6.1 eth-keys==0.4.0 eth-rlp==0.3.0 eth-typing==3.4.0 eth-utils==2.2.0 etils==0.9.0 etuples==0.3.5 eva-lcd==0.1.1 expandvars==0.12.0 fastapi==0.99.1 fastdiff==0.3.0 fastprogress==1.0.0 fbprophet @ git+https://github.com/hex-inc/prophet.git@f7cd9b9fa71f5a941421f540cb10b751e59ae2ba filelock==3.8.0 Fiona==1.8.20 flake8==4.0.1 Flask==2.1.3 flatbuffers==2.0.7 folium==0.12.1.post1 fonttools==4.34.4 formulaic==0.5.2 fredapi==0.5.0 frozenlist==1.3.0 fs==2.4.16 fsspec==2023.9.0 fugue==0.8.3 fugue-sql-antlr==0.1.6 funcy==1.16 future==0.18.2 fuzzywuzzy==0.18.0 gast==0.4.0 gcsfs==2023.9.0 gensim==4.3.2 geographiclib==1.52 geopandas==0.12.2 geopy==2.2.0 gitdb==4.0.9 GitPython==3.1.27 google-api-core==2.11.0 google-api-python-client==1.6.7 google-api-support==0.1.3 google-auth==2.17.1 google-auth-httplib2==0.1.0 google-auth-oauthlib==1.0.0 google-cloud-aiplatform==1.36.0 google-cloud-appengine-logging==1.1.1 google-cloud-audit-log==0.2.0 google-cloud-bigquery==3.13.0 google-cloud-bigquery-connection==1.13.2 google-cloud-bigquery-storage==2.19.1 google-cloud-billing==1.5.1 google-cloud-core==2.1.0 google-cloud-functions==1.13.3 google-cloud-iam==2.12.2 google-cloud-logging==3.0.0 google-cloud-resource-manager==1.10.4 google-cloud-storage==2.1.0 google-crc32c==1.3.0 google-pasta==0.2.0 google-resumable-media==2.0.3 googleapis-common-protos==1.59.0 googlemaps==4.5.3 gql==3.4.0 graphql-core==3.2.0 graphviz==0.20.1 greenlet==1.1.2 
gremlinpython==3.6.4 grpc-google-iam-v1==0.12.6 grpcio==1.50.0 grpcio-status==1.41.0 gspread==5.7.1 gspread-dataframe==3.3.0 gspread-pandas==3.2.2 gunicorn==21.2.0 h11==0.14.0 h3==3.7.6 h5py==3.1.0 hdbscan==0.8.33
Editable install with no version control (hex-api==1.0.0)
-e /python-sdk
Editable install with no version control (hex-data-service==0.1.0)
-e /data-service-python
Editable install with no version control (hex-lazy-installer==0.1.0)
-e /hex-lazy-installer
Editable install with no version control (hex-packages==0.1.0)
-e /python-kernel-packages
Editable install with no version control (hex-shared==0.1.0)
-e /python-shared hexbytes==0.2.2
Editable install with no version control (hextoolkit==0.1.0)
-e /python-api hijri-converter==2.2.2 holidays==0.14.2 holoviews==1.15.0 httpcore==1.0.2 httplib2==0.20.1 httpstan==4.8.1 httptools==0.6.0 httpx==0.25.2 hubspot-api-client==8.1.1 huggingface-hub==0.11.1 humanfriendly==10.0 humanize==4.8.0 hyperopt==0.2.7 ibis-framework==6.2.0 idna==3.2 igraph==0.9.11 ijson==3.2.3 imageio==2.9.0 imbalanced-learn==0.9.1 importlib-metadata==6.8.0 importlib-resources==5.10.0 interface-meta==1.3.0 ipykernel==6.25.1 ipython==7.32.0 ipython-genutils==0.2.0 ipywidgets==7.8.1 isodate==0.6.1 itsdangerous==2.1.2 jaraco.classes==3.3.0 jax==0.3.23 jaxlib==0.3.22 jedi==0.17.2 jeepney==0.8.0 Jinja2==3.1.2 jinjasql==0.1.8 jmespath==0.10.0 joblib==1.2.0 jsonpatch==1.33 jsonpath-ng==1.5.3 jsonpointer==2.4 jsonref==1.1.0 jsonschema==4.0.1 jupyter-client==7.0.6 jupyter-dash==0.4.2 jupyter_core==5.3.1 jupyterlab-pygments==0.1.2 jupyterlab-widgets==1.0.2 kaleido==0.2.1 keplergl==0.1.2 keras==2.12.0 Keras-Applications==1.0.8 keyring==24.3.0 kiwisolver==1.3.2 korean-lunar-calendar==0.2.1 langchain==0.0.347 langchain-core==0.0.11 langcodes==3.3.0 langsmith==0.0.69 leather==0.3.4 libclang==14.0.1 lifelines==0.27.4 lightfm==1.17 lightgbm==3.3.5 littleutils==0.2.2 llvmlite==0.40.1 locket==0.2.1 Logbook==1.5.3 logical-unification==0.4.5 loguru==0.6.0 looker-sdk==22.2.1 lru-dict==1.1.7 LunarCalendar==0.0.9 lxml==4.8.0 Mako==1.2.4 Markdown==3.3.4 markdown-it-py==2.2.0 markov-clustering==0.0.6.dev0 MarkupSafe==2.0.1 marshmallow==3.17.0 marshmallow-enum==1.5.1 mashumaro==3.11 matplotlib==3.5.2 matplotlib-inline==0.1.3 matplotlib-venn==0.11.7 mccabe==0.6.1 mdurl==0.1.2 mercantile==1.2.1 miniKanren==1.0.3 minimal-snowplow-tracker==0.0.2 mistune==0.8.4 mixpanel==4.10.0 mizani==0.7.4 mlflow==2.6.0 modelbit==0.29.0 monotonic==1.6 more-itertools==8.10.0 mpmath==1.3.0 msgpack==1.0.5 multidict==6.0.2 multipledispatch==0.6.0 multiprocess==0.70.12.2 multitasking==0.0.11 munch==2.5.0 murmurhash==1.0.5 mypy==0.961 mypy-extensions==0.4.3 natsort==8.4.0 nbclient==0.5.4 
nbconvert==6.2.0 nbformat==5.1.3 nest-asyncio==1.5.1 netCDF4==1.5.7 networkx==2.6.3 nevergrad==0.5.0 nf1==0.0.4 nltk==3.7 notebook==6.4.12 numba==0.57.1 numexpr==2.8.5 numpy==1.23.4 numpy-financial==1.0.0 numpyro==0.10.1 oauth2client==4.1.3 oauthlib==3.1.1 onnxruntime==1.15.1 openai==1.3.7 openapi-schema-pydantic==1.2.4 opencv-python==4.8.0.74 openpyxl==3.0.9 opensearch-py==1.1.0 opt-einsum==3.3.0 optbinning==0.17.3 orbit-ml==1.1.4.2 orjson==3.8.9 ortools==9.4.1874 oscrypto==1.2.1 osqp==0.6.2.post9 outdated==0.2.2 overrides==7.4.0 packaging==21.3 palettable==3.3.0 pandas==1.5.3 pandas-flavor==0.6.0 pandas-gbq==0.19.1 pandasql==0.7.3 pandocfilters==1.5.0 panel==0.13.1 param==1.12.2 paramiko==2.10.2 parsedatetime==2.6 parsimonious==0.9.0 parso==0.7.1 parsy==2.1 partd==1.2.0 pastel==0.2.1 pathlib-mate==1.0.1 pathspec==0.9.0 pathy==0.10.2 patsy==0.5.2 pexpect==4.8.0 pg8000==1.29.2 pickleshare==0.7.5 Pillow==9.1.1 pinecone-client==2.2.4 pingouin==0.5.3 pip==23.3.1 pkginfo==1.9.6 platformdirs==3.0.0 plotly==5.13.0 plotly-resampler==0.8.3.2 plotnine==0.9.0 pluggy==0.13.1 ply==3.11 polars==0.19.2 pooch==1.6.0 posthog==3.0.2 preshed==3.0.5 presto-python-client==0.7.0 progressbar2==4.0.0 prometheus-client==0.11.0 prompt-toolkit==3.0.20 prophet==1.1 proto-plus==1.22.1 protobuf==4.24.1 psutil==5.8.0 psycopg2==2.9.1 ptyprocess==0.7.0 PuLP==2.6.0 pulsar-client==3.2.0 py==1.10.0 py-cpuinfo==9.0.0 py4j==0.10.9.7 pyarrow==10.0.1 pyarrow-hotfix==0.6 pyasn1==0.4.8 pyasn1-modules==0.2.8 pyclustering==0.10.1.2 pycodestyle==2.8.0 pycparser==2.20 pycryptodome==3.14.1 pycryptodomex==3.11.0 pyct==0.4.8 pydantic==1.10.12 pydata-google-auth==1.8.2 pydeck==0.7.1 pydot==1.4.2 pyerf==1.0.1 pyflakes==2.4.0 Pygments==2.10.0 pygraphviz==1.10 pygsheets==2.0.5 pyjanitor==0.25.0 PyJWT==2.4.0 pyLDAvis==3.2.2 pylev==1.4.0 pymc==4.1.5 pymc3==3.11.4 PyMeeus==0.5.11 pymer4==0.8.0 pymongo==4.0.1 pymsteams==0.2.2 PyMySQL==1.0.2 PyNaCl==1.5.0 pynmeagps==1.0.20 pynndescent==0.5.4 pyodbc==4.0.32 
pyOpenSSL==23.2.0 pyparsing==2.4.7 PyPika==0.48.9 pyproj==3.6.1 pyproject_hooks==1.0.0 pyro-api==0.1.2 pyro-ppl==1.8.4 pyrsistent==0.18.0 pysftp==0.2.9 pyshp==2.1.3 pysimdjson==3.2.0 pystan==3.5.0 pytest==5.4.3 python-box==7.0.1 python-dateutil==2.8.2 python-dotenv==1.0.0 python-igraph==0.9.11 python-json-logger==2.0.7
Editable install with no version control (python-kernel-startup==0.1.0)
-e /python-kernel-startup python-Levenshtein==0.12.2 python-louvain==0.16 python-slugify==8.0.1
Editable install with no version control (python-universal-dataframe==0.1.0)
-e /python-universal-dataframe python-utils==3.3.3 pytimeparse==1.1.8 pytorch-ignite==0.4.6 pytrends==4.7.3 pytz==2023.3.post1 pyunormalize==15.0.0 pyviz-comms==2.2.0 PyWavelets==1.1.1 PyYAML==6.0.1 pyzmq==22.3.0 qdldl==0.1.7 qpd==0.4.0 querystring-parser==1.2.4 rasterio==1.2.9 redmail==0.4.0 redshift-connector==2.0.915 regex==2023.5.5 requests==2.28.0 requests-aws4auth==1.2.3 requests-file==1.5.1 requests-oauthlib==1.3.0 requests-toolbelt==0.9.1 retrying==1.3.4 rich==13.2.0 rlp==3.0.0 ropwr==1.0.0 rpy2==3.5.13 rsa==4.7.2 Rtree==0.9.7 ruamel.yaml==0.17.26 ruamel.yaml.clib==0.2.7 s3fs==2023.9.0 s3transfer==0.6.0 SALib==1.4.7 scikit-image==0.18.3 scikit-learn==1.2.2 scikits.bootstrap==1.1.0 scipy==1.10.1 scramp==1.4.1 scs==3.2.3 seaborn==0.12.1 SecretStorage==3.3.3 selenium==3.141.0 semver==2.13.0 Send2Trash==1.8.0 sentence-transformers==2.2.2 sentencepiece==0.1.97 sentry-sdk==1.39.0 setuptools==68.2.2 setuptools-git==1.2 shap==0.41.0 shapely==2.0.1 sidetable==0.9.0 simple-salesforce==1.11.4 simplejson==3.18.1 six==1.15.0 sklearn-pandas==1.8.0 sktime==0.16.1 slack-sdk==3.18.3 slicer==0.0.7 smart-open==5.2.1 smmap==5.0.0 snapshottest==0.6.0 sniffio==1.3.0 snowflake-connector-python==3.1.0 snowflake-ml-python==1.0.7 snowflake-snowpark-python==1.6.1 snowflake-sqlalchemy==1.4.7 snuggs==1.4.7 sortedcontainers==2.4.0 soupsieve==2.2.1 spacy==3.6.0 spacy-legacy==3.0.12 spacy-loggers==1.0.4 sparse==0.14.0 splunk-sdk==1.6.18 sql-metadata==2.5.0 SQLAlchemy==1.4.25 sqlalchemy-redshift==0.7.9 sqlalchemy2-stubs==0.0.2a32 sqlglot==11.4.5 sqlmodel==0.0.8 sqlparse==0.4.2 srsly==2.4.7 starlette==0.27.0 statsforecast==1.5.0 statsmodels==0.14.0 stone==3.2.1 stripe==2.60.0 styleframe==4.1 sympy==1.12 tableauserverclient==0.24 tables==3.8.0 tabulate==0.8.9 tenacity==8.2.2 tensorboard==2.12.1 tensorboard-data-server==0.7.0 tensorboard-plugin-wit==1.8.0 tensorflow==2.12.0 tensorflow-decision-forests==1.3.0 tensorflow-estimator==2.12.0 tensorflow-hub==0.14.0 
tensorflow-io-gcs-filesystem==0.26.0 tensorflow-probability==0.19.0 termcolor==1.1.0 terminado==0.12.1 testpath==0.5.0 text-unidecode==1.3 textblob==0.15.3 texttable==1.6.4 tfcausalimpact==0.0.13 Theano-PyMC==1.1.2 thinc==8.1.10 threadpoolctl==3.0.0 thresholdclustering==1.1 tifffile==2021.8.30 tiktoken==0.5.2 tokenizers==0.13.2 toml==0.10.2 tomli==2.0.1 tomlkit==0.12.1 toolz==0.11.1 torch==1.12.1 torchvision==0.13.1 tornado==6.3.3 tqdm==4.65.0 trace-updater==0.0.9 traitlets==5.9.0 traittypes==0.2.1 transformers==4.25.1 triad==0.8.4 typed-ast==1.5.1 typer==0.4.0 types-cachetools==5.3.0.5 types-pkg-resources==0.1.3 types-PyYAML==6.0.12 types-requests==2.28.2 types-urllib3==1.26.15 typing-inspect==0.9.0 typing_extensions==4.7.1 tzlocal==3.0 ua-parser==0.10.0 ujson==5.4.0 umap-learn==0.5.1 UpSetPlot==0.6.1 uritemplate==3.0.1 urllib3==1.26.16 user-agents==2.2.0 uszipcode==0.2.6 uvicorn==0.23.2 uvloop==0.17.0 validators==0.20.0 vegafusion==1.5.0 vegafusion-python-embed==1.5.0 vl-convert-python==0.13.1 wasabi==1.1.2 wasmer==1.1.0 wasmer_compiler_cranelift==1.1.0 watchfiles==0.19.0 wcwidth==0.2.5 web3==6.8.0 webargs==8.2.0 webencodings==0.5.1 websocket-client==1.6.1 websockets==11.0.3 Werkzeug==2.0.2 wheel==0.38.4 widgetsnbextension==3.6.6 wordcloud==1.8.1 wrapt==1.12.1 wurlitzer==3.0.2 xarray==0.19.0 xarray-einstats==0.3.0 xgboost==1.7.3 xlrd==2.0.1 xxhash==3.4.1 yarl==1.7.2 yfinance==0.1.87 youtube-data-api==0.0.21 zipp==3.10.0 zstandard==0.21.0