snowflakedb / snowpark-python

Snowflake Snowpark Python API
Apache License 2.0
255 stars 106 forks source link

SNOW-837419: Self-join on Snowpark 1.4.0 causes an `invalid_identifier` error #886

Closed dnxie12 closed 1 year ago

dnxie12 commented 1 year ago

Please answer these questions before submitting your issue. Thanks!

  1. What version of Python are you using?
Python 3.8.16 (default, Dec  7 2022, 01:24:57) 
[Clang 14.0.0 (clang-1400.0.29.202)]
  1. What operating system and processor architecture are you using?
macOS-12.2-arm64-arm-64bit
  1. What are the component versions in the environment (pip freeze)?
agate==1.6.3
alembic==1.11.1
aniso8601==9.0.1
anyio==3.7.0
asn1crypto==1.5.1
attrs==23.1.0
Babel==2.12.1
backoff==2.2.1
black==22.12.0
boto3==1.26.150
botocore==1.29.150
certifi==2023.5.7
cffi==1.15.1
cfgv==3.3.1
charset-normalizer==2.1.1
click==8.1.3
cloudpickle==2.0.0
colorama==0.4.5
coloredlogs==14.0
commonmark==0.9.1
coverage==7.2.4
croniter==1.3.15
cronitor==4.6.0
cryptography==36.0.2
dagit==1.1.21
dagster==1.1.21
dagster-aws==0.17.21
dagster-cloud==1.1.21
dagster-cloud-cli==1.1.21
dagster-dbt==0.17.21
dagster-graphql==1.1.21
dask==2023.5.0
dbt-core==1.2.0
dbt-extractor==0.4.1
dbt-junitxml==0.1.5
dbt-snowflake==1.2.0
Deprecated==1.2.14
distlib==0.3.6
docstring-parser==0.15
elasticsearch==7.17.9
elasticsearch-dsl==7.4.1
exceptiongroup==1.1.1
filelock==3.12.0
flatdict==4.0.1
freezegun==1.2.2
fsspec==2023.5.0
future==0.18.3
github3.py==4.0.1
gql==3.4.1
graphene==3.2.2
graphql-core==3.2.3
graphql-relay==3.2.0
grpcio==1.47.5
grpcio-health-checking==1.43.0
h11==0.14.0
hologram==0.0.15
httptools==0.5.0
humanfriendly==10.0
humanize==4.6.0
identify==2.5.23
idna==3.4
importlib-metadata==6.6.0
importlib-resources==5.12.0
iniconfig==2.0.0
isodate==0.6.1
isort==5.12.0
jaraco.classes==3.2.3
Jinja2==2.11.3
jmespath==1.0.1
joblib==1.2.0
jsonschema==3.2.0
junit-xml==1.9
keyring==23.13.1
leather==0.3.4
locket==1.0.0
Logbook==1.5.3
Mako==1.2.4
markdown-it-py==2.2.0
MarkupSafe==2.0.1
mashumaro==2.9
mdurl==0.1.2
minimal-snowplow-tracker==0.0.2
more-itertools==9.1.0
msgpack==1.0.5
multidict==6.0.4
mypy-extensions==1.0.0
mysqlclient==2.1.1
networkx==2.8.8
nltk==3.8.1
nodeenv==1.7.0
numpy==1.24.3
oscrypto==1.3.0
packaging==21.3
pandas==2.0.2
parsedatetime==2.4
partd==1.4.0
pathspec==0.11.1
pendulum==2.1.2
pex==2.1.137
platformdirs==3.5.0
pluggy==1.0.0
pre-commit==3.2.2
prompt-toolkit==3.0.38
protobuf==3.20.3
psycopg2==2.9.6
pyarrow==8.0.0
pycparser==2.21
pycryptodomex==3.18.0
pydantic==1.10.9
PyGithub @ git+https://github.com/creativemarket/PyGithub@538b30b5814fa5093b3785ed022ec58cc2e86bf0
Pygments==2.15.1
PyJWT==2.7.0
PyNaCl==1.5.0
pyOpenSSL==22.0.0
pyparsing==3.0.9
pyrsistent==0.19.3
pytest==7.3.1
pytest-check==2.1.4
pytest-cov==4.0.0
python-dateutil==2.8.2
python-dotenv==1.0.0
python-slugify==8.0.1
pytimeparse==1.1.8
pytz==2023.3
pytzdata==2020.1
PyYAML==6.0
questionary==1.10.0
regex==2023.6.3
replicate==0.8.3
requests==2.31.0
requests-toolbelt==0.10.1
responses==0.23.1
rich==13.4.1
s3transfer==0.6.1
shellingham==1.5.0.post1
six==1.16.0
sniffio==1.3.0
snowflake-connector-python==2.7.12
snowflake-snowpark-python==1.4.0
SQLAlchemy==1.4.48
sqlparse==0.4.4
starlette==0.28.0
tabulate==0.9.0
text-unidecode==1.3
tomli==2.0.1
toolz==0.12.0
toposort==1.10
tqdm==4.65.0
typer==0.9.0
types-PyYAML==6.0.12.9
typing_extensions==4.6.3
tzdata==2023.3
universal_pathlib==0.0.23
uritemplate==4.1.1
urllib3==1.26.16
uvicorn==0.22.0
uvloop==0.17.0
virtualenv==20.23.0
watchdog==3.0.0
watchfiles==0.19.0
wcwidth==0.2.6
websockets==11.0.3
Werkzeug==2.1.2
wrapt==1.15.0
yarl==1.9.2
zipp==3.15.0
  1. What did you do?

Self-join on a dataframe created through filter leads to an invalid_identifier error. The same code does not cause such errors on Snowpark 0.12.0 (last version I was on). Here's an example to reproduce this error:

df1 = session.create_dataframe([[1, 2], [3, 4], [5, 6]], schema=["a", "b"])
df2 = df1.filter(col("a") != 5)
df1.join(df2, df1.a == df2.a).select(
    df1.a.alias("a_1"), df2.a.alias("a_2"), df1.b
).show()
  1. What did you expect to see?

The join should happen without errors.

  1. Can you set logging to DEBUG and collect the logs?
Traceback (most recent call last):
  File "self-join.py", line 21, in <module>
    df1.join(df2, df1.a == df2.a).select(
  File "/Users/dnxie/.local/share/virtualenvs/analytics-X-832nVn/lib/python3.8/site-packages/snowflake/snowpark/_internal/telemetry.py", line 184, in wrap
    r = func(*args, **kwargs)
  File "/Users/dnxie/.local/share/virtualenvs/analytics-X-832nVn/lib/python3.8/site-packages/snowflake/snowpark/dataframe.py", line 1027, in select
    return self._with_plan(self._select_statement.select(names))
  File "/Users/dnxie/.local/share/virtualenvs/analytics-X-832nVn/lib/python3.8/site-packages/snowflake/snowpark/_internal/analyzer/select_statement.py", line 528, in select
    new_column_states = derive_column_states_from_subquery(cols, self)
  File "/Users/dnxie/.local/share/virtualenvs/analytics-X-832nVn/lib/python3.8/site-packages/snowflake/snowpark/_internal/analyzer/select_statement.py", line 988, in derive_column_states_from_subquery
    from_c_state = from_.column_states.get(quoted_c_name)
  File "/Users/dnxie/.local/share/virtualenvs/analytics-X-832nVn/lib/python3.8/site-packages/snowflake/snowpark/_internal/analyzer/select_statement.py", line 418, in column_states
    self._column_states = self.from_.column_states
  File "/Users/dnxie/.local/share/virtualenvs/analytics-X-832nVn/lib/python3.8/site-packages/snowflake/snowpark/_internal/analyzer/select_statement.py", line 233, in column_states
    self.snowflake_plan.attributes, self.analyzer
  File "/opt/homebrew/Cellar/python@3.8/3.8.16/Frameworks/Python.framework/Versions/3.8/lib/python3.8/functools.py", line 967, in __get__
    val = self.func(instance)
  File "/Users/dnxie/.local/share/virtualenvs/analytics-X-832nVn/lib/python3.8/site-packages/snowflake/snowpark/_internal/analyzer/snowflake_plan.py", line 237, in attributes
    output = analyze_attributes(self.schema_query, self.session)
  File "/Users/dnxie/.local/share/virtualenvs/analytics-X-832nVn/lib/python3.8/site-packages/snowflake/snowpark/_internal/analyzer/schema_utils.py", line 82, in analyze_attributes
    return session._get_result_attributes(sql)
  File "/Users/dnxie/.local/share/virtualenvs/analytics-X-832nVn/lib/python3.8/site-packages/snowflake/snowpark/session.py", line 1238, in _get_result_attributes
    return self._conn.get_result_attributes(query)
  File "/Users/dnxie/.local/share/virtualenvs/analytics-X-832nVn/lib/python3.8/site-packages/snowflake/snowpark/_internal/analyzer/snowflake_plan.py", line 174, in wrap
    raise ne.with_traceback(tb) from None
  File "/Users/dnxie/.local/share/virtualenvs/analytics-X-832nVn/lib/python3.8/site-packages/snowflake/snowpark/_internal/analyzer/snowflake_plan.py", line 109, in wrap
    return func(*args, **kwargs)
  File "/Users/dnxie/.local/share/virtualenvs/analytics-X-832nVn/lib/python3.8/site-packages/snowflake/snowpark/_internal/server_connection.py", line 206, in get_result_attributes
    return convert_result_meta_to_attribute(self._cursor.describe(query))
  File "/Users/dnxie/.local/share/virtualenvs/analytics-X-832nVn/lib/python3.8/site-packages/snowflake/connector/cursor.py", line 826, in describe
    self.execute(*args, **kwargs)
  File "/Users/dnxie/.local/share/virtualenvs/analytics-X-832nVn/lib/python3.8/site-packages/snowflake/connector/cursor.py", line 804, in execute
    Error.errorhandler_wrapper(self.connection, self, error_class, errvalue)
  File "/Users/dnxie/.local/share/virtualenvs/analytics-X-832nVn/lib/python3.8/site-packages/snowflake/connector/errors.py", line 276, in errorhandler_wrapper
    handed_over = Error.hand_to_other_handler(
  File "/Users/dnxie/.local/share/virtualenvs/analytics-X-832nVn/lib/python3.8/site-packages/snowflake/connector/errors.py", line 331, in hand_to_other_handler
    cursor.errorhandler(connection, cursor, error_class, error_value)
  File "/Users/dnxie/.local/share/virtualenvs/analytics-X-832nVn/lib/python3.8/site-packages/snowflake/connector/errors.py", line 210, in default_errorhandler
    raise error_class(
snowflake.snowpark.exceptions.SnowparkSQLException: (1304): 01aceb74-0404-92ce-0048-900704735266: 000904 (42000): SQL compilation error: error line 1 at position 193
invalid identifier 'A'
sfc-gh-stan commented 1 year ago

Hi @dnxie12 , I can actually reproduce the same error against snowflake-snowpark-python==0.12.0. Could you please double check and confirm whether this is a regression?

sfc-gh-stan commented 1 year ago

Meanwhile, you could work around this by doing:

df2 = copy(df1.filter(col("a") != 5))
dnxie12 commented 1 year ago

Hi @dnxie12 , I can actually reproduce the same error against snowflake-snowpark-python==0.12.0. Could you please double check and confirm whether this is a regression?

Ah that's odd, I re-tried on 0.12.0 and it works as expected, with the following generated SQL:

SELECT  *  FROM ( SELECT "l_nund_A" AS "A_1", "r_ifbk_A" AS "A_2", "l_nund_B" FROM ( SELECT  *  FROM (( SELECT "A" AS "l_nund_A", "B" AS "l_nund_B" FROM ( SELECT "A", "B" FROM ( SELECT $1 AS "A", $2 AS "B" FROM  VALUES (1 :: INT, 2 :: INT), (3 :: INT, 4 :: INT), (5 :: INT, 6 :: INT)))) AS SNOWPARK_TEMP_TABLE_LX74PXI7IR INNER JOIN ( SELECT "A" AS "r_ifbk_A", "B" AS "r_ifbk_B" FROM ( SELECT  *  FROM ( SELECT "A", "B" FROM ( SELECT $1 AS "A", $2 AS "B" FROM  VALUES (1 :: INT, 2 :: INT), (3 :: INT, 4 :: INT), (5 :: INT, 6 :: INT))) WHERE ("A" != 5 :: INT))) AS SNOWPARK_TEMP_TABLE_UXJOQWFJGO ON ("l_nund_A" = "r_ifbk_A")))) LIMIT 10 OFFSET 0

Thanks for the workaround!