snowflakedb / snowpark-python

Snowflake Snowpark Python API
Apache License 2.0
269 stars 111 forks source link

SNOW-1637096: When using order_by, the resulting query does not use the correct defaults for NULL [FIRST | LAST] as per the docs #2150

Open samuelsongsr opened 2 months ago

samuelsongsr commented 2 months ago

Please answer these questions before submitting your issue. Thanks!

  1. What version of Python are you using?

    Python 3.10.4 (main, May 26 2022, 13:33:07) [GCC 4.8.5 20150623 (Red Hat 4.8.5-44)

  2. What operating system and processor architecture are you using?

    Linux-3.10.0-1127.19.1.el7.x86_64-x86_64-with-glibc2.17

  3. What are the component versions in the environment (pip freeze)?

    asn1crypto==1.5.1 astroid==2.15.4 asttokens==2.4.0 async-timeout==4.0.2 backcall==0.2.0 boto3==1.26.126 botocore==1.29.126 cachetools==5.3.3 certifi==2022.12.7 cffi==1.15.1 charset-normalizer==3.1.0 click==7.1.2 cloudpickle==2.2.1 colorama==0.4.6 comm==0.1.4 coverage==7.2.5 cryptography==3.4.8 debugpy==1.8.0 decorator==5.1.1 Deprecated==1.2.13 dill==0.3.6 exceptiongroup==1.1.1 executing==2.0.0 factory-boy==3.3.0 Faker==25.8.0 filelock==3.15.1 frozendict==2.4.4 greenlet==2.0.2 hvac==2.2.0 idna==3.4 iniconfig==2.0.0 ipykernel==6.25.2 ipython==8.16.1 isort==5.12.0 jedi==0.19.1 jmespath==1.0.1 jupyter_client==8.3.1 jupyter_core==5.3.2 lazy-object-proxy==1.9.0 lxml==4.9.2 Mako==1.2.4 MarkupSafe==2.0.1 marshmallow_dataclass==7.6.0 marshmallow-enum==1.5.1 marshmallow==3.21.3 matplotlib-inline==0.1.6 mccabe==0.7.0 multimethod==1.10 mypy-extensions==1.0.0 mypy==1.2.0 nest-asyncio==1.5.8 numpy==1.24.3 oauthlib==3.2.2 packaging==23.1 pandas==2.2.2 parso==0.8.3 pexpect==4.8.0 pickleshare==0.7.5 platformdirs==3.5.0 pluggy==1.0.0 prompt-toolkit==3.0.39 psutil==5.9.5 ptyprocess==0.7.0 pure-eval==0.2.2 pyarrow==10.0.1 pycparser==2.21 pydantic==1.10.13 Pygments==2.16.1 PyJWT==2.0.1 pylint==2.17.3 pyOpenSSL==21.0.0 pytest-cov==2.12.1 pytest-mock==2.0.0 pytest==7.3.1 python-dateutil==2.8.2 python-gnupg==0.5.0 python-json-logger==2.0.7 pytz==2021.3 PyYAML==6.0.1 pyzmq==25.1.1 redis==4.5.4 requests-oauthlib==1.3.1 requests==2.29.0 s3transfer==0.6.0 six==1.16.0 snowflake-connector-python==3.12.1 snowflake-snowpark-python==1.18.0 sortedcontainers==2.4.0 stack-data==0.6.3 stringcase==1.2.0 structlog==21.5.0 tabulate==0.9.0 tenacity==8.2.3 toml==0.10.2 tomli==2.0.1 tomlkit==0.11.8 tornado==6.3.3 traitlets==5.11.1 typeguard==4.0.0 types-click==7.1.8 types-python-dateutil==2.8.19.12 types-requests==2.29.0.0 types-tabulate==0.9.0.20240106 types-urllib3==1.26.25.12 typing_extensions==4.5.0 typing-inspect==0.8.0 tzdata==2024.1 urllib3==1.26.15 wcwidth==0.2.8 Werkzeug==1.0.1 wrapt==1.15.0

  4. What did you do? -- this query generates: SELECT "A", "B" FROM TEST_TABLE ORDER BY "A" ASC NULLS FIRST asc_query = session.table("TEST_TABLE").select("A", "B").order_by(col("A").asc())

    -- this query generates: SELECT "A", "B" FROM TEST_TABLE ORDER BY "A" DESC NULLS LAST desc_query = session.table("TEST_TABLE").select("A", "B").order_by(col("a").desc())

  5. What did you expect to see? I expected the ASC sort to either use no option for the ordering of null values or use NULLS LAST to keep the default sorting semantics of Snowflake.

    I expected the DESC sort to either use no option for the ordering of null values or use NULLS LAST to keep the default sorting semantics of Snowflake.

    Expectations based on the docs: Order By

    • Unless specified otherwise, NULL values are considered to be higher than any non-NULL values. As a result, the ordering for NULLS depends on the sort order:
    • If the sort order is ASC, NULLS are returned last; to force NULLS to be first, use NULLS FIRST.
    • If the sort order is DESC, NULLS are returned first; to force NULLS to be last, use NULLS LAST.
sfc-gh-sghosh commented 2 months ago

Hello @samuelsongsr ,

Thanks for raising the issue, you are right, as per documentation

Unless specified otherwise, NULL values are considered to be higher than any non-NULL values. As a result, the ordering for NULLS depends on the sort order:

If the sort order is ASC, NULLS are returned last; to force NULLS to be first, use NULLS FIRST.

If the sort order is DESC, NULLS are returned first; to force NULLS to be last, use NULLS LAST.

The output for below query should be NULL at last for ASC order

asc_query = session.table("TEST_TABLE").select("A", "B").order_by(col("A").asc())
asc_results = asc_query.collect()

print("Results for ASC Query:")
for row in asc_results:
    print(row)

The output is similar to using "asc_nulls_first" asc_query1 = session.table("TEST_TABLE").select("A", "B").order_by(col("A").asc_nulls_first())

The same goes for DESC also, As per the doc, the NULL should appear first, but it's coming as LAST

session.table("TEST_TABLE").select("A", "B").order_by(col("A").desc())
its behaving like session.table("TEST_TABLE").select("A", "B").order_by(col("A").desc_nulls_last())

Code:

# Define a schema for the table
schema = StructType([
    StructField("A", StringType()),
    StructField("B", StringType())
])

# Create the table with some example data
data = [("1", "Apple"), ("3", "Banana"), ("2", "Orange"), (None, "Peach")]
df = session.create_dataframe(data, schema)

# Write the DataFrame to a table
df.write.mode("overwrite").save_as_table("TEST_TABLE")

# Query 1: SELECT "A", "B" FROM TEST_TABLE ORDER BY "A" ASC NULLS FIRST
asc_query = session.table("TEST_TABLE").select("A", "B").order_by(col("A").asc())
asc_results = asc_query.collect()

# Print the results of the ASC query
print("Results for ASC Query:")
for row in asc_results:
    print(row)

# Query 2:
asc_query1 = session.table("TEST_TABLE").select("A", "B").order_by(col("A").asc_nulls_first())
asc_results1 = asc_query1.collect()

# Print the results of the ASC query
print("Results for ASC Query:")
for row in asc_results1:
    print(row)

# Query 2: SELECT "A", "B" FROM TEST_TABLE ORDER BY "A" DESC NULLS LAST
desc_query = session.table("TEST_TABLE").select("A", "B").order_by(col("A").desc())
desc_results = desc_query.collect()

# Print the results of the DESC query
print("\nResults for DESC Query:")
for row in desc_results:
    print(row)

# Query 2: 
desc_query2 = session.table("TEST_TABLE").select("A", "B").order_by(col("A").desc_nulls_last())
desc_results2 = desc_query2.collect()

# Print the results of the DESC query
print("\nResults for DESC Query:")
for row in desc_results2:
    print(row)

 Output:
Results for ASC Query:
Row(A=None, B='Peach')
Row(A='1', B='Apple')
Row(A='2', B='Orange')
Row(A='3', B='Banana')
Results for ASC Query:
Row(A=None, B='Peach')
Row(A='1', B='Apple')
Row(A='2', B='Orange')
Row(A='3', B='Banana')

Results for DESC Query:
Row(A='3', B='Banana')
Row(A='2', B='Orange')
Row(A='1', B='Apple')
Row(A=None, B='Peach')

Results for DESC Query:
Row(A='3', B='Banana')
Row(A='2', B='Orange')
Row(A='1', B='Apple')
Row(A=None, B='Peach')

Will work on eliminating it, will update.

Regards, Sujan