snowflakedb / snowflake-connector-python

Snowflake Connector for Python
https://pypi.python.org/pypi/snowflake-connector-python/
Apache License 2.0

SNOW-806291: Fix Snowpark much slower than Spark Connector on Databricks (for collect() and toPandas()) #1946

Open sfc-gh-yuwang opened 1 month ago

sfc-gh-yuwang commented 1 month ago

Please answer these questions before submitting your pull requests. Thanks!

  1. What GitHub issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

  2. Fill out the following pre-review checklist:

    • [ ] I am adding a new automated test(s) to verify correctness of my new code
    • [ ] I am adding new logging messages
    • [ ] I am adding a new telemetry message
    • [ ] I am modifying authorization mechanisms
    • [ ] I am adding new credentials
    • [ ] I am modifying OCSP code
    • [ ] I am adding a new dependency
  3. Please describe how your code solves the related issue.

    Please write a short description of how your code change solves the related issue.

sfc-gh-yuwang commented 3 weeks ago

Since this PR does not change any behavior of the connector and only speeds up the to_pandas function, no new tests need to be added. Instead, I ran an existing Jenkins job to verify that this PR is valid and does not break anything. Here is a link to a successful run with Snowpark: https://ci-dev-142.int.snowflakecomputing.com/job/SnowparkPythonSnowflakePythonClientRegressRunner/633/

sfc-gh-yuwang commented 1 week ago

Here is a link to my research on to_pandas performance, which explains why this change works: https://docs.google.com/document/d/1HK7tNYoSLxQHSl7e_TkGxOZoz_Jqsjx0mhS6t6K_D_g/edit?usp=sharing

sfc-gh-yuwang commented 5 days ago

I have changed the way this improvement is implemented. For now, the improvement only applies when calling to_pandas(); to_pandas_batches() and other functions will not use it.
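For anyone who wants to check which code path benefits, here is a minimal timing sketch. It is not part of this PR: `time_call` is a hypothetical helper, and `df` in the commented usage is assumed to be a Snowpark DataFrame on a live session.

```python
import time
from typing import Any, Callable, Tuple


def time_call(fn: Callable[[], Any]) -> Tuple[Any, float]:
    """Run fn once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


# Usage against a live session (not run here; `df` is assumed to be a
# snowflake.snowpark.DataFrame):
#   full, t_full = time_call(df.to_pandas)
#   batches, t_batches = time_call(lambda: list(df.to_pandas_batches()))
#   print(f"to_pandas: {t_full:.2f}s, to_pandas_batches: {t_batches:.2f}s")
```

A single run is noisy, so repeating each call a few times and comparing medians gives a fairer picture.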