snowflakedb / spark-snowflake

Snowflake Data Source for Apache Spark.
http://www.snowflake.net
Apache License 2.0
217 stars 99 forks source link

SNOW-1563484 Remove Advanced Query Pushdown Feature #572

Closed sfc-gh-bli closed 3 months ago

sfc-gh-bli commented 3 months ago

Remove Advanced Query Pushdown Feature

jalpan-randeri commented 3 months ago

@sfc-gh-bli , @sfc-gh-yuwang

Can you share some insights why advanced pushdown was removed ? It is creating conflict with the PR - https://github.com/snowflakedb/spark-snowflake/pull/535

That was open and pending review Since Nov. 2023. Despite requesting multiple time reviews.

How does this change will improve performance ?

satyanash commented 3 months ago

Removed the Advanced Query Pushdown feature

@sfc-gh-bli @sfc-gh-yuwang This is quite unexpected! What is the justification for removing this feature?

A conversion tool which can convert DataFrames between Spark and Snowpark will be introduced in the future Spark connector release soon. It will be an alternative of Advanced Query Pushdown feature.

It is unclear if this will work from any spark cluster using the spark-snowflake connector or will it only work on snowpark.

It would be great if there is clarification on the timeline and how "soon" this will be available.

parthshyara commented 3 months ago

@sfc-gh-bli @sfc-gh-yuwang We are relying heavily on the spark-snowflake connector, hence the push for Spark 3.5 Support but the removal of Advanced Query Pushdown is quite a surprise. Can you please share the details on why was this feature removed and what is the plan on alternative? The update on the releases page is quite ambiguous and does not provide clarity.

sfc-gh-bli commented 3 months ago

@sfc-gh-bli , @sfc-gh-yuwang

Can you share some insights why advanced pushdown was removed ? It is creating conflict with the PR - #535

That was open and pending review Since Nov. 2023. Despite requesting multiple time reviews.

How does this change will improve performance ?

We decided to remove Advanced query pushdown feature because:

  1. The coverage of Advanced Query Pushdown feature has been declined in past years. Not all Spark logical plans can be converted to Snowflake SQL queries. The Advanced Query Pushdown feature worked well when it was introduced about 7 years ago. At that time, it could convert most Spark DataFrame operations to Snowflake SQL. However, Spark has introduced many internal changes in the past years. The coverage of Advanced Query Pushdown has declined. This issue is more serious in the Spark 3.5.
  2. The behavior is unpredictable. The behavior of Spark and Snowflake are not exactly same. With Advanced Query Pushdown, user can't predict if the operation will be processed in the Spark, Snowflake, or mix. A small change of DataFrame operation may result in the query can't be pushed down to Snowflake.
  3. One connector artifact may be only compatible with one Spark release. To support multiple Spark releases, we have to generate multiple connector artifacts. We only support the latest three major releases of Spark. However, the case it more and more complicated. for example, the artifact compiled with Spark 3.4.0 doesn't work with Spark 3.4.1. Spark changed logical plan design even in the patch release.

The improvement of removal of Advanced Query Pushdown feature

  1. one artifact can be compatible with multiple Spark release. When Spark release a new version, for example, 3.6.0, the user doesn't have to wait for support. Or, if the user still uses a very old version of Spark, for example 2.1.0, they can continue to upgrade the connector in the most cases.
  2. Users can choose where to process DataFrame operators now, in Spark or Snowflake. Some operators have different behaviors in Spark and Snowflake. About the performance, some operators are faster in Snowflake, at the same time, some are faster in Spark. With the new Snowpark conversion tool, or create Dataframe from query, users can decide which operators should be processed in the Snowflake, and which should be processed in the Spark.
sfc-gh-bli commented 3 months ago

Removed the Advanced Query Pushdown feature

@sfc-gh-bli @sfc-gh-yuwang This is quite unexpected! What is the justification for removing this feature?

A conversion tool which can convert DataFrames between Spark and Snowpark will be introduced in the future Spark connector release soon. It will be an alternative of Advanced Query Pushdown feature.

It is unclear if this will work from any spark cluster using the spark-snowflak connector or will it only work on snowpark.

It would be great if there is clarification on the timeline and how "soon" this will be available.

The conversion tool should works with any Spark cluster where the Spark connector works now. It pretty similar to the Advanced Query Pushdown feature. for example, loading data from Snowflake to Spark. Without Advanced Query Pushdown

val df = spark.read.format("snowflake").options(...).load()
df.select(...).filter(...).union(...).join(...).collect() // connector will try to push down this operators but not guaranteed 

with conversion tool

val snowparkDataFrame = snowpark.table(...).select(...).filter(...).union(...).join(...) // all of these operators will be processed in Snowflake.
val sparkDataFrame = toSpark(snowparkDataFrame, sparkSession)
// all operations on sparkDataFrame will be processed in Spark cluster.

Unlike Advanced Query Pushdown, the new conversion tool also support Spark to Snowpark conversion, for example

val sparkDataFrame = ...
val snowparkDataFrame = toSnowpark(sparkDataFrame, snowparkSession) // all operators on snowparkDataframe will be processed in Snowflake.

how "soon" this will be available.

We are working on it now. It will be available in September, the connector 3.1.0.

sfc-gh-bli commented 3 months ago

@sfc-gh-bli @sfc-gh-yuwang We are relying heavily on the spark-snowflake connector, hence the push for Spark 3.5 Support but the removal of Advanced Query Pushdown is quite a surprise. Can you please share the details on why was this feature removed and what is the plan on alternative? The update on the releases page is quite ambiguous and does not provide clarity.

In the development of Spark 3.5 support, we saw may internal changes of Spark logical plan and internal row system, which significantly declined the coverage of Advanced Query Pushdown. We also saw some wrong results due to the change of internal row system. It is a long discussion to remove the Advanced Query Pushdown. To speed up the Spark 3.5 and feature Spark release support, we finally decided to remove this feature. We will continue to support connector 2.x.x for up to two years, which still has Advanced Query Pushdown feature. The branch of 2.x.x is https://github.com/snowflakedb/spark-snowflake/tree/v2_master However, it is only compatible with Spark 3.2, 3.3, and 3.4.0 (not 3.4.1).

There are two alternatives of Advanced Query Pushdown. 1, instead of directly loading DataFrame from dbtable, loading from query. Those SQL queries will be processed in the Snowflake. So if you use query more than dbtable in your workload, Advanced Query Pushdown may be a useless feature in your case. 2, Using Snowpark and Spark conversion tool. it will be introduced in the connector 3.1.0. You can build a Snowpark DataFrame first, and then convert to Spark DataFrame. The operations of Snowpark DataFrame are always processed on the Snowflake side.

parthshyara commented 1 week ago

@sfc-gh-bli We are still waiting on the new release of the connector with snowpark integration, to evaluate if we can use it. Can you please help with some issue / PR where the progress is being tracked? The initial estimation for the same was September.