Closed: sfc-gh-bli closed this issue 3 months ago.
@sfc-gh-bli, @sfc-gh-yuwang
Can you share some insight into why Advanced Query Pushdown was removed? It conflicts with PR https://github.com/snowflakedb/spark-snowflake/pull/535, which has been open and pending review since Nov. 2023, despite multiple review requests.
How will this change improve performance?
> Removed the Advanced Query Pushdown feature
@sfc-gh-bli @sfc-gh-yuwang This is quite unexpected! What is the justification for removing this feature?
A conversion tool that can convert DataFrames between Spark and Snowpark will be introduced in an upcoming Spark connector release. It will be an alternative to the Advanced Query Pushdown feature.
It is unclear whether this will work from any Spark cluster using the `spark-snowflake` connector, or only on Snowpark. It would be great to have clarification on the timeline and how "soon" this will be available.
@sfc-gh-bli @sfc-gh-yuwang
We rely heavily on the `spark-snowflake` connector, hence the push for Spark 3.5 support, but the removal of Advanced Query Pushdown is quite a surprise. Can you please share why this feature was removed and what the planned alternative is? The update on the releases page is ambiguous and does not provide clarity.
> @sfc-gh-bli, @sfc-gh-yuwang
> Can you share some insight into why Advanced Query Pushdown was removed? It conflicts with PR #535, which has been open and pending review since Nov. 2023, despite multiple review requests.
> How will this change improve performance?
We decided to remove the Advanced Query Pushdown feature. The improvement from its removal: with `query`, users can decide which operators should be processed in Snowflake and which should be processed in Spark.

> Removed the Advanced Query Pushdown feature
> @sfc-gh-bli @sfc-gh-yuwang This is quite unexpected! What is the justification for removing this feature?
>
> A conversion tool that can convert DataFrames between Spark and Snowpark will be introduced in an upcoming Spark connector release. It will be an alternative to the Advanced Query Pushdown feature.
> It is unclear whether this will work from any Spark cluster using the `spark-snowflake` connector, or only on Snowpark. It would be great to have clarification on the timeline and how "soon" this will be available.
The conversion tool should work with any Spark cluster where the Spark connector works now. It is pretty similar to the Advanced Query Pushdown feature. For example, loading data from Snowflake to Spark without Advanced Query Pushdown:

```scala
val df = spark.read.format("snowflake").options(...).load()
// The connector will try to push these operators down, but this is not guaranteed.
df.select(...).filter(...).union(...).join(...).collect()
```
With the conversion tool:

```scala
// All of these operators will be processed in Snowflake.
val snowparkDataFrame = snowpark.table(...).select(...).filter(...).union(...).join(...)
val sparkDataFrame = toSpark(snowparkDataFrame, sparkSession)
// All operations on sparkDataFrame will be processed in the Spark cluster.
```
Unlike Advanced Query Pushdown, the new conversion tool also supports Spark-to-Snowpark conversion, for example:

```scala
val sparkDataFrame = ...
// All operators on snowparkDataFrame will be processed in Snowflake.
val snowparkDataFrame = toSnowpark(sparkDataFrame, snowparkSession)
```
> how "soon" this will be available.
We are working on it now. It will be available in September, in connector 3.1.0.
> @sfc-gh-bli @sfc-gh-yuwang We rely heavily on the `spark-snowflake` connector, hence the push for Spark 3.5 support, but the removal of Advanced Query Pushdown is quite a surprise. Can you please share why this feature was removed and what the planned alternative is? The update on the releases page is ambiguous and does not provide clarity.
During the development of Spark 3.5 support, we saw many internal changes to Spark's logical plan and internal row system, which significantly reduced the coverage of Advanced Query Pushdown. We also saw some wrong results caused by the changes to the internal row system. Removing Advanced Query Pushdown was a long discussion; to speed up support for Spark 3.5 and future Spark releases, we finally decided to remove this feature.

We will continue to support connector 2.x.x, which still has the Advanced Query Pushdown feature, for up to two years. The 2.x.x branch is https://github.com/snowflakedb/spark-snowflake/tree/v2_master; however, it is only compatible with Spark 3.2, 3.3, and 3.4.0 (not 3.4.1).
There are two alternatives to Advanced Query Pushdown.
1. Instead of loading a DataFrame directly from `dbtable`, load it from `query`. Those SQL queries will be processed in Snowflake. So if you use `query` more than `dbtable` in your workloads, Advanced Query Pushdown may be a useless feature in your case.
2. Use the Snowpark/Spark conversion tool, which will be introduced in connector 3.1.0. You can build a Snowpark DataFrame first and then convert it to a Spark DataFrame. The operations on the Snowpark DataFrame are always processed on the Snowflake side.
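To make alternative 1 concrete, here is a sketch of the two read styles side by side. The connection options, table name (`ORDERS`), and columns are hypothetical placeholders; `query` and `dbtable` are the connector's standard read options.

```scala
// Hypothetical connection options; substitute your own account details.
val sfOptions = Map(
  "sfURL" -> "<account>.snowflakecomputing.com",
  "sfUser" -> "<user>",
  "sfPassword" -> "<password>",
  "sfDatabase" -> "<database>",
  "sfSchema" -> "<schema>",
  "sfWarehouse" -> "<warehouse>"
)

// With "dbtable", the whole table is exposed to Spark; without Advanced
// Query Pushdown, later filters and aggregates may run on the Spark side.
val fromTable = spark.read
  .format("snowflake")
  .options(sfOptions)
  .option("dbtable", "ORDERS")
  .load()

// With "query", the SQL below is executed entirely inside Snowflake;
// only its result set is transferred to the Spark cluster.
val fromQuery = spark.read
  .format("snowflake")
  .options(sfOptions)
  .option("query",
    "SELECT customer_id, SUM(amount) AS total FROM ORDERS GROUP BY customer_id")
  .load()
```

With the `query` style, the aggregation happens in Snowflake regardless of whether Advanced Query Pushdown exists, which is why heavy `query` users lose little from its removal.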
@sfc-gh-bli We are still waiting on the new release of the connector with Snowpark integration to evaluate whether we can use it. Can you please point to an issue or PR where the progress is being tracked? The initial estimate was September.