sjrusso8 / spark-connect-rs

Apache Spark Connect Client for Rust
https://docs.rs/spark-connect-rs
Apache License 2.0

feat(dataframe): implement transform #34

Closed sjrusso8 closed 1 month ago

sjrusso8 commented 1 month ago

Description

feat(dataframe): implement transform

Example Usage

This uses closures and differs slightly from the PySpark implementation. PySpark allows the transform function to take either positional or keyword arguments; that specific option is tricky to implement in Rust, so I opted for a closure that accepts and returns a DataFrame. Any additional arguments are supplied by the user as part of the closure's captured scope.

let df = spark.range(None, 1, 1, None);

// closure with a captured value from the immediate scope
// results in the `lit` val being 100
let val: i64 = 100;
let func = |df: DataFrame| -> DataFrame {
    df.withColumn("new_col", lit(val)).select("new_col")
};

// a dataframe with a column `new_col` and a value of `100`
let df = df.transform(func);
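The pattern boils down to a generic method that takes any `Fn(DataFrame) -> DataFrame` and applies it. Here is a minimal, self-contained sketch of that shape; the `DataFrame` type below is a stand-in holding a `Vec<i64>` purely for illustration (the real spark-connect-rs DataFrame wraps a logical plan and talks to a Spark Connect server).

```rust
// Stand-in DataFrame for illustration only; NOT the spark-connect-rs type.
#[derive(Debug)]
struct DataFrame {
    rows: Vec<i64>,
}

impl DataFrame {
    // transform accepts any closure (or fn) mapping DataFrame -> DataFrame
    // and applies it, so user-defined helpers can be method-chained.
    fn transform<F>(self, func: F) -> DataFrame
    where
        F: Fn(DataFrame) -> DataFrame,
    {
        func(self)
    }
}

fn main() {
    let df = DataFrame { rows: vec![1, 2, 3] };

    // closure capturing `val` from the enclosing scope, mirroring the
    // `lit(val)` capture in the example above
    let val: i64 = 100;
    let add_val = |df: DataFrame| -> DataFrame {
        DataFrame { rows: df.rows.iter().map(|x| x + val).collect() }
    };

    let out = df.transform(add_val);
    println!("{:?}", out.rows); // prints [101, 102, 103]

    // transforms chain naturally, since each call returns a DataFrame
    let doubled = out.transform(|d| DataFrame {
        rows: d.rows.iter().map(|x| x * 2).collect(),
    });
    println!("{:?}", doubled.rows); // prints [202, 204, 206]
}
```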
sjrusso8 commented 1 month ago

@hntd187 and @abrassel looking for some input on this :) Does this implementation make sense?

PS @MrPowers I know you are a big advocate of using the transform method. How does this look?

hntd187 commented 1 month ago

I wouldn't try to emulate the Python call signature; that is likely something that will drive you insane. Your definition is very similar to the Scala one: https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html#transform[U](t:org.apache.spark.sql.Dataset[T]=%3Eorg.apache.spark.sql.Dataset[U]):org.apache.spark.sql.Dataset[U]

So I think one function call here, DataFrame in, DataFrame out, is fine, and people can just chain them along as they go. And since the call site is `self`, I think this is good.

abrassel commented 1 month ago

I also like this implementation :)