sjrusso8 / spark-connect-rs

Apache Spark Connect Client for Rust
https://docs.rs/spark-connect-rs
Apache License 2.0

Add foreach Functionality for DataFrame Operations #20

Open Shindora opened 2 months ago

Shindora commented 2 months ago

This pull request introduces a foreach method to the DataFrame API. The method asynchronously iterates over the rows of the DataFrame and applies a user-provided closure to each one, allowing custom operations or side effects to be performed per row.

Details:

- Implemented the foreach method on the DataFrame struct, which asynchronously applies a closure to each row.
- Added support for asynchronous iteration over the rows of the DataFrame via the foreach method.
- Ensured that errors encountered during iteration are handled and propagated.
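A minimal sketch of the row iteration described in the first bullet above, phrased as a free function over already-collected Arrow batches rather than against the crate's internals (the name foreach_rows and the generic error type are illustrative only, not the PR's actual API; in the PR the logic lives on the DataFrame struct as an async method, and the exact return type of collect() depends on the crate version):

```rust
use arrow::record_batch::RecordBatch;

/// Sketch of client-side per-row iteration.
/// `batches` stands in for whatever `df.collect().await?` yields;
/// the closure is applied once per row, stopping at the first error.
fn foreach_rows<F, E>(batches: &[RecordBatch], mut f: F) -> Result<(), E>
where
    F: FnMut(&RecordBatch, usize) -> Result<(), E>,
{
    for batch in batches {
        for row in 0..batch.num_rows() {
            f(batch, row)?;
        }
    }
    Ok(())
}
```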

Testing:

- Added unit tests to validate the functionality of the foreach method.
- Covered scenarios including empty DataFrames, DataFrames with varying numbers of rows, and DataFrames with different column types.
- Ran integration tests to ensure compatibility with other DataFrame operations and functionality.
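As a rough illustration of the scenarios listed above (empty input, multiple rows), a unit-test sketch written against the standalone foreach_rows helper from the previous snippet, not taken from the PR's actual test suite, might look like this:

```rust
#[cfg(test)]
mod tests {
    use super::foreach_rows;
    use arrow::array::{ArrayRef, Int64Array};
    use arrow::datatypes::{DataType, Field, Schema};
    use arrow::record_batch::RecordBatch;
    use std::sync::Arc;

    #[test]
    fn closure_is_never_called_for_empty_input() {
        let mut calls = 0;
        foreach_rows::<_, ()>(&[], |_, _| {
            calls += 1;
            Ok(())
        })
        .unwrap();
        assert_eq!(calls, 0);
    }

    #[test]
    fn closure_runs_once_per_row() {
        // Build a single batch with one Int64 column and three rows.
        let schema = Arc::new(Schema::new(vec![Field::new("id", DataType::Int64, false)]));
        let batch = RecordBatch::try_new(
            schema,
            vec![Arc::new(Int64Array::from(vec![1_i64, 2, 3])) as ArrayRef],
        )
        .unwrap();

        let mut seen = Vec::new();
        foreach_rows::<_, ()>(&[batch], |b, row| {
            seen.push((b.num_rows(), row));
            Ok(())
        })
        .unwrap();

        assert_eq!(seen, vec![(3, 0), (3, 1), (3, 2)]);
    }
}
```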

Additional Notes:

The foreach method provides a convenient way to perform asynchronous operations on DataFrame rows without collecting the DataFrame into memory. Users should be aware of the potential concurrency issues and thread safety considerations when using the foreach method in multi-threaded environments.
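For illustration only, since neither foreach nor the helper above is part of a released API: a per-row side effect such as printing a column could be expressed over already-collected batches as follows, reusing the foreach_rows sketch; the column index and Int64 type are assumptions for the example.

```rust
use arrow::array::{Array, Int64Array};
use arrow::record_batch::RecordBatch;

/// Example side effect: print the first column of every row.
/// In real usage the batches would come from something like `df.collect().await?`.
fn print_first_column(batches: &[RecordBatch]) -> Result<(), String> {
    foreach_rows(batches, |batch, row| {
        let col = batch
            .column(0)
            .as_any()
            .downcast_ref::<Int64Array>()
            .ok_or_else(|| "expected the first column to be Int64".to_string())?;
        println!("row {row}: {}", col.value(row));
        Ok(())
    })
}
```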

[Screenshot attached: 2024-04-21 03:17:33]
sjrusso8 commented 2 months ago

Thanks for making this pull request! I have not had a good chance to look at implementing things like foreach. Technically, the foreach method is not available in the released versions of Spark Connect; the roadmap seems to indicate it will be added in Spark v4.

Correct me if I am wrong, but I think this implementation might need to be a little more involved. The planned implementation for the PySpark Connect client is to register the provided function as a UDF and then call collect to force the action: the function is serialized, the work is distributed to the workers, and the result is collected. The implementation in this PR instead collects all the values onto the client first and then applies the function over the RecordBatch, so it may not deliver the expected behavior of avoiding collecting the DataFrame into memory on the client.