sjrusso8 / spark-connect-rs

Apache Spark Connect Client for Rust
https://docs.rs/spark-connect-rs
Apache License 2.0
52 stars, 11 forks

Update readme tracker #37

Closed. abrassel closed 1 month ago

abrassel commented 1 month ago

Description

I went through the PySpark documentation and attempted to:

  1. Map the docs onto Spark Connect.
  2. Go through the source code and determine which APIs are and aren't implemented.

Related Issue(s)

Documentation

https://spark.apache.org/docs/latest/api/python/index.html

sjrusso8 commented 1 month ago

Thanks for doing this! Just a few notes for some of the sections.

  1. Let's remove the sparkContext section and its row from the SparkSession table; it's a JVM attribute and isn't supported with Spark Connect.
  2. Update the comment for remote to refer to the Spark Connect connection string, and link it to this page: https://github.com/apache/spark/blob/master/connector/connect/docs/client-connection-string.md
  3. I think enableHiveSupport is not supported with Spark Connect.
  4. These under StreamingQuery are implemented.
    • id
    • run_id (should be changed to runId)
    • name
    • awaitTermination
    • lastProgress
    • recentProgress
    • isActive
    • status
  5. These under DataFrameReader are implemented.
    • format
    • load
    • option
    • options
    • table
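For item 2, the connection string described on the linked page has roughly the shape `sc://host:port/;key=value;key=value`, with the port defaulting to 15002 when omitted. Below is a minimal, std-only sketch of splitting such a string into its parts. `ConnectTarget` and `parse_connect_string` are hypothetical names made up for illustration; they are not part of spark-connect-rs, and the sketch does not handle IPv6 hosts or validate parameter names.

```rust
use std::collections::HashMap;

/// Hypothetical holder for the parts of a connection string.
/// This is NOT a type from spark-connect-rs; it exists only for this sketch.
#[derive(Debug, PartialEq)]
struct ConnectTarget {
    host: String,
    port: u16,
    params: HashMap<String, String>,
}

/// Parse a string shaped like `sc://host[:port]/;key=value;key=value`.
/// Assumption: the port falls back to 15002 when it is not given.
fn parse_connect_string(s: &str) -> Result<ConnectTarget, String> {
    let rest = s
        .strip_prefix("sc://")
        .ok_or_else(|| format!("expected an `sc://` scheme: {s}"))?;

    // Everything before the first '/' is host[:port]; the rest holds the parameters.
    let (authority, tail) = match rest.split_once('/') {
        Some((a, t)) => (a, t),
        None => (rest, ""),
    };

    // The path component must be empty, so the tail may only be a ';'-separated list.
    if !tail.is_empty() && !tail.starts_with(';') {
        return Err(format!("unexpected path component: `{tail}`"));
    }

    let (host, port) = match authority.rsplit_once(':') {
        Some((h, p)) => (
            h,
            p.parse::<u16>()
                .map_err(|e| format!("invalid port `{p}`: {e}"))?,
        ),
        None => (authority, 15002),
    };
    if host.is_empty() {
        return Err("missing host".to_string());
    }

    // Collect `key=value` pairs, skipping the empty piece before the first ';'.
    let mut params = HashMap::new();
    for pair in tail.split(';').filter(|p| !p.is_empty()) {
        let (k, v) = pair
            .split_once('=')
            .ok_or_else(|| format!("expected `key=value`, got `{pair}`"))?;
        params.insert(k.to_string(), v.to_string());
    }

    Ok(ConnectTarget {
        host: host.to_string(),
        port,
        params,
    })
}
```

Calling `parse_connect_string("sc://127.0.0.1:15002/;user_id=example_rs")` should yield the host, the port, and a single `user_id` parameter, which is the kind of detail the `remote` comment in the README could point to.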

I'm not sure whether UdfRegistration and UdtfRegistration would be possible in Rust. I think each of them depends on the JVM, or on a specific Python function being serialized and then evaluated on the workers.

abrassel commented 1 month ago

I think we can probably do UDFs if we use pyo3 or an equivalent to generate Python lambdas.

abrassel commented 1 month ago

Thanks for the feedback, @sjrusso8! I think I implemented all of the changes.