abrassel commented 1 month ago

Description

I went through the PySpark documentation and attempted to:

- Map the docs onto Spark Connect
- Go through the source code and determine which APIs are and aren't implemented

Related Issue(s)

closes #18

Documentation

https://spark.apache.org/docs/latest/api/python/index.html

sjrusso8 commented 1 month ago

Thanks for doing this! Just a few notes for some of the sections.

Let's remove the section for sparkContext and its row from the SparkSession table; it's a JVM attribute and isn't supported with Spark Connect.
Update the comment for remote to refer to the Spark Connect connection string and link it to this page: https://github.com/apache/spark/blob/master/connector/connect/docs/client-connection-string.md
I think enableHiveSupport is not supported with Spark Connect.
These under StreamingQuery are implemented:
- id
- run_id (should be changed to runId)
- name
- awaitTermination
- lastProgress
- recentProgress
- isActive
- status
These under DataFrameReader are implemented:
- format
- load
- option
- options
- table
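
For context on the remote note above: the linked page describes connection strings of the form sc://host:port/;key=value;key=value, with the port defaulting to 15002. Below is a minimal sketch of parsing that shape using only the standard library. The ConnectionString struct and parse_connection_string function are hypothetical names for illustration and are not taken from spark-connect-rs; the sketch also assumes a plain hostname or IPv4 address and does not handle bracketed IPv6 hosts.

```rust
use std::collections::HashMap;

/// Hypothetical holder for the parsed pieces (not a spark-connect-rs type).
#[derive(Debug, PartialEq)]
struct ConnectionString {
    host: String,
    port: u16,
    params: HashMap<String, String>,
}

/// Parse a string shaped like `sc://host:port/;key=value;key=value`.
/// Assumption: the host is a plain hostname or IPv4 address.
fn parse_connection_string(s: &str) -> Result<ConnectionString, String> {
    let rest = s
        .strip_prefix("sc://")
        .ok_or_else(|| "scheme must be sc://".to_string())?;

    // Separate `host[:port]` from the optional `/;key=value;...` tail.
    let (authority, tail) = match rest.split_once('/') {
        Some((a, t)) => (a, t),
        None => (rest, ""),
    };

    // Fall back to port 15002 when none is given.
    let (host, port) = match authority.rsplit_once(':') {
        Some((h, p)) => (h, p.parse::<u16>().map_err(|e| format!("bad port: {e}"))?),
        None => (authority, 15002),
    };
    if host.is_empty() {
        return Err("missing host".to_string());
    }

    // Collect the `;`-separated key=value parameters, skipping empty segments.
    let mut params = HashMap::new();
    for pair in tail.split(';').filter(|p| !p.is_empty()) {
        let (k, v) = pair
            .split_once('=')
            .ok_or_else(|| format!("bad parameter: {pair}"))?;
        params.insert(k.to_string(), v.to_string());
    }

    Ok(ConnectionString { host: host.to_string(), port, params })
}
```

Calling parse_connection_string("sc://localhost:15002/;token=abc;use_ssl=true") would yield the host, the port, and a map holding the token and use_ssl entries.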

I'm not sure if UdfRegistration and UdtfRegistration would be possible in Rust. I think each of those depends on the JVM or on a specific Python function being serialized and then evaluated on the workers.

abrassel commented 1 month ago

I think that we can probably do UDFs if we use pyo3 or an equivalent to generate Python lambdas.

abrassel commented 1 month ago

Thanks for the feedback, @sjrusso8! I think I implemented all of the changes.

sjrusso8 / spark-connect-rs

Update readme tracker #37
