sjrusso8 / spark-connect-rs

Apache Spark Connect Client for Rust
https://docs.rs/spark-connect-rs
Apache License 2.0

Check example datasets into source control so they're easier to run #42

Closed MrPowers closed 4 months ago

MrPowers commented 4 months ago

The examples currently use file paths that assume the Docker workflow.

It would be cool if the examples could also work with non-Docker setups (e.g. when I manually spin up Spark Connect on localhost).

Perhaps we can check all of those data files into this repo, so the examples work out of the box with both Docker and a localhost Spark Connect server.

sjrusso8 commented 4 months ago

@MrPowers I'm kinda surprised neither of us realized this.

When trying to run the examples against a local Spark Connect server, we needed to run the sbin/start-connect-server.sh command from the repo directory, not from the $SPARK_HOME directory. So the full command should have been:

$ $SPARK_HOME/sbin/start-connect-server.sh --packages "org.apache.spark:spark-connect_2.12:3.5.1,io.delta:delta-spark_2.12:3.0.0" \
      --conf "spark.driver.extraJavaOptions=-Divy.cache.dir=/tmp -Divy.home=/tmp" \
      --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
      --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"

The $SPARK_HOME environment variable does need to be set for that script to work.
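Once the server is up, connecting from the Rust client looks roughly like this. This is a minimal sketch based on the repo's README examples; exact method signatures (remote, sql, show) may differ between versions.

    use spark_connect_rs::{SparkSession, SparkSessionBuilder};

    #[tokio::main]
    async fn main() -> Result<(), Box<dyn std::error::Error>> {
        // 15002 is the default port used by start-connect-server.sh
        let spark: SparkSession = SparkSessionBuilder::remote("sc://127.0.0.1:15002/")
            .build()
            .await?;

        // quick smoke test that the session is alive
        let df = spark.sql("SELECT 'connected' AS status").await?;
        df.show(Some(1), None, None).await?;

        Ok(())
    }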

In PR #44, I copied all of the existing example datasets into a datasets/ folder and updated docker-compose.yml to mount that same directory as a volume.
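The compose change is essentially a bind mount of that folder into the container, along these lines (the service name and container path here are illustrative, not copied from the PR):

    services:
      spark:
        volumes:
          - ./datasets:/opt/spark/work-dir/datasets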

With those two changes, the examples should work whether you run them against a local Spark Connect server or connect to the Docker container.
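As an illustration, an example can now load data through a repo-relative path that resolves in both setups: the local server is started from the repo directory, and the container mounts the same datasets/ folder. This sketch follows the README's reader example; the file name people.csv is assumed from Spark's stock example data, and the exact load/show signatures may vary by version.

    use spark_connect_rs::{SparkSession, SparkSessionBuilder};

    #[tokio::main]
    async fn main() -> Result<(), Box<dyn std::error::Error>> {
        let spark: SparkSession = SparkSessionBuilder::remote("sc://127.0.0.1:15002/")
            .build()
            .await?;

        // relative to the repo root, where both the local server was
        // started and the Docker volume is mounted
        let paths = ["./datasets/people.csv"];

        // Spark's example people.csv is semicolon-delimited with a header
        let df = spark
            .read()
            .format("csv")
            .option("header", "true")
            .option("delimiter", ";")
            .load(paths)?;

        df.show(Some(5), None, None).await?;

        Ok(())
    }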