opendatahub-io-contrib / datamesh-platform

Default configuration for Spark standing cluster #37

Open caldeirav opened 1 month ago

caldeirav commented 1 month ago

As a data engineer I want to be able to initialize my Spark environment across multiple jobs or sessions without a complicated series of commands.

In Spark, we can use conf/spark-defaults.conf to set common configuration across runs. This approach is typically used for configuration that doesn’t change often, like defining catalogs. Each invocation will include the defaults, which can be overridden by settings passed through the command-line. The file format is a Java properties file.

When configuring an environment for the data mesh, it is best to include the Iceberg dependencies in the jars directory of our Spark distribution so they are available when Spark is deployed and not fetched each time a session is started.

Reference: https://tabular.io/apache-iceberg-cookbook/getting-started-spark-configuration/
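
For illustration, a spark-defaults.conf along these lines might look like the sketch below; the catalog name demo and the warehouse path are placeholders, not the platform's actual settings.

```properties
# conf/spark-defaults.conf -- minimal sketch, catalog name and paths are placeholders
# Enable the Iceberg SQL extensions and define a shared catalog for every session
spark.sql.extensions               org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.demo             org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.demo.type        hadoop
spark.sql.catalog.demo.warehouse   s3a://warehouse/demo
spark.sql.defaultCatalog           demo
```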

jpaulrajredhat commented 1 month ago

@caldeirav This is how it is currently configured in our cluster. The Spark cluster has a default config; when you submit a job, you can create a Spark context or session with config specific to that job. But it varies based on the Spark deploy mode: cluster, client, or local.
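
For example, a per-job session in PySpark might look roughly like this sketch; the master URL and the override value are placeholders, and anything not set here falls back to the cluster's spark-defaults.conf.

```python
from pyspark.sql import SparkSession

# Cluster defaults (catalogs, jars, etc.) apply first; settings passed here
# override them for this job only.
spark = (
    SparkSession.builder
    .master("spark://spark-master:7077")    # placeholder standalone master URL
    .appName("my-datamesh-job")             # placeholder app name
    .config("spark.executor.memory", "2g")  # example per-job override
    .getOrCreate()
)
```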

jpaulrajredhat commented 1 month ago

[screenshot]

Here is the example. If you don't specify a mode, it defaults to client mode. If you run a job in client mode, you have to have all the dependencies on the machine where you start the job. That is why you are getting the dependency errors.

caldeirav commented 1 month ago

The average user will not have half of the details required in the above command, so this is not right. I suggest you have a look at the article, as it explains very clearly that these parameters should be set in the config. My expectation is that in the data mesh context the user should only have to pass the Iceberg catalog; everything else should already be set. This is not only to be user-friendly but also for security purposes.

I do not understand your comment on the client mode. What you seem to describe is actually server mode (I need all the dependencies because I am basically running my own cluster). What we want is a shared Spark cluster (standalone) with capacity going up and down depending on how many client processes are writing / reading from the same Iceberg catalog (which is why running on Kubernetes is good). Dependencies for the available Iceberg catalogs should be managed in our infra-as-code, not by users.
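
In other words, the intended user-side experience would be little more than the sketch below, assuming the catalog (hypothetically named datamesh here) and all Iceberg settings are already provisioned on the cluster side.

```python
from pyspark.sql import SparkSession

# Master URL, Iceberg jars and catalog definitions all come from the
# cluster-side defaults; the user only refers to the catalog by name.
spark = SparkSession.builder.appName("my-analysis").getOrCreate()
df = spark.table("datamesh.sales.orders")  # hypothetical catalog.namespace.table
df.show()
```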

jpaulrajredhat commented 1 month ago

Cluster mode - the driver program runs in the cluster.
Client mode - the driver program runs on your machine / laptop, or wherever you run your job.

https://techvidvan.com/tutorials/spark-modes-of-deployment/

jpaulrajredhat commented 1 month ago

When you submit a job from a notebook, it defaults to client mode and the driver runs in the notebook; that's why you are getting the dependency exception, because you are expecting the job execution response immediately.

jpaulrajredhat commented 1 month ago

I can configure all the data-mesh-required settings in the cluster, but you still need the dependencies if you run your job in client mode, regardless of what is on the cluster. If you run in cluster mode, you don't need the dependencies on the client machine where you submit the job. Cluster mode is mostly used for long-running jobs that you run and forget.

caldeirav commented 1 month ago

No user wants to, or should have to, manage any of these dependencies, so it seems I need cluster mode. Can I take it that if I remove the Spark configuration elements for Iceberg in the code, they will be set by default? An example would be welcome.

[screenshot]

jpaulrajredhat commented 3 weeks ago

Fixed the default configuration for the Spark standalone cluster. All data mesh configuration has been added to the notebook image and deployed to OpenShift.

Here is the new notebook with the default configuration:

https://dataproduct-notebook-datamesh-demo.apps.rosa-8grhg.ssnp.p1.openshiftapps.com

[screenshot]

jpaulrajredhat commented 3 weeks ago

Configured Spark Connect and deployed it to OpenShift. You can create a Spark session directly from the notebook, without any local Spark dependency or JDK, and submit to the Spark cluster.

[screenshot]

...and submit to the cluster:

[screenshot]
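
For reference, connecting from the notebook with nothing but the pyspark package installed would look roughly like this; the Spark Connect URL is a placeholder for the OpenShift service (port 15002 is the Spark Connect default).

```python
from pyspark.sql import SparkSession

# Spark Connect: only the pyspark client library is needed locally;
# planning and execution happen on the remote cluster.
spark = (
    SparkSession.builder
    .remote("sc://spark-connect.datamesh-demo.svc:15002")  # placeholder endpoint
    .getOrCreate()
)

spark.range(10).show()
```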

caldeirav commented 3 weeks ago

This looks great. Do you see Spark Connect used mostly for read / analysis within an app or can we leverage it for write as well?

jpaulrajredhat commented 3 weeks ago

Yes, you can leverage it for both read and write. It supports Spark SQL, the DataFrame API, and the pandas API: https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_connect.html

jpaulrajredhat commented 3 weeks ago

I haven't tested all the functionality. I'll test it this weekend and let you know.

jpaulrajredhat commented 3 weeks ago

Here is the Web UI for Spark Connect, deployed in OpenShift, where you can see all your Spark Connect sessions and the execution history: https://spark-connector-spark-datamesh-demo.apps.rosa-8grhg.ssnp.p1.openshiftapps.com/connect/

jpaulrajredhat commented 3 weeks ago

Tried the default Spark configuration with Spark Connect; it didn't work, and I'm getting the same error.

After looking at the documentation, Spark Connect doesn't support a catalog for PySpark. As of now, catalog support is only available for Scala. So we can't use Spark Connect for our use case; we can use Spark Connect only for DataFrame and Spark SQL functionality.

https://spark.apache.org/docs/latest/spark-connect-overview.html

[screenshot]

I have a separate notebook image with a default configuration that supports all our data mesh requirements: https://dataproduct-notebook-datamesh-demo.apps.rosa-8grhg.ssnp.p1.openshiftapps.com/

caldeirav commented 3 weeks ago

We only need DataFrames, as long as we can write into Iceberg. So the question is: can we make Spark Connect work for us through the DataFrame API?

If not, please provide me with an alternative way to create the Spark context, but at the end of the day we want to at least:

1) Be able to create and process data frames through distributed remote workers
2) Write the result into Iceberg through DataFrame.write

jpaulrajredhat commented 3 weeks ago

Yes, I just figured out that we can write to Iceberg through the native Hadoop catalog instead of Hive. I'm able to make it work on my local machine. I need to test and clean up the config; I'll let you know tomorrow.

jpaulrajredhat commented 2 weeks ago

Spark Connect integration with the notebook is complete and working. Just for testing, I added the Spark configuration to the Spark context; I'll work on Monday on loading it from the config file. When you use Spark Connect, the notebook doesn't need any Spark dependency other than PySpark.

Loading the default config requires some Python code; I'll do that next week.

Here is the Spark Connect integration example notebook, added to your product for your reference. This example writes data directly to Iceberg using a DataFrame through Spark's native Hadoop catalog. Spark Connect does not support the Spark context or an external catalog; it only supports Spark SQL and the DataFrame API.

The key advantage is that it is much faster than a normal Spark context going through the Hive catalog.

Since we are using Iceberg, the data can be read from any catalog as long as we use the same schema that was used for writing.
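
A rough sketch of that pattern, assuming a Hadoop-type Iceberg catalog named local with a placeholder warehouse path is configured on the Spark Connect server (the endpoint, catalog, and table names below are illustrative, not the notebook's exact values):

```python
from pyspark.sql import SparkSession

# Connect to the remote Spark Connect endpoint (placeholder URL).
spark = (
    SparkSession.builder
    .remote("sc://spark-connect.datamesh-demo.svc:15002")
    .getOrCreate()
)

# Assumed server-side catalog configuration, e.g.:
#   spark.sql.catalog.local           = org.apache.iceberg.spark.SparkCatalog
#   spark.sql.catalog.local.type      = hadoop
#   spark.sql.catalog.local.warehouse = s3a://warehouse/demo
df = spark.createDataFrame([(1, "alpha"), (2, "beta")], ["id", "label"])

# DataFrame API write into an Iceberg table through the Hadoop catalog.
df.writeTo("local.db.sample").createOrReplace()

# Read it back from the same catalog.
spark.table("local.db.sample").show()
```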

https://jupyter-notebook-route-datamesh-demo.apps.rosa-8grhg.ssnp.p1.openshiftapps.com/lab/tree/spark-connect-iceberg.ipynb

[screenshot]