[FEATURE] Manage Spark History Server as a deployment via the Spark Operator helm chart.

peter-mcclonski commented 5 months ago

Community Note

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request.
Please do not leave "+1" or other comments that do not add relevant new information or questions; they generate extra noise for issue followers and do not help prioritize the request.
If you are interested in working on this issue or have submitted a pull request, please leave a comment

What is the outcome that you are trying to reach?

The Spark History Server (SHS) is a valuable debugging and process tracing tool. Currently, it has to be deployed independently of the operator. It would be convenient to manage SHS via the Spark Operator helm chart.

Describe the solution you would like

A new section shall be added to the Spark Operator helm chart to define parameters for the SHS deployment. A complicating element of this feature is the storage layer: SHS depends on an accessible storage layer where Spark event logs can be found. The simplest implementation is a shared NFS volume, but blob storage such as S3 or an Azure storage account is a common solution that should be easy to use with our implementation. These third-party solutions require additional libraries to be loaded onto the classpath, a task that SHS does not make easy.
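
As a rough illustration of the storage dependency, the sketch below renders the properties the history server and a job would each need for a given log location, and looks up which extra Hadoop module that location implies. The function names, the bucket and path in the usage note, and the scheme-to-module table are assumptions made for illustration; versions are deliberately left out.

```python
from urllib.parse import urlparse

# Assumed mapping from the log directory's URI scheme to the extra Hadoop
# module that has to be on the classpath. Illustrative, not exhaustive.
EXTRA_MODULES = {
    "s3a": "org.apache.hadoop:hadoop-aws",
    "abfs": "org.apache.hadoop:hadoop-azure",
    "abfss": "org.apache.hadoop:hadoop-azure",
}


def history_server_conf(log_dir: str) -> str:
    """Render the property that tells the history server where to read logs."""
    return f"spark.history.fs.logDirectory {log_dir}\n"


def job_conf(log_dir: str) -> str:
    """Render the job-side properties that write event logs to the same place."""
    return (
        "spark.eventLog.enabled true\n"
        f"spark.eventLog.dir {log_dir}\n"
    )


def extra_modules(log_dir: str) -> list[str]:
    """Return the extra modules (if any) implied by the log directory's scheme."""
    scheme = urlparse(log_dir).scheme
    module = EXTRA_MODULES.get(scheme)
    return [module] if module else []
```

For a hypothetical `s3a://my-bucket/spark-events` location, `extra_modules` returns the `hadoop-aws` coordinate, while a `file:///mnt/spark-events` path on a shared volume needs nothing extra.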

Describe alternatives you have considered

The alternative is for individuals to roll their own SHS deployments, which is a non-trivial process.

Additional context

If we choose to pursue this, we may also wish to consider managing deployment of the Hive Thrift Server.

peter-mcclonski commented 5 months ago

Suggested Architecture

SHS will exist as a wholly separate deployment from spark-operator, as a disjoint chart.
To resolve the problem of dynamically pulling in dependencies/packages, an init container shall be spun up which populates a volume with the union of the default $SPARK_HOME/jars and the result of java -Divy.cache.dir=$SPARK_HOME -Divy.home=$SPARK_HOME -jar $SPARK_HOME/jars/ivy-2.5.1.jar -dependency [PACKAGE]. This populated volume shall be mounted in the SHS container as $SPARK_HOME/jars.
$SPARK_HOME/conf/spark-defaults.conf shall be mounted as a volume populated by a raw text block in the helm chart.
Log storage shall default to a PVC.
Enabling SHS does not by itself enable event logging in your Spark job configuration.
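
The architecture above could be sketched as a pod spec along these lines, built here as a plain Python dict so the pieces are easy to see. This is a minimal sketch under several assumptions, not the chart's actual template: the volume and ConfigMap names, the /jars-out and /spark-events mount paths, SPARK_HOME being /opt/spark, and the image tag in the example are all hypothetical. The cp step and the Ivy -retrieve pattern are also additions made here so that resolved jars land in the shared volume; the original comment does not say how that is done.

```python
import json

SPARK_HOME = "/opt/spark"  # assumption: must match SPARK_HOME in the image


def history_server_pod_spec(image: str, package: str, log_claim: str) -> dict:
    """Build a pod spec dict for SHS with an init container that fills the jars volume.

    `package` is passed to Ivy's -dependency flag as "<org> <module> <revision>".
    """
    resolve = (
        # Assumed step: seed the shared volume with the stock Spark jars.
        f"cp -r {SPARK_HOME}/jars/. /jars-out/ && "
        # Ivy invocation from the comment above, with the package substituted.
        f"java -Divy.cache.dir={SPARK_HOME} -Divy.home={SPARK_HOME} "
        f"-jar {SPARK_HOME}/jars/ivy-2.5.1.jar -dependency {package} "
        # Assumed addition: copy the resolved artifacts into the shared volume.
        "-retrieve '/jars-out/[artifact]-[revision](-[classifier]).[ext]'"
    )
    return {
        "initContainers": [{
            "name": "resolve-jars",
            "image": image,
            "command": ["/bin/sh", "-c", resolve],
            "volumeMounts": [{"name": "jars", "mountPath": "/jars-out"}],
        }],
        "containers": [{
            "name": "history-server",
            "image": image,
            "command": [
                f"{SPARK_HOME}/bin/spark-class",
                "org.apache.spark.deploy.history.HistoryServer",
            ],
            "ports": [{"containerPort": 18080}],  # default SHS UI port
            "volumeMounts": [
                {"name": "jars", "mountPath": f"{SPARK_HOME}/jars"},
                {"name": "conf", "mountPath": f"{SPARK_HOME}/conf"},
                {"name": "logs", "mountPath": "/spark-events"},
            ],
        }],
        "volumes": [
            {"name": "jars", "emptyDir": {}},
            {"name": "conf", "configMap": {"name": "spark-history-conf"}},
            {"name": "logs", "persistentVolumeClaim": {"claimName": log_claim}},
        ],
    }


if __name__ == "__main__":
    # Example values only; the package coordinate is illustrative.
    spec = history_server_pod_spec(
        image="apache/spark:3.5.0",
        package="org.apache.hadoop hadoop-aws 3.3.4",
        log_claim="spark-events",
    )
    print(json.dumps(spec, indent=2))
```

Because the jars volume is an emptyDir, packages are resolved again on every pod start. That keeps the sketch simple at the cost of startup time and a network dependency.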

peter-mcclonski commented 5 months ago

Did some initial work on this just to feel it out. Got automatic resolution of packages working via init containers. It's a bit gross, but it works as a start.

Major TODO items:

[ ] Add arbitrary volume/volumeMount support
[ ] Add support for pulling jars, rather than solely packages
[ ] Add a clean mechanism for mounting spark-defaults.conf
[ ] Create an example that works out of the box-- The hard part being a zero-barrier-to-entry Volume accessible across nodes.
[ ] Docs updates
[ ] General cleanup / hardening

peter-mcclonski commented 5 months ago

Alternatively-- @yuchaoran2011 Do you think it would be worth reviving https://artifacthub.io/packages/helm/cloudnativeapp/spark-history-server and the associated chart and (potentially) having it live here, adjacent to but disconnected from the actual operator chart? I think the real problem here isn't so much that operator should be managing the history server directly, and more that history server, a valuable part of the spark ecosystem, doesn't have any good helm charts out in the wild. We're working on one as part of boozallen/aissemble#66 (https://github.com/boozallen/aissemble/pull/80/files), covered by our BAPL (not as permissive as, say, Apache) solely because we couldn't find an existing OSS solution that was up to date, maintained, and flexible.

yuchaoran2011 commented 5 months ago

I'm not sure it's a good idea to have the history server co-deployed with the operator. A single history server can aggregate jobs managed by multiple Spark operator deployments across multiple k8s clusters.

> I think the real problem here isn't so much that operator should be managing the history server directly, and more that history server, a valuable part of the spark ecosystem, doesn't have any good helm charts out in the wild.

I agree. I haven't looked at the quality of https://artifacthub.io/packages/helm/cloudnativeapp/spark-history-server, but if it's something you have used, I'm for that idea.

peter-mcclonski commented 5 months ago

> I'm not sure if it's a good idea to have history server co-deployed with operator. A single history server can aggregate jobs managed by multiple Spark operator deployments across multiple k8s clusters
>
> > I think the real problem here isn't so much that operator should be managing the history server directly, and more that history server, a valuable part of the spark ecosystem, doesn't have any good helm charts out in the wild.
>
> I agree. I haven't looked at the quality of https://artifacthub.io/packages/helm/cloudnativeapp/spark-history-server, but if it's something you have used, I'm for that idea

Sounds reasonable to me. Wrt the helm chart I linked, I wasn't sure if you had specific thoughts, given that you're listed as the maintainer on artifacthub

yuchaoran2011 commented 5 months ago

Ah, upon a closer look, now I remember that I initially created this chart many years ago. I haven't used it for a long time though, and wouldn't count on it still being production ready.

peter-mcclonski commented 5 months ago

I think there's both interest and clearly an unfilled need in the community for a production ready, standalone spark history chart that's well maintained. Would kubeflow and the spark operator maintainers be open to one being created in this repo, or would it be better housed somewhere totally separate?

KhASQ commented 5 months ago

Kindly make the Spark History Server part of the operator. I think targeting this operator as the single entry point for the Spark on K8s ecosystem will add much better momentum to the development.

For example, the Spark operator could also be integrated to manage an external shuffle service on K8s.

Sorry for interrupting, but I am so excited about the new development on this operator.

ChenYi015 commented 3 months ago

I am also looking forward to a well-maintained helm chart for the Spark History Server, and I think the spark operator repo may be the best place to host it. @yuchaoran2011, would you mind if I contributed a new helm chart based on https://artifacthub.io/packages/helm/cloudnativeapp/spark-history-server and put it under charts/spark-history-server?

ChenYi015 commented 3 months ago

I noticed that @vara-bonthu has maintained a helm chart for the Spark History Server with support for S3: https://github.com/KubedAI/spark-history-server. What do you think about creating a new one for the history server in this repo?

vara-bonthu commented 2 months ago

Spark History Server isn't directly tied to the Spark Operator project. It's usually deployed by users on Kubernetes, even if they don't use the Spark Operator. For example, users running spark-submit without the operator often set up the Spark History Server on their own. This is a separate deployment and, for large workloads, might need multiple replicas. So, it doesn't make sense to link it directly to the Spark Operator project.

If the community is interested, we could propose making the Spark History Server its own project. This could be under Kubeflow or Apache, focusing on multi-cloud and self-managed setups.

This PR can be moved to a new repo.
