kubeflow / spark-operator

Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.

Spark Operator Roadmap 2024 #2193


ChenYi015 commented 1 month ago

Roadmap

Creating this roadmap issue to track work items that we will do in the future. If you have any ideas, please leave a comment.

Features

Chores

jacobsalway commented 1 month ago

Some ideas:

Chores:

cccsss01 commented 1 month ago

Upgrade the default security posture: remove the reliance on user ID 185 (it seems to be tied to the krb5.conf file, which references domains and realms of institutions that may not need it).
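
A minimal sketch of what an upgraded default could look like, assuming the pod-level securityContext can be overridden for driver and executor pods; the UID, package name, and helper function below are illustrative, not the operator's actual API:

```go
// Sketch: instead of relying on the image's built-in user ID 185, set an
// explicit non-root security context on driver/executor pods. The UID below
// is an arbitrary illustrative value, deliberately unrelated to krb5 defaults.
package security

import corev1 "k8s.io/api/core/v1"

// hardenedPodSecurityContext is a hypothetical helper, not part of the operator.
func hardenedPodSecurityContext() *corev1.PodSecurityContext {
	uid := int64(10001)
	nonRoot := true
	return &corev1.PodSecurityContext{
		RunAsUser:    &uid,
		RunAsNonRoot: &nonRoot,
		FSGroup:      &uid,
	}
}
```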

josecsotomorales commented 1 month ago

@jacobsalway @ChenYi015 I think that "Deprecate the need for a mutating webhook by moving all functionality into the pod template" should be a top priority, especially with the upcoming release of Spark 4.
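
For context, a minimal sketch of the pod-template approach, assuming the operator renders the customizations the webhook currently injects into a template file and hands it to spark-submit via the standard Spark conf keys (spark.kubernetes.driver.podTemplateFile / spark.kubernetes.executor.podTemplateFile); the package and function names are illustrative:

```go
// Sketch: write the desired driver pod spec to a template file and reference
// it through Spark's pod template conf key instead of mutating pods via webhook.
package podtemplate

import (
	"fmt"
	"os"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/yaml"
)

// writeDriverPodTemplate is a hypothetical helper: it marshals the pod spec to
// YAML, writes it to a temp file, and returns the extra spark-submit arguments.
func writeDriverPodTemplate(pod *corev1.Pod) ([]string, error) {
	data, err := yaml.Marshal(pod)
	if err != nil {
		return nil, err
	}
	f, err := os.CreateTemp("", "driver-pod-template-*.yaml")
	if err != nil {
		return nil, err
	}
	defer f.Close()
	if _, err := f.Write(data); err != nil {
		return nil, err
	}
	return []string{
		"--conf", fmt.Sprintf("spark.kubernetes.driver.podTemplateFile=%s", f.Name()),
	}, nil
}
```

Note that Spark only applies the template to fields it supports, so some webhook features may still need another mechanism.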

gangahiremath commented 1 month ago

@bnetzi, @vara-bonthu, regarding the point 'referring you to the discussion here, I think we just need to provide in general more options to configure the controller runtime, and that my PR is irrelevant':

Does it mean that 'one queue per app and one go routine per app' (https://github.com/kubeflow/spark-operator/pull/1990) is not a solution for the performance issue faced?

Is https://github.com/kubeflow/spark-operator/pull/2186 a solution for the same issue?

Do we see an opportunity for performance improvement with the approach that we have tried (https://github.com/kubeflow/spark-operator/issues/1574#issuecomment-1699668815)? Summary of changes:
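
As one concrete example of "more options to configure the controller runtime", a hedged sketch of exposing reconciler concurrency through controller-runtime's Options; the reconciler type, worker flag, and import path are assumptions and may differ from the operator's actual code:

```go
// Sketch: let users tune how many SparkApplications are reconciled in parallel
// by wiring a worker-count option into controller-runtime.
package sparkapplication

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller"

	// Import path assumed; it may differ between operator versions.
	"github.com/kubeflow/spark-operator/api/v1beta2"
)

type Reconciler struct{}

func (r *Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Actual reconciliation logic elided.
	return ctrl.Result{}, nil
}

// SetupWithManager registers the controller; workers would typically come from
// a command-line flag or Helm value.
func (r *Reconciler) SetupWithManager(mgr ctrl.Manager, workers int) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&v1beta2.SparkApplication{}).
		WithOptions(controller.Options{
			MaxConcurrentReconciles: workers,
		}).
		Complete(r)
}
```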

c-h-afzal commented 1 month ago

@gangahiremath - I think the two improvements aren't mutually exclusive. Given the testing done by @bnetzi and captured in this document, it seems that the one-mutex-per-queue approach does have performance benefits. I also think that using Go instead of Java-based submission can help reduce job submission latency. However, as pointed out by @bnetzi, using Go would require corresponding changes to the Spark operator whenever spark-submit changes, and may also introduce functionality gaps. We can probably include both improvements in the roadmap if the performance hit from the JVM is significant enough.

It would be great if other users could share whether JVM spin-up times were indeed a contributor to job submission latency. Also, has anyone tweaked/optimized the JVM specifically to alleviate this pain point? Thanks.
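
To help quantify that, a small sketch of how one could measure the JVM start-up share of submission latency by timing a trivial spark-submit invocation; the binary is assumed to be on PATH and the arguments are illustrative:

```go
// Sketch: time a cheap spark-submit call whose cost is dominated by JVM
// start-up rather than job planning, to gauge the fixed per-submission overhead.
package main

import (
	"fmt"
	"log"
	"os/exec"
	"time"
)

func main() {
	start := time.Now()
	out, err := exec.Command("spark-submit", "--version").CombinedOutput()
	elapsed := time.Since(start)
	if err != nil {
		log.Fatalf("spark-submit failed: %v\n%s", err, out)
	}
	fmt.Printf("spark-submit --version took %s\n", elapsed)
}
```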

gangahiremath commented 1 month ago


@c-h-afzal, FYI: see bnetzi's point 'So the way I see it - work queue per app might no longer be the solution' in the thread https://github.com/kubeflow/spark-operator/pull/1990#issuecomment-2412950198.