jupyter-server / enterprise_gateway

A lightweight, multi-tenant, scalable and secure gateway that enables Jupyter Notebooks to share resources across distributed clusters such as Apache Spark, Kubernetes and others.
https://jupyter-enterprise-gateway.readthedocs.io/en/latest/

Add env var to specify a spark k8s operator configMap #1087

Closed · tafaust closed this 2 years ago

tafaust commented 2 years ago

Originates from necro-bumped thread #523 (https://github.com/jupyter-server/enterprise_gateway/issues/523#issuecomment-1126131413).

cc @kevin-bates I haven't had the time to test it *yet*. I will test everything on Sunday/Monday and report back. 😃

welcome[bot] commented 2 years ago

Thanks for submitting your first pull request! You are awesome! :hugs:
If you haven't done so already, check out Jupyter's Code of Conduct. Also, please make sure you followed the pull request template, as this will help us review your contribution more quickly. You can meet the other Jovyans by joining our Discourse forum. There is also an intro thread there where you can stop by and say Hi! :wave:
Welcome to the Jupyter community! :tada:

tafaust commented 2 years ago

@kevin-bates Can you help me with the failing pipelines? To me, the failures seem unrelated to this PR. Looking at the changed files, I cannot see what might cause this error.

The ConfigMap mounts are actually working on my side. I printed the contents of /etc/spark/conf/spark-defaults.conf and they matched the contents of the ConfigMap.
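
For reference, a rough way to verify this (pod and ConfigMap names are placeholders from my setup):

# Print the Spark defaults as seen inside the driver pod
kubectl exec -n spark-apps <driver-pod> -- cat /etc/spark/conf/spark-defaults.conf

# Print the same key straight from the ConfigMap for comparison
kubectl get configmap <spark-conf-configmap> -n spark-apps \
  -o jsonpath='{.data.spark-defaults\.conf}'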

tafaust commented 2 years ago

Successfully tested with a slightly modified version of the https://github.com/GoogleCloudPlatform/spark-on-k8s-operator (namespace changed; see KERNEL_NAMESPACE) and jupyter-enterprise-gateway with the changes from this PR. The ServiceAccount for a SparkApp CRD can be changed here, and its name needs to be set in KERNEL_SERVICE_ACCOUNT_NAME.
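
For context, a minimal sketch of how the referenced ConfigMap could be created in the kernel namespace (the name jupyter-spark-operator matches the KERNEL_SPARKAPP_CONFIG_MAP value used further below; the file name is an assumption):

# Create the ConfigMap that KERNEL_SPARKAPP_CONFIG_MAP will reference
kubectl create configmap jupyter-spark-operator \
  --from-file=spark-defaults.conf -n spark-apps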

tafaust commented 2 years ago

Regarding the pipelines... pip install ".[test]" works for me locally (Python 3.9, x64 Linux), so I cannot reproduce the error on my side.

$ python --version
Python 3.9.12
$ pip3 --version
pip 22.1 from /opt/miniconda3/envs/spark-k8s/lib/python3.9/site-packages/pip (python 3.9)

The pipeline error might be related to: https://github.com/pypa/setuptools/issues/2204

kevin-bates commented 2 years ago

> The pipeline error might be related to: https://github.com/pypa/setuptools/issues/2204

@tahesse - thanks for the update! Yeah, I believe we'll be transitioning to hatch (from flit) to address the pip issue.

tafaust commented 2 years ago

I can add the whole spec for the custom SparkApplication CRD if individual kernel parameterization is in place.

Let me know if I can support you with individual parameterization.

tafaust commented 2 years ago

An issue I encountered: https://github.com/jupyter-server/enterprise_gateway/pull/559 does not work with this PR in my local environment.

When I run

KERNEL_NAMESPACE='spark-apps' KERNEL_SERVICE_ACCOUNT_NAME='spark' jupyter lab \
  --gateway-url=http://enterprise-gateway.fml-jupyter:8888 \
  --GatewayClient.http_user=guest \
  --GatewayClient.http_pwd=guest-password \
  --GatewayClient.request_timeout=120.0

I receive the following in the kernel pod description:

MountVolume.SetUp failed for volume "spark-conf-volume-driver" : configmap "spark-drv-e4ef5180d7bbc664-conf-map-x-spark-apps-x-clustername" not found
Container image "elyra/kernel-spark-py:dev" already present on machine
Created container spark-kubernetes-driver
Started container spark-kubernetes-driver

and the kernel starts successfully.

When I start my local JupyterLab like this

KERNEL_SPARKAPP_CONFIG_MAP='jupyter-spark-operator' KERNEL_NAMESPACE='spark-apps' KERNEL_SERVICE_ACCOUNT_NAME='spark' jupyter lab \
  --gateway-url=http://enterprise-gateway.fml-jupyter:8888 \
  --GatewayClient.http_user=guest \
  --GatewayClient.http_pwd=guest-password \
  --GatewayClient.request_timeout=120.0

I end up with this in the kernel pod description:

MountVolume.SetUp failed for volume "spark-conf-volume-driver" : configmap "spark-drv-eedc1880d7c5906f-conf-map-x-spark-apps-x-clustername" not found
MountVolume.SetUp failed for volume "spark-configmap-volume" : configmap "jupyter-spark-operator-x-spark-apps-x-clustername" not found

The second line is new in this scenario, and the ConfigMap fails to mount in the driver. I am not sure if anyone else has encountered this; I will dig deeper into it.


How to reproduce (a command sketch follows the list):

  1. Check out this feature branch
  2. Merge the feature branch from https://github.com/jupyter-server/enterprise_gateway/pull/559
  3. Run make enterprise-gateway
  4. Push the container to a registry
  5. Test the scenarios from above
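
As a rough shell sketch (local branch names and the registry are placeholders; the image tag assumes the Makefile's elyra/enterprise-gateway:dev default):

# 1./2. check out this PR's branch and merge in PR 559
git fetch origin pull/1087/head:sparkapp-configmap
git checkout sparkapp-configmap
git fetch origin pull/559/head:pr-559
git merge pr-559

# 3. build the enterprise-gateway image
make enterprise-gateway

# 4. push the image to a registry reachable by the cluster
docker tag elyra/enterprise-gateway:dev <registry>/enterprise-gateway:dev
docker push <registry>/enterprise-gateway:dev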

The Spark operator does not show any errors in its logs. This might be related to https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/issues/714, but in my case the ConfigMap volume mount times out and the kernel pod thus does not start. I am digging deeper into this.

tafaust commented 2 years ago

@kevin-bates Please don't merge until I've resolved the issue. Thank you. 😃

kevin-bates commented 2 years ago

@tahesse - would you mind merging with the current main branch so we can get CI passing? Thank you.

> if individual kernel parameterization is in place.

Could you please clarify what you mean by this? I'm assuming it's related to https://github.com/jupyter-server/enterprise_gateway/issues/523#issuecomment-1126256677 but want to make sure.

The full parameterization topic affects the primary Jupyter ecosystem and is not confined to Enterprise Gateway, so it has a much larger scope.

tafaust commented 2 years ago

> @tahesse - would you mind merging with the current main branch so we can get CI passing?

Rebased.

> if individual kernel parameterization is in place.

> Could you please clarify what you mean by this? I'm assuming it's related to https://github.com/jupyter-server/enterprise_gateway/issues/523#issuecomment-1126256677 but want to make sure.

Yes, exactly. I was referring to your last paragraph. With the individual kernel parameterization in place, I'd like to add the whole SparkApplication spec for the operator.

> The full parameterization topic affects the primary Jupyter ecosystem and is not confined to Enterprise Gateway, so it has a much larger scope.

Thank you for the clarification!

kevin-bates commented 2 years ago

> I'd like to add the whole SparkApplication spec for the operator.

That would be great - thank you. Please know that I'm not familiar with this aspect of things but others have used this and I'm hoping we can get their feedback.

tafaust commented 2 years ago

> Please know that I'm not familiar with this aspect of things but others have used this and I'm hoping we can get their feedback.

I'm using it. That's also my main motivation. Jupyter EG is a blessing and I'm happy to contribute. 😊

tafaust commented 2 years ago

So, I think I "fixed" the issue from https://github.com/jupyter-server/enterprise_gateway/pull/1087#issuecomment-1130167969. Somehow, the ConfigMap was created in the default namespace and thus could not be found by the default ServiceAccount of the spark-apps namespace. Doh.

Other than that, I will link https://spark.apache.org/docs/latest/running-on-kubernetes.html#rbac for completeness.
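
For anyone running into the same thing, a minimal sketch in the spirit of the linked RBAC docs, letting the kernel's ServiceAccount read ConfigMaps in its own namespace (names taken from my scenario above):

# Allow the 'spark' ServiceAccount to read ConfigMaps in 'spark-apps'
kubectl create role configmap-reader \
  --verb=get,list,watch --resource=configmaps -n spark-apps
kubectl create rolebinding spark-configmap-reader \
  --role=configmap-reader --serviceaccount=spark-apps:spark -n spark-apps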

@kevin-bates I think that this tiny PR can be merged safely.

kevin-bates commented 2 years ago

> So, I think I "fixed" the issue from https://github.com/jupyter-server/enterprise_gateway/pull/1087#issuecomment-1130167969. Somehow, the ConfigMap was created in the default namespace and thus could not be found by the default ServiceAccount of the spark-apps namespace. Doh.

Excellent!

> Other than that, I will link https://spark.apache.org/docs/latest/running-on-kubernetes.html#rbac for completeness.

:+1:

> @kevin-bates I think that this tiny PR can be merged safely.

Will do following your next commits.

tafaust commented 2 years ago

Just found that sparkConfigMap still has some issues. For reference: https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/issues/216

tafaust commented 2 years ago

Repositioned the docs entries as requested. For the sake of completeness: sparkConfigMap only works if the fix mentioned in my previous comment (https://github.com/jupyter-server/enterprise_gateway/pull/1087#issuecomment-1131859976) is applied. I will try to propose a PR for this issue to the spark-on-k8s-operator repository.

kevin-bates commented 2 years ago

Hi @tahesse - thanks for the update. I'm still a little confused regarding the referenced issue and under what circumstances a user needs to do anything extra. The issue is very long and spans nearly 4 years.

> I will try to propose a PR for this issue to the spark-on-k8s-operator repository.

Will my confusion be clarified by the pending PR and, if so, does it make sense to hold this PR until it fully works? I apologize, but at this point I'm not sure whether the changes in this PR work. Could you please clarify the current state of this: what works, what doesn't, etc.? (Thank you for your patience.)

tafaust commented 2 years ago

Hi @kevin-bates. Sorry, I didn't provide much information in the first place.

What this PR provides

Setting KERNEL_SPARKAPP_CONFIG_MAP populates the sparkConfigMap field of the {Scheduled,}SparkApplication CRD in Kubernetes.
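
A quick way to check the effect, assuming the operator's CRDs are installed (namespace and ConfigMap name are from my test scenario):

# The launched kernel's CRD should now reference the ConfigMap
kubectl get sparkapplications -n spark-apps \
  -o jsonpath='{.items[*].spec.sparkConfigMap}'
# expected output: jupyter-spark-operator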

Known limitations from spark-on-k8s-operator

The issue in the spark-on-k8s-operator repository mentions that (1) SPARK_CONF_DIR is set to /etc/spark/conf/ and (2) the ConfigMap referenced by sparkConfigMap is mounted (via mountPath) at SPARK_CONF_DIR.
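
Both can be observed on a running driver pod, e.g. (pod name is a placeholder):

# (1) SPARK_CONF_DIR points at /etc/spark/conf
kubectl exec -n spark-apps <driver-pod> -- printenv SPARK_CONF_DIR
# (2) the files from sparkConfigMap are mounted there
kubectl exec -n spark-apps <driver-pod> -- ls /etc/spark/conf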

Here is where the "fun" begins with this roughly four-year-old issue. 😅 There exist different solutions to the problem that Spark doesn't make use of the mounted ConfigMap: custom images, PRs in non-upstream repos, and so on. The various solutions are just not pushed to the upstream repository... I taught myself some Go and will try to provide a working PR for the spark-on-k8s-operator repository so that Spark makes use of the config.

This PR works as intended; the limitation is solely on the Spark operator side. FYI, I use Spark 3.2.1 with Scala 2.13. Please let me know how you want to proceed. Thank you!

kevin-bates commented 2 years ago

Excellent clarification - thank you. Yes, let's go ahead and merge this PR.

> I taught myself some Go and will try to provide a working PR for the spark-on-k8s-operator repository so that Spark makes use of the config.

I see. So this would be a PR in the upstream repo (which is written in Go), not necessarily here. Is that correct?

welcome[bot] commented 2 years ago

Congrats on your first merged pull request in this project! :tada: Thank you for contributing, we are very proud of you! :heart: