[JEG vanilla on K8S]: Service token file does not exists error in jupyterhub

georghildebrand commented 4 years ago

Description

Thank you all for this amazing work. I took some time today to try it out on a test Kubernetes cluster (one that does not support Helm).

I basically went through the k8s docs and created a deployment (all fine so far). The Enterprise Gateway pod is now running. In my JupyterHub notebook server I created the kernel.json and the scripts folder. The new kernel shows up in my notebook server. When I start it, I get the following error:

[D 2020-02-28 17:11:08.206 SingleUserLabApp log:174] 304 GET /user/ghildebrand/nbextensions/jupyter_dashboards/notebook/dashboard-view/view-menu.html?v=20200228123557 (ghildebrand@::ffff:10.2.9.0) 2.37ms
Traceback (most recent call last):
  File "/opt/conda/share/jupyter/kernels/python_kubernetes/scripts/launch_kubernetes.py", line 105, in <module>
    launch_kubernetes_kernel(kernel_id, response_addr, spark_context_init_mode)
  File "/opt/conda/share/jupyter/kernels/python_kubernetes/scripts/launch_kubernetes.py", line 32, in launch_kubernetes_kernel
    config.load_incluster_config()
  File "/opt/conda/lib/python3.7/site-packages/kubernetes/config/incluster_config.py", line 96, in load_incluster_config
    cert_filename=SERVICE_CERT_FILENAME).load_and_set()
  File "/opt/conda/lib/python3.7/site-packages/kubernetes/config/incluster_config.py", line 47, in load_and_set
    self._load_config()
  File "/opt/conda/lib/python3.7/site-packages/kubernetes/config/incluster_config.py", line 64, in _load_config
    raise ConfigException("Service token file does not exists.")
kubernetes.config.config_exception.ConfigException: Service token file does not exists.

I thought the notebook server does not need operator permissions or anything similar? I am surely mixing something up. Any hint is welcome.

Environment

Enterprise Gateway Version: latest (via pip install / yaml)
Platform: k8s

lresende commented 4 years ago

There have been some changes around this area in Notebook that might only now be propagating to Hub and causing this, but I will have to test further to see whether it is really a side effect of that. I will update here with our findings.

georghildebrand commented 4 years ago

@lresende thanks for having a look.

Another point that is not clear to me is how the notebook server will know the EG_RESPONSE_IP. From what I see in the code, it is fetched from os.environ. But how can I make it autodetect?

This is my JEG deployment.yaml and some notes:

I am not allowed to use a DaemonSet, but I think that's OK.
I built the kernel specs locally and copied the kernel.json, scripts, etc. manually to the hub notebook server (for testing). As I said, that raised the above error.

# This file defines the Kubernetes objects necessary for Enterprise Gateway to run within Kubernetes.
#
apiVersion: v1
kind: Namespace
metadata:
  name: enterprise-gateway
  labels:
    app: enterprise-gateway
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: enterprise-gateway-sa
  namespace: enterprise-gateway
  labels:
    app: enterprise-gateway
    component: enterprise-gateway
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: enterprise-gateway-controller
  labels:
    app: enterprise-gateway
    component: enterprise-gateway
rules:
  - apiGroups: [""]
    resources: ["pods", "namespaces", "services", "configmaps", "secrets", "persistentvolumes", "persistentvolumeclaims"]
    verbs: ["get", "watch", "list", "create", "delete"]
  - apiGroups: ["rbac.authorization.k8s.io"]
    resources: ["rolebindings"]
    verbs: ["get", "list", "create", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  # Referenced by EG_KERNEL_CLUSTER_ROLE below
  name: kernel-controller
  labels:
    app: enterprise-gateway
    component: kernel
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "watch", "list", "create", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: enterprise-gateway-controller
  labels:
    app: enterprise-gateway
    component: enterprise-gateway
subjects:
  - kind: ServiceAccount
    name: enterprise-gateway-sa
    namespace: enterprise-gateway
roleRef:
  kind: ClusterRole
  name: enterprise-gateway-controller
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: enterprise-gateway
    component: enterprise-gateway
  name: enterprise-gateway
  namespace: enterprise-gateway
spec:
  ports:
  - name: gateway-port
    port: 8888
    targetPort: 8888
  selector:
    gateway-selector: enterprise-gateway
  sessionAffinity: ClientIP
  type: NodePort
# Uncomment in order to use <k8s-master>:8888
#  externalIPs:
#  - k8s-master-public-ip
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: enterprise-gateway
  namespace: enterprise-gateway
  labels:
    gateway-selector: enterprise-gateway
    app: enterprise-gateway
    component: enterprise-gateway
spec:
# Uncomment/Update to deploy multiple replicas of EG
#  replicas: 1
  selector:
    matchLabels:
      gateway-selector: enterprise-gateway
  template:
    metadata:
      labels:
        gateway-selector: enterprise-gateway
        app: enterprise-gateway
        component: enterprise-gateway
    spec:
      # Created above.
      serviceAccountName: enterprise-gateway-sa
      containers:
      - env:
        - name: EG_PORT
          value: "8888"

          # Created above.
        - name: EG_NAMESPACE
          value: "enterprise-gateway"

          # Created above.  Used if no KERNEL_NAMESPACE is provided by client.
        - name: EG_KERNEL_CLUSTER_ROLE
          value: "kernel-controller"

          # All kernels reside in the EG namespace if True, otherwise KERNEL_NAMESPACE
          # must be provided or one will be created for each kernel.
        - name: EG_SHARED_NAMESPACE
          value: "False"

          # NOTE: This requires appropriate volume mounts to make notebook dir accessible
        - name: EG_MIRROR_WORKING_DIRS
          value: "False"

          # Current idle timeout is 1 hour.
        - name: EG_CULL_IDLE_TIMEOUT
          value: "3600"

        - name: EG_LOG_LEVEL
          value: "DEBUG"

        - name: EG_KERNEL_LAUNCH_TIMEOUT
          value: "60"

        - name: EG_KERNEL_WHITELIST
          value: "['r_kubernetes','python_kubernetes','python_tf_kubernetes','python_tf_gpu_kubernetes','scala_kubernetes','spark_r_kubernetes','spark_python_kubernetes','spark_scala_kubernetes']"

        # Ensure the following VERSION tag is updated to the version of Enterprise Gateway you wish to run
        image: elyra/enterprise-gateway:dev
        # Use IfNotPresent policy so that dev-based systems don't automatically
        # update. This provides more control.  Since formal tags will be release-specific
        # this policy should be sufficient for them as well.
        imagePullPolicy: IfNotPresent
        name: enterprise-gateway
        resources:
          requests:
            cpu: "2000m"
            memory: "4Gi"
          limits:
            cpu: "2000m"
            memory: "4Gi"
        ports:
        - containerPort: 8888
          name: gateway-port
          protocol: TCP
## Uncomment to enable NFS-mounted kernelspecs
#        volumeMounts:
#        - name: kernelspecs
#          mountPath: "/usr/local/share/jupyter/kernels"
#      volumes:
#      - name: kernelspecs
#        nfs:
#          server: <internal-ip-of-nfs-server>
#          path: "/usr/local/share/jupyter/kernels"
---
# apiVersion: apps/v1
# kind: DaemonSet
# metadata:
#   name: kernel-image-puller
#   namespace: enterprise-gateway
# spec:
#   selector:
#     matchLabels:
#       name: kernel-image-puller 
#   template:
#     metadata:
#       labels:
#         name: kernel-image-puller 
#         app: enterprise-gateway
#         component: kernel-image-puller
#     spec:
#       containers:
#       - name: kernel-image-puller 
#         image: elyra/kernel-image-puller:dev
#         env:
#           - name: KIP_GATEWAY_HOST
#             value: "http://enterprise-gateway.enterprise-gateway:8888"
#           - name: KIP_INTERVAL
#             value: "300"
#           - name: KIP_PULL_POLICY
#             value: "IfNotPresent"
#         volumeMounts:
#           - name: dockersock
#             mountPath: "/var/run/docker.sock"
#       volumes:
#       - name: dockersock
#         hostPath:
#           path: /var/run/docker.sock

georghildebrand commented 4 years ago

This issue can be closed. I realized that I had to use different env vars for connecting to the kernel. However, I don't know why it was trying to use tokens ...
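
On the question of why tokens come into play: the traceback above shows launch_kubernetes.py calling config.load_incluster_config(), and that call in the Kubernetes Python client always reads the pod's mounted service account token. It therefore fails in any pod that has no token mounted, which is presumably the case for the notebook server pod here (an assumption, since the thread does not show that pod's spec). The following is a minimal diagnostic sketch, not part of Enterprise Gateway; the function name is made up for illustration, while the paths and variable names are the ones the client checks.

```python
import os

# Locations and environment variables the Kubernetes Python client
# relies on for in-cluster configuration.
TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"
CA_CERT_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"
REQUIRED_ENV = ("KUBERNETES_SERVICE_HOST", "KUBERNETES_SERVICE_PORT")


def missing_incluster_prerequisites(env=None, file_exists=os.path.isfile):
    """Return a list of problems; an empty list means the checks passed.

    Hypothetical helper: `env` and `file_exists` are injectable so the
    check can be exercised outside a cluster.
    """
    env = os.environ if env is None else env
    problems = [
        f"environment variable {name} is not set"
        for name in REQUIRED_ENV
        if not env.get(name)
    ]
    for path in (TOKEN_PATH, CA_CERT_PATH):
        if not file_exists(path):
            problems.append(f"file {path} does not exist")
    return problems
```

Running `print(missing_incluster_prerequisites())` in the pod that executes the launch script should show whether the token file is the only missing piece. If it is, the script is probably being run somewhere other than the Enterprise Gateway pod, which uses the enterprise-gateway-sa service account from the manifest.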

kevin-bates commented 4 years ago

Thanks for working through this @georghildebrand. I wanted to respond to the EG_RESPONSE_IP question.

EG_RESPONSE_IP only applies to the interactions between EG and the launched kernel pod. Notebook doesn't come into play here. This environment variable is set prior to starting EG in cases where there is some kind of firewall between EG and the cluster it is launching kernels against, or where EG's local IP is not appropriate when used from the cluster on which the kernel lands. It is rarely used.

This value is used when constructing the EG_RESPONSE_ADDRESS environment variable. The EG_RESPONSE_IP is prepended to a port that EG listens on immediately following the kernel's launch. If EG_RESPONSE_IP is None, the EG server's local IP is used. The EG_RESPONSE_ADDRESS is conveyed to the launched kernel via the environment for containerized kernel launches, or as an argument to the kernel launcher for non-containerized launches. It is this response address to which the launched kernel sends its ZMQ port information, etc. EG then "connects" its kernel manager to these returned ports and steps out of the way, letting EG serve as a proxy between Notebook and the remote kernel.
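
The construction described above can be sketched roughly as follows. This is an illustration of the described behaviour only, not EG's actual code: the function names, the way the local IP is guessed, and the port used in the usage note are all assumptions.

```python
import os
import socket


def local_ip():
    """Best-effort guess of this host's outbound IP address.

    Connecting a UDP socket sends no packets; it only makes the OS pick
    the interface it would use, whose address we then read back.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        try:
            s.connect(("10.255.255.255", 1))
            return s.getsockname()[0]
        except OSError:
            return "127.0.0.1"


def build_response_address(port, env=None, fallback_ip=local_ip):
    """Return "<ip>:<port>", preferring EG_RESPONSE_IP when it is set.

    Hypothetical helper: when EG_RESPONSE_IP is unset or empty, the
    server's local IP (from `fallback_ip`) is used instead.
    """
    env = os.environ if env is None else env
    ip = env.get("EG_RESPONSE_IP") or fallback_ip()
    return f"{ip}:{port}"
```

For example, `build_response_address(8877, env={"EG_RESPONSE_IP": "192.0.2.10"})` yields `192.0.2.10:8877`, while an empty environment falls back to whatever `local_ip()` reports. The port 8877 is arbitrary here.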

I hope that helps.

georghildebrand commented 4 years ago

@kevin-bates thanks for clarification! much appreciated.

lucabem commented 4 years ago

Hi @georghildebrand! How did you solve it? Sometimes I get the same issue.

georghildebrand commented 4 years ago

@lucabem I mainly used the env vars mentioned above. I think that, because the lib uses the k8s client, it falls back to token-based auth or something similar when these are not set correctly. Sadly I'm on mobile only, otherwise I would post the manifest that worked.
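
The thread never names the env vars that fixed the problem, so the following is only a guess at what such a setup could look like. It assumes the relevant variable is JUPYTER_GATEWAY_URL, which Notebook's gateway client reads, and it reuses the Service name, namespace, and port from the manifest above. The helper names are hypothetical.

```python
def gateway_url(service, namespace, port, scheme="http"):
    """Build the in-cluster URL "<scheme>://<service>.<namespace>:<port>"."""
    return f"{scheme}://{service}.{namespace}:{port}"


def gateway_env(service="enterprise-gateway",
                namespace="enterprise-gateway",
                port=8888):
    """Environment for a notebook server that should delegate kernels to EG.

    Assumption: JUPYTER_GATEWAY_URL is the variable in question. The
    defaults mirror the Service defined in the manifest posted above.
    """
    return {"JUPYTER_GATEWAY_URL": gateway_url(service, namespace, port)}
```

With the defaults, `gateway_env()` produces the same address the commented-out kernel-image-puller uses for KIP_GATEWAY_HOST. In a JupyterHub deployment, such a dictionary would typically be passed to the single-user servers through the spawner's environment setting. With the gateway configured this way, the kernelspecs and launch scripts live on the EG side and would not need to be copied to the notebook server.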

jupyter-server / enterprise_gateway

[JEG vanilla on K8S]: Service token file does not exists error in jupyterhub #785
