jupyterhub / zero-to-jupyterhub-k8s

Helm Chart & Documentation for deploying JupyterHub on Kubernetes
https://zero-to-jupyterhub.readthedocs.io

All admin users' roles have been removed #3086

Closed dasantonym closed 1 year ago

dasantonym commented 1 year ago

Bug description

The hub removed the admin role from every user that was previously an admin, so now nobody can administer the server anymore. This was already flaky before: sometimes the admin roles came back after a hub restart, but that is no longer the case.

Actual behaviour

User akoch (also listed under admin_users in the config file) logs in, and the log says:

[I 2023-04-05 08:27:15.702 JupyterHub generic:185] Validating if user claim groups match any of ['staff', 'student']
[I 2023-04-05 08:27:15.710 JupyterHub roles:257] Removing role admin for User: akoch
[I 2023-04-05 08:27:15.720 JupyterHub base:819] User logged in: akoch

User admin logs in:

[I 2023-04-05 08:34:39.882 JupyterHub generic:185] Validating if user claim groups match any of ['staff', 'student']
[I 2023-04-05 08:34:39.889 JupyterHub roles:238] Adding role user for User: admin
[I 2023-04-05 08:34:39.899 JupyterHub base:819] User logged in: admin

It seems this user's admin role had already been removed earlier.

Expected behaviour

Admins listed as admin users in the config under hub.config.Authenticator.admin_users (we're using Zero2Jupyterhub) stay admins.
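
For reference, the relevant part of our values looks roughly like this (the other usernames are redacted; akoch is shown because it appears in the logs above):

hub:
  config:
    Authenticator:
      admin_users:
        - akoch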

How to reproduce

I am not sure how to reproduce this, as we experienced it quite randomly. One possibly relevant detail is that we recently upgraded the Helm chart installation from app version 2.x to 3.1.1. There is a new RBAC system in place, so maybe the migration went wrong? Before the upgrade we regularly lost admin roles, but after a restart of the hub deployment they usually came back, which is no longer the case.

Since we are using the OAuth authenticator, does there need to be something configured in the users' claims coming from Keycloak?
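
For context, the userinfo claims we get back from Keycloak have roughly this shape (values here are illustrative, not the real ones); this is what claim_groups_key: groups and allowed_groups act on:

preferred_username: akoch
groups:
  - staff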

Your personal set up

We're using zero-to-jupyterhub chart version jupyterhub-2.0.1-0.dev.git.6026.h0e7347d7.

We are using the SQLite DB, if that is of any relevance here. I can also attach the configuration, but we don't have any special things set up except for the external authentication and the admin_users listing.

welcome[bot] commented 1 year ago

Thank you for opening your first issue in this project! Engagement like this is essential for open source projects! :hugs:
If you haven't done so already, check out Jupyter's Code of Conduct. Also, please try to follow the issue template as it helps other community members to contribute more effectively. You can meet the other Jovyans by joining our Discourse forum. There is also an intro thread there where you can stop by and say Hi! :wave:
Welcome to the Jupyter community! :tada:

manics commented 1 year ago

Hi! Please can you show us your full Z2JH config?

dasantonym commented 1 year ago

Hey, thanks for the speedy response!

Here's the config:

# fullnameOverride and nameOverride distinguish between blank strings, null values,
# and non-blank strings. For more details, see the configuration reference.
fullnameOverride: ""
nameOverride:

# custom can contain anything you want to pass to the hub pod, as all passed
# Helm template values will be made available there.
custom: {}

# imagePullSecret is configuration to create a k8s Secret that Helm chart's pods
# can get credentials from to pull their images.
imagePullSecret:
  create: false
  automaticReferenceInjection: true
  registry:
  username:
  password:
  email:
# imagePullSecrets is configuration to reference the k8s Secret resources the
# Helm chart's pods can get credentials from to pull their images.
imagePullSecrets: []

# hub relates to the hub pod, responsible for running JupyterHub, its configured
# Authenticator class KubeSpawner, and its configured Proxy class
# ConfigurableHTTPProxy. KubeSpawner creates the user pods, and
# ConfigurableHTTPProxy speaks with the actual ConfigurableHTTPProxy server in
# the proxy pod.
hub:
  revisionHistoryLimit:
  config:
    Authenticator:
      admin_users:
        - REDACTED
        - REDACTED
        - REDACTED
        - REDACTED
        - REDACTED
        - REDACTED
      enable_auth_state: true
      allowed_groups:
        - staff
        - student
      userdata_params:
        state: state
      scope:
        - profile
        - roles
        - openid
    GenericOAuthenticator:
      client_id: jupyterhub
      client_secret: REDACTED
      oauth_callback_url: REDACTED
      authorize_url: REDACTED
      token_url: REDACTED
      userdata_url: REDACTED
      logout_redirect_url: REDACTED
      login_service: keycloak
      auto_login: true
      username_key: preferred_username
      claim_groups_key: groups
    JupyterHub:
      authenticator_class: generic-oauth
  service:
    type: ClusterIP
    annotations: {}
    ports:
      nodePort:
    extraPorts: []
    loadBalancerIP:
  baseUrl: /
  cookieSecret:
  initContainers: []
  nodeSelector:
    node-role.kubernetes.io/control-plane: ""
  tolerations: []
  concurrentSpawnLimit: 64
  consecutiveFailureLimit: 5
  activeServerLimit:
  deploymentStrategy:
    ## type: Recreate
    ## - sqlite-pvc backed hubs require the Recreate deployment strategy as a
    ##   typical PVC storage can only be bound to one pod at a time.
    ## - JupyterHub isn't designed to support being run in parallel. More work
    ##   needs to be done in JupyterHub itself before a fully highly available
    ##   (HA) deployment of JupyterHub on k8s is possible.
    type: Recreate
  db:
    type: sqlite-pvc
    upgrade:
    pvc:
      annotations: {}
      selector: {}
      accessModes:
        - ReadWriteOnce
      storage: 1Gi
      subPath:
      storageClassName: openebs-zfspv-ctrl-b
    url:
    password:
  labels: {}
  annotations: {}
  command: []
  args: []
  extraConfig:
    auth_state_hook: |
      def userdata_hook(spawner, auth_state):
          spawner.oauth_user = auth_state["oauth_user"] if auth_state else { 'groups': [] }

      c.KubeSpawner.auth_state_hook = userdata_hook
    options_form: |
      # Profile list code REDACTED
      profile_list = []
      self.profile_list = profile_list

      # NOTE: We let KubeSpawner inspect profile_list and decide what to
      #       return, it will return a falsy blank string if there is no
      #       profile_list, which makes no options form be presented.
      #
      # ref: https://github.com/jupyterhub/kubespawner/blob/37a80abb0a6c826e5c118a068fa1cf2725738038/kubespawner/spawner.py#L1885-L1935
      #
      return self._options_form_default()

      c.KubeSpawner.options_form = dynamic_options_form
  extraFiles: {}
  extraEnv: {}
  extraContainers: []
  extraVolumes: []
  extraVolumeMounts: []
  image:
    name: jupyterhub/k8s-hub
    tag: "2.0.1-0.dev.git.6026.h0e7347d7"
    pullPolicy:
    pullSecrets: []
  resources: {}
  podSecurityContext:
    fsGroup: 1000
  containerSecurityContext:
    runAsUser: 1000
    runAsGroup: 1000
    allowPrivilegeEscalation: false
  lifecycle: {}
  loadRoles: {}
  services: {}
  pdb:
    enabled: false
    maxUnavailable:
    minAvailable: 1
  networkPolicy:
    enabled: true
    ingress: []
    egress: []
    egressAllowRules:
      cloudMetadataServer: true
      dnsPortsPrivateIPs: true
      nonPrivateIPs: true
      privateIPs: true
    interNamespaceAccessLabels: ignore
    allowedIngressPorts: []
  allowNamedServers: false
  namedServerLimitPerUser:
  authenticatePrometheus:
  redirectToServer:
  shutdownOnLogout:
  templatePaths: []
  templateVars: {}
  livenessProbe:
    # The livenessProbe's aim is to give JupyterHub sufficient time to start up,
    # but to be able to restart it if it becomes unresponsive for ~5 min.
    enabled: true
    initialDelaySeconds: 300
    periodSeconds: 10
    failureThreshold: 30
    timeoutSeconds: 3
  readinessProbe:
    # The readinessProbe's aim is to provide a successful startup indication,
    # but following that to never become unready before its livenessProbe fails
    # and restarts it if needed. Becoming unready after startup serves no purpose,
    # as there is no other pod to fall back to in our non-HA deployment.
    enabled: true
    initialDelaySeconds: 0
    periodSeconds: 2
    failureThreshold: 1000
    timeoutSeconds: 1
  existingSecret:
  serviceAccount:
    create: true
    name:
    annotations: {}
  extraPodSpec: {}

rbac:
  create: true

# proxy relates to the proxy pod, the proxy-public service, and the autohttps
# pod and proxy-http service.
proxy:
  secretToken:
  annotations: {}
  deploymentStrategy:
    ## type: Recreate
    ## - JupyterHub's interaction with the CHP proxy becomes a lot more robust
    ##   with this configuration. To understand this, consider that JupyterHub
    ##   during startup will interact a lot with the k8s service to reach a
    ##   ready proxy pod. If the hub pod during a helm upgrade is restarting
    ##   directly while the proxy pod is making a rolling upgrade, the hub pod
    ##   could end up running a sequence of interactions with the old proxy pod
    ##   and finishing up the sequence of interactions with the new proxy pod.
    ##   As CHP proxy pods carry individual state this is very error prone. One
    ##   outcome when not using Recreate as a strategy has been that user pods
    ##   have been deleted by the hub pod because it considered them unreachable
    ##   as it only configured the old proxy pod but not the new before trying
    ##   to reach them.
    type: Recreate
    ## rollingUpdate:
    ## - WARNING:
    ##   This is required to be set explicitly blank! Without it being
    ##   explicitly blank, k8s will let eventual old values under rollingUpdate
    ##   remain and then the Deployment becomes invalid and a helm upgrade would
    ##   fail with an error like this:
    ##
    ##     UPGRADE FAILED
    ##     Error: Deployment.apps "proxy" is invalid: spec.strategy.rollingUpdate: Forbidden: may not be specified when strategy `type` is 'Recreate'
    ##     Error: UPGRADE FAILED: Deployment.apps "proxy" is invalid: spec.strategy.rollingUpdate: Forbidden: may not be specified when strategy `type` is 'Recreate'
    rollingUpdate:
  # service relates to the proxy-public service
  service:
    type: NodePort
    labels: {}
    annotations: {}
    nodePorts:
      http: 30080
      https: 30443
    disableHttpPort: false
    extraPorts: []
    loadBalancerIP:
    loadBalancerSourceRanges: []
  # chp relates to the proxy pod, which is responsible for routing traffic based
  # on dynamic configuration sent from JupyterHub to CHP's REST API.
  chp:
    revisionHistoryLimit:
    containerSecurityContext:
      runAsUser: 65534 # nobody user
      runAsGroup: 65534 # nobody group
      allowPrivilegeEscalation: false
    image:
      name: jupyterhub/configurable-http-proxy
      # tag is automatically bumped to new patch versions by the
      # watch-dependencies.yaml workflow.
      #
      tag: "4.5.4" # https://github.com/jupyterhub/configurable-http-proxy/tags
      pullPolicy:
      pullSecrets: []
    extraCommandLineFlags: []
    livenessProbe:
      enabled: true
      initialDelaySeconds: 60
      periodSeconds: 10
      failureThreshold: 30
      timeoutSeconds: 3
    readinessProbe:
      enabled: true
      initialDelaySeconds: 0
      periodSeconds: 2
      failureThreshold: 1000
      timeoutSeconds: 1
    resources: {}
    defaultTarget:
    errorTarget:
    extraEnv: {}
    nodeSelector:
      node-role.kubernetes.io/control-plane: ""
    tolerations: []
    networkPolicy:
      enabled: true
      ingress: []
      egress: []
      egressAllowRules:
        cloudMetadataServer: true
        dnsPortsPrivateIPs: true
        nonPrivateIPs: true
        privateIPs: true
      interNamespaceAccessLabels: ignore
      allowedIngressPorts: [http, https]
    pdb:
      enabled: false
      maxUnavailable:
      minAvailable: 1
    extraPodSpec: {}
  # traefik relates to the autohttps pod, which is responsible for TLS
  # termination when proxy.https.type=letsencrypt.
  traefik:
    revisionHistoryLimit:
    containerSecurityContext:
      runAsUser: 65534 # nobody user
      runAsGroup: 65534 # nobody group
      allowPrivilegeEscalation: false
    image:
      name: traefik
      # tag is automatically bumped to new patch versions by the
      # watch-dependencies.yaml workflow.
      #
      tag: "v2.9.8" # ref: https://hub.docker.com/_/traefik?tab=tags
      pullPolicy:
      pullSecrets: []
    hsts:
      includeSubdomains: false
      preload: false
      maxAge: 15724800 # About 6 months
    resources: {}
    labels: {}
    extraInitContainers: []
    extraEnv: {}
    extraVolumes: []
    extraVolumeMounts: []
    extraStaticConfig: {}
    extraDynamicConfig: {}
    nodeSelector:
      node-role.kubernetes.io/control-plane: ""
    tolerations: []
    extraPorts: []
    networkPolicy:
      enabled: true
      ingress: []
      egress: []
      egressAllowRules:
        cloudMetadataServer: true
        dnsPortsPrivateIPs: true
        nonPrivateIPs: true
        privateIPs: true
      interNamespaceAccessLabels: ignore
      allowedIngressPorts: [http, https]
    pdb:
      enabled: false
      maxUnavailable:
      minAvailable: 1
    serviceAccount:
      create: true
      name:
      annotations: {}
    extraPodSpec: {}
  secretSync:
    containerSecurityContext:
      runAsUser: 65534 # nobody user
      runAsGroup: 65534 # nobody group
      allowPrivilegeEscalation: false
    image:
      name: jupyterhub/k8s-secret-sync
      tag: "2.0.1-0.dev.git.6000.h2ae7e032"
      pullPolicy:
      pullSecrets: []
    resources: {}
  labels: {}
  https:
    enabled: false
    type: letsencrypt
    #type: letsencrypt, manual, offload, secret
    letsencrypt:
      contactEmail:
      # Specify custom server here (https://acme-staging-v02.api.letsencrypt.org/directory) to hit staging LE
      acmeServer: https://acme-v02.api.letsencrypt.org/directory
    manual:
      key:
      cert:
    secret:
      name:
      key: tls.key
      crt: tls.crt
    hosts: []

# singleuser relates to the configuration of KubeSpawner which runs in the hub
# pod, and its spawning of user pods such as jupyter-myusername.
singleuser:
  podNameTemplate:
  extraTolerations: []
  nodeSelector:
    node-role.kubernetes.io/worker: worker
  extraNodeAffinity:
    required: []
    preferred: []
  extraPodAffinity:
    required: []
    preferred: []
  extraPodAntiAffinity:
    required: []
    preferred: []
  networkTools:
    image:
      name: jupyterhub/k8s-network-tools
      tag: "2.0.1-0.dev.git.6000.h3053d41b"
      pullPolicy:
      pullSecrets: []
    resources: {}
  cloudMetadata:
    # blockWithIptables set to true will append a privileged initContainer that
    # uses iptables to block the sensitive metadata server at the provided ip.
    blockWithIptables: true
    ip: 169.254.169.254
  networkPolicy:
    enabled: true
    ingress: []
    egress: []
    egressAllowRules:
      cloudMetadataServer: false
      dnsPortsPrivateIPs: true
      nonPrivateIPs: true
      privateIPs: false
    interNamespaceAccessLabels: ignore
    allowedIngressPorts:
      - 8000
      - 7680
  events: true
  extraAnnotations: {}
  extraLabels:
    hub.jupyter.org/network-access-hub: "true"
  extraFiles: {}
  extraEnv: {}
  lifecycleHooks: {}
  initContainers: []
  extraContainers: []
  allowPrivilegeEscalation: false
  uid: 1000
  fsGid: 100
  serviceAccountName:
  storage:
    type: dynamic
    extraLabels: {}
    extraVolumes:
      - name: shm-volume
        emptyDir:
          medium: Memory
    extraVolumeMounts:
      - name: shm-volume
        mountPath: /dev/shm
    static:
      pvcName:
      subPath: "{username}"
    capacity: 200G
    homeMountPath: /home/jovyan
    dynamic:
      storageClass: longhorn
      pvcNameTemplate: claim-{username}{servername}
      volumeNameTemplate: volume-{username}{servername}
      storageAccessModes: [ReadWriteOnce]
  image:
    name: registry.kitegg.de/library/kitegg-singleuser
    tag: cuda-11.8-devel
    pullPolicy: Always
    pullSecrets: []
  startTimeout: 120
  cpu:
    limit: 32
    guarantee: 4
  memory:
    limit: 256G
    guarantee: 32G
  extraResource:
    limits: {}
    guarantees: {}
  cmd: jupyterhub-singleuser
  defaultUrl: "/lab"
  extraPodConfig:
    securityContext:
      fsGroup: 100
      fsGroupChangePolicy: "OnRootMismatch"
  profileList: []

# scheduling relates to the user-scheduler pods and user-placeholder pods.
scheduling:
  userScheduler:
    enabled: true
    revisionHistoryLimit:
    replicas: 2
    logLevel: 4
    # plugins are configured on the user-scheduler to adjust how user pods are
    # scored so that they are scheduled on the most busy node. By doing this, we
    # help scale down more effectively. It isn't obvious how to enable/disable
    # scoring plugins, and configure them, to accomplish this.
    #
    # plugins ref: https://kubernetes.io/docs/reference/scheduling/config/#scheduling-plugins-1
    # migration ref: https://kubernetes.io/docs/reference/scheduling/config/#scheduler-configuration-migrations
    #
    plugins:
      score:
        # These scoring plugins are enabled by default according to
        # https://kubernetes.io/docs/reference/scheduling/config/#scheduling-plugins
        # 2022-02-22.
        #
        # Enabled with high priority:
        # - NodeAffinity
        # - InterPodAffinity
        # - NodeResourcesFit
        # - ImageLocality
        # Remains enabled with low default priority:
        # - TaintToleration
        # - PodTopologySpread
        # - VolumeBinding
        # Disabled for scoring:
        # - NodeResourcesBalancedAllocation
        #
        disabled:
          # We disable these plugins (with regards to scoring) to not interfere
          # or complicate our use of NodeResourcesFit.
          - name: NodeResourcesBalancedAllocation
          # Disable plugins to be allowed to enable them again with a different
          # weight and avoid an error.
          - name: NodeAffinity
          - name: InterPodAffinity
          - name: NodeResourcesFit
          - name: ImageLocality
        enabled:
          - name: NodeAffinity
            weight: 14631
          - name: InterPodAffinity
            weight: 1331
          - name: NodeResourcesFit
            weight: 121
          - name: ImageLocality
            weight: 11
    pluginConfig:
      # Here we declare that we should optimize pods to fit based on a
      # MostAllocated strategy instead of the default LeastAllocated.
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            resources:
              - name: cpu
                weight: 1
              - name: memory
                weight: 1
            type: MostAllocated
    containerSecurityContext:
      runAsUser: 65534 # nobody user
      runAsGroup: 65534 # nobody group
      allowPrivilegeEscalation: false
    image:
      # IMPORTANT: Bumping the minor version of this binary should go hand in
      #            hand with an inspection of the user-scheduler's RBAC resources
      #            that we have forked in
      #            templates/scheduling/user-scheduler/rbac.yaml.
      #
      #            Debugging advice:
      #
      #            - Is configuration of kube-scheduler broken in
      #              templates/scheduling/user-scheduler/configmap.yaml?
      #
      #            - Is the kube-scheduler binary's compatibility to work
      #              against a k8s api-server that is too new or too old?
      #
      #            - You can update the GitHub workflow that runs tests to
      #              include "deploy/user-scheduler" in the k8s namespace report
      #              and reduce the user-scheduler deployments replicas to 1 in
      #              dev-config.yaml to get relevant logs from the user-scheduler
      #              pods. Inspect the "Kubernetes namespace report" action!
      #
      #            - Typical failures are that kube-scheduler fails to search for
      #              resources via its "informers", and won't start trying to
      #              schedule pods before they succeed which may require
      #              additional RBAC permissions or that the k8s api-server is
      #              aware of the resources.
      #
      #            - If "successfully acquired lease" can be seen in the logs, it
      #              is a good sign kube-scheduler is ready to schedule pods.
      #
      name: registry.k8s.io/kube-scheduler
      # tag is automatically bumped to new patch versions by the
      # watch-dependencies.yaml workflow. The minor version is pinned in the
      # workflow, and should be updated there if a minor version bump is done
      # here.
      #
      tag: "v1.25.7" # ref: https://github.com/kubernetes/website/blob/main/content/en/releases/patch-releases.md
      pullPolicy:
      pullSecrets: []
    nodeSelector: {}
    tolerations: []
    labels: {}
    annotations: {}
    pdb:
      enabled: true
      maxUnavailable: 1
      minAvailable:
    resources: {}
    serviceAccount:
      create: true
      name:
      annotations: {}
    extraPodSpec: {}
  podPriority:
    enabled: false
    globalDefault: false
    defaultPriority: 0
    imagePullerPriority: -5
    userPlaceholderPriority: -10
  userPlaceholder:
    enabled: true
    image:
      name: registry.k8s.io/pause
      # tag is automatically bumped to new patch versions by the
      # watch-dependencies.yaml workflow.
      #
      # If you update this, also update prePuller.pause.image.tag
      #
      tag: "3.9"
      pullPolicy:
      pullSecrets: []
    revisionHistoryLimit:
    replicas: 0
    labels: {}
    annotations: {}
    containerSecurityContext:
      runAsUser: 65534 # nobody user
      runAsGroup: 65534 # nobody group
      allowPrivilegeEscalation: false
    resources: {}
  corePods:
    tolerations:
      - key: hub.jupyter.org/dedicated
        operator: Equal
        value: core
        effect: NoSchedule
      - key: hub.jupyter.org_dedicated
        operator: Equal
        value: core
        effect: NoSchedule
    nodeAffinity:
      matchNodePurpose: prefer
  userPods:
    tolerations:
      - key: hub.jupyter.org/dedicated
        operator: Equal
        value: user
        effect: NoSchedule
      - key: hub.jupyter.org_dedicated
        operator: Equal
        value: user
        effect: NoSchedule
    nodeAffinity:
      matchNodePurpose: prefer

# prePuller relates to the hook|continuous-image-puller DaemonSets
prePuller:
  revisionHistoryLimit:
  labels: {}
  annotations: {}
  resources: {}
  containerSecurityContext:
    runAsUser: 65534 # nobody user
    runAsGroup: 65534 # nobody group
    allowPrivilegeEscalation: false
  extraTolerations: []
  # hook relates to the hook-image-awaiter Job and hook-image-puller DaemonSet
  hook:
    enabled: true
    pullOnlyOnChanges: true
    # image and the configuration below relates to the hook-image-awaiter Job
    image:
      name: jupyterhub/k8s-image-awaiter
      tag: "2.0.1-0.dev.git.5866.h7de20b77"
      pullPolicy:
      pullSecrets: []
    containerSecurityContext:
      runAsUser: 65534 # nobody user
      runAsGroup: 65534 # nobody group
      allowPrivilegeEscalation: false
    podSchedulingWaitDuration: 10
    nodeSelector: {}
    tolerations: []
    resources: {}
    serviceAccount:
      create: true
      name:
      annotations: {}
  continuous:
    enabled: true
  pullProfileListImages: true
  extraImages: {}
  pause:
    containerSecurityContext:
      runAsUser: 65534 # nobody user
      runAsGroup: 65534 # nobody group
      allowPrivilegeEscalation: false
    image:
      name: registry.k8s.io/pause
      # tag is automatically bumped to new patch versions by the
      # watch-dependencies.yaml workflow.
      #
      # If you update this, also update scheduling.userPlaceholder.image.tag
      #
      tag: "3.9"
      pullPolicy:
      pullSecrets: []

ingress:
  enabled: false
  annotations: {}
  ingressClassName:
  hosts: []
  pathSuffix:
  pathType: Prefix
  tls: []

# cull relates to the jupyterhub-idle-culler service, responsible for evicting
# inactive singleuser pods.
#
# The configuration below, except for enabled, corresponds to command-line flags
# for jupyterhub-idle-culler as documented here:
# https://github.com/jupyterhub/jupyterhub-idle-culler#as-a-standalone-script
#
cull:
  enabled: false
  users: false # --cull-users
  adminUsers: false # --cull-admin-users
  removeNamedServers: false # --remove-named-servers
  timeout: 3600 # --timeout
  every: 600 # --cull-every
  concurrency: 10 # --concurrency
  maxAge: 0 # --max-age

debug:
  enabled: false

global:
  safeToShowValues: false

consideRatio commented 1 year ago

@dasantonym it looks like you have a copy of the entire default values. When the Helm chart's default values change, they won't change for you, since yours were copy-pasted at some point in the past, I presume. This can easily become a cause of issues; even if it isn't the cause of this one, it makes looking into this more complicated.

Going forward, can you try to trim your config down to what you have actually intentionally changed, and in the very short term describe which chart version or point in time these values were originally copied from?

dasantonym commented 1 year ago

I know, it's basically just for keeping track of the available options without having to look them up. The file is completely replaced with each update of the Helm chart and the customized values are merged back in.

I can post a diff from the default config so you can see what was changed. The chart version is jupyterhub-2.0.1-0.dev.git.6026.h0e7347d7.

dasantonym commented 1 year ago

Here's a diff, does this help?

30a31,60
>     Authenticator:
>       admin_users:
>         - REDACTED
>         - REDACTED
>         - REDACTED
>         - REDACTED
>         - REDACTED
>         - REDACTED
>       enable_auth_state: true
>       allowed_groups:
>         - staff
>         - student
>       userdata_params:
>         state: state
>       scope:
>         - profile
>         - roles
>         - openid
>     GenericOAuthenticator:
>       client_id: jupyterhub
>       client_secret: REDACTED
>       oauth_callback_url: REDACTED
>       authorize_url: REDACTED
>       token_url: REDACTED
>       userdata_url: REDACTED
>       logout_redirect_url: REDACTED
>       login_service: keycloak
>       auto_login: true
>       username_key: preferred_username
>       claim_groups_key: groups
32,33c62
<       admin_access: true
<       authenticator_class: dummy
---
>       authenticator_class: generic-oauth
44c73,74
<   nodeSelector: {}
---
>   nodeSelector:
>     node-role.kubernetes.io/control-plane: ""
67c97
<       storageClassName:
---
>       storageClassName: openebs-zfspv-ctrl-b
74c104,123
<   extraConfig: {}
---
>   extraConfig:
>     auth_state_hook: |
>       def userdata_hook(spawner, auth_state):
>           spawner.oauth_user = auth_state["oauth_user"] if auth_state else { 'groups': [] }
> 
>       c.KubeSpawner.auth_state_hook = userdata_hook
>     options_form: |
>       # Profile list code REDACTED
>       profile_list = []
>       self.profile_list = profile_list
>       
>       # NOTE: We let KubeSpawner inspect profile_list and decide what to
>       #       return, it will return a falsy blank string if there is no
>       #       profile_list, which makes no options form be presented.
>       #
>       # ref: https://github.com/jupyterhub/kubespawner/blob/37a80abb0a6c826e5c118a068fa1cf2725738038/kubespawner/spawner.py#L1885-L1935
>       #
>       return self._options_form_default()
> 
>       c.KubeSpawner.options_form = dynamic_options_form
178c227
<     type: LoadBalancer
---
>     type: NodePort
182,183c231,232
<       http:
<       https:
---
>       http: 30080
>       https: 30443
221c270,271
<     nodeSelector: {}
---
>     nodeSelector:
>       node-role.kubernetes.io/control-plane: ""
267c317,318
<     nodeSelector: {}
---
>     nodeSelector:
>       node-role.kubernetes.io/control-plane: ""
324c375,376
<   nodeSelector: {}
---
>   nodeSelector:
>     node-role.kubernetes.io/worker: worker
356c408,410
<     allowedIngressPorts: []
---
>     allowedIngressPorts:
>       - 8000
>       - 7680
373,374c427,433
<     extraVolumes: []
<     extraVolumeMounts: []
---
>     extraVolumes:
>       - name: shm-volume
>         emptyDir:
>           medium: Memory
>     extraVolumeMounts:
>       - name: shm-volume
>         mountPath: /dev/shm
378c437
<     capacity: 10Gi
---
>     capacity: 200G
381c440
<       storageClass:
---
>       storageClass: longhorn
386,388c445,447
<     name: jupyterhub/k8s-singleuser-sample
<     tag: "2.0.1-0.dev.git.6026.h0e7347d7"
<     pullPolicy:
---
>     name: registry.kitegg.de/library/kitegg-singleuser
>     tag: cuda-11.8-devel
>     pullPolicy: Always
390c449
<   startTimeout: 300
---
>   startTimeout: 120
392,393c451,452
<     limit:
<     guarantee:
---
>     limit: 32
>     guarantee: 4
395,396c454,455
<     limit:
<     guarantee: 1G
---
>     limit: 256G
>     guarantee: 32G
401,402c460,464
<   defaultUrl:
<   extraPodConfig: {}
---
>   defaultUrl: "/lab"
>   extraPodConfig:
>     securityContext:
>       fsGroup: 100
>       fsGroupChangePolicy: "OnRootMismatch"
647c709
<   enabled: true
---
>   enabled: false
649c711
<   adminUsers: true # --cull-admin-users
---
>   adminUsers: false # --cull-admin-users

consideRatio commented 1 year ago

Can you verify that your chart version is using JupyterHub 3.1.1 by visiting https:///hub/api ?

The chart indicates this, but I'm not 100% confident we haven't made a mistake here, and I want to rule out that this is related to JupyterHub 4.

dasantonym commented 1 year ago

Sure! Did that and the output was {"version": "3.1.1"}.

manics commented 1 year ago

Do you know which older Z2JH chart version was definitely working for you?

Are your admin_users also members of one of admin_groups? https://github.com/jupyterhub/oauthenticator/blob/15.1.0/oauthenticator/generic.py#L184-L219

consideRatio commented 1 year ago

:+1: for investigating what @manics links to. I also suspect this relates to how the GenericOAuthenticator behaves when configured with allowed_groups and admin_groups. It could be a bug, depending on how we believe it should behave.

admin_groups is defined here.

dasantonym commented 1 year ago

OMG, thanks so much for pointing this out! Looks like it is working consistently now. It is rather weird that it somehow sometimes worked in the past (I suspect that was chart version 1.2.0, but don't take my word for it).

I did not have the admin_groups value in the config at all. This totally makes sense and I suspected that it had to do with the OAuthenticator, but failed to make that connection.

Not sure if this is a bug... Guess that'll be for you to decide. But maybe it could just be pointed out in the docs?
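
For anyone landing here later: what fixed it for us was adding admin_groups next to admin_users, roughly like this (the group name below is just an example, not our real one):

hub:
  config:
    GenericOAuthenticator:
      admin_groups:
        - some-admin-group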

consideRatio commented 1 year ago

I did not have the admin_groups value in the config at all. This totally makes sense and I suspected that it had to do with the OAuthenticator, but failed to make that connection.

Ah, I'd say it's a bug! The bug would be that when GenericOAuthenticator is used with allowed_groups, the configured admin_users aren't respected. I'd expect that you can be an admin either by being in admin_groups or by being listed in admin_users. Do you agree that that would be the expected behavior? If not, I figure it should at least log a warning about this when JupyterHub starts.

I suspect what may happen is that any logged-in user is recognized as an admin once the hub has started up and read admin_users, but then when the user logs in again they are checked against admin_groups and may be stripped of their permissions. If that is the case, then it's absolutely a bug, because users end up being admins sometimes and not at other times depending on temporary state.

Do you wish to contribute further to https://github.com/jupyterhub/oauthenticator by opening an issue about this?

dasantonym commented 1 year ago

True, it is not really consistent behaviour, and it explains why users that were already logged in became admins again after restarting the hub and then lost their status once they logged out and back in (I now remember that this was the case).

Sure, I'll file an issue with the other repo!