nebari-dev / nebari

🪴 Nebari - your open source data science platform
https://nebari.dev
BSD 3-Clause "New" or "Revised" License

[BUG] - Timeout waiting for module.kubernetes-keycloak-helm.helm_release.keycloak #1491

Closed · abarciauskas-bgse closed this issue 1 year ago

abarciauskas-bgse commented 1 year ago

OS system and architecture in which you are running QHub

macOS Catalina 10.15.7

Expected behavior

Successful deployment of stage 05-kubernetes-keycloak

Actual behavior

On repeated deployments, I am getting this error:

[terraform]: │ Error: timed out waiting for the condition
[terraform]: │ 
[terraform]: │   with module.kubernetes-keycloak-helm.helm_release.keycloak,
[terraform]: │   on modules/kubernetes/keycloak-helm/main.tf line 1, in resource "helm_release" "keycloak":
[terraform]: │    1: resource "helm_release" "keycloak" {

from this planned change:

[terraform]:   # module.kubernetes-keycloak-helm.helm_release.keycloak will be updated in-place
[terraform]:   ~ resource "helm_release" "keycloak" {
[terraform]:         id                         = "keycloak"
[terraform]:         name                       = "keycloak"
[terraform]:       ~ status                     = "failed" -> "deployed"
[terraform]:         # (26 unchanged attributes hidden)
[terraform]: 
[terraform]:         set {
[terraform]:           # At least one attribute in this block is (or was) sensitive,
[terraform]:           # so its contents will not be displayed.
[terraform]:         }
[terraform]:     }
[terraform]: 
[terraform]: Plan: 0 to add, 1 to change, 0 to destroy.
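
For reference, the commands below are generic Helm/kubectl checks for inspecting a stuck release, not output captured from this deployment; they assume Keycloak was installed into the dev namespace used in the config further down.

# Show why the keycloak release is stuck in the "failed" state
$ helm status keycloak -n dev
# Check whether the Keycloak pods ever became ready, and why not
$ kubectl get pods -n dev | grep keycloak
$ kubectl describe pod <keycloak-pod-name> -n dev
$ kubectl logs <keycloak-pod-name> -n dev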

How to Reproduce the problem?

This is my config:

project_name: aimee-qhub
provider: aws
domain: eo-analytics.delta-backend.com
certificate:
  type: lets-encrypt
  acme_email: aimee@developmentseed.org
  acme_server: https://acme-v02.api.letsencrypt.org/directory
security:
  authentication:
    type: GitHub
    config:
      client_id: XXX
      client_secret: XXX
  keycloak:
    initial_root_password: XXX
default_images:
  jupyterhub: quansight/qhub-jupyterhub:v0.4.3
  jupyterlab: quansight/qhub-jupyterlab:v0.4.3
  dask_worker: quansight/qhub-dask-worker:v0.4.3
storage:
  conda_store: 200Gi
  shared_filesystem: 200Gi
theme:
  jupyterhub:
    hub_title: VEDA QHub
    hub_subtitle: NASA VEDA
    welcome: Welcome to the VEDA Analytics QHub.
    logo: https://cdn.cdnlogo.com/logos/n/66/nasa.png
    primary_color: '#5d7fb9'
    secondary_color: '#000000'
    accent_color: '#32C574'
    text_color: '#5d7fb9'
    h1_color: '#5d7fb9'
    h2_color: '#5d7fb9'
    version: v0.4.3
helm_extensions: []
monitoring:
  enabled: true
argo_workflows:
  enabled: true
kbatch:
  enabled: true
cdsdashboards:
  enabled: true
  cds_hide_user_named_servers: true
  cds_hide_user_dashboard_servers: false
ci_cd:
  type: github-actions
  branch: main
  commit_render: true
terraform_state:
  type: remote
namespace: dev
qhub_version: 0.4.3
amazon_web_services:
  region: us-west-2
  kubernetes_version: '1.23'
  node_groups:
    general:
      instance: m5.2xlarge
      min_nodes: 1
      max_nodes: 1
    user:
      instance: m5.xlarge
      min_nodes: 1
      max_nodes: 5
    worker:
      instance: m5.xlarge
      min_nodes: 1
      max_nodes: 5
jupyterhub:
  overrides:
    singleuser:
      lifecycleHooks:
        postStart:
          exec:
            command:
              [
                "gitpuller",
                "https://github.com/NASA-IMPACT/veda-documentation",
                "master",
                "docs",
             ]
profiles:
  jupyterlab:
  - display_name: Small Instance
    description: Stable environment with 2 cpu / 8 GB ram
    default: true
    kubespawner_override:
      cpu_limit: 2
      cpu_guarantee: 1.5
      mem_limit: 8G
      mem_guarantee: 5G
  - display_name: Medium Instance
    description: Stable environment with 4 cpu / 16 GB ram
    kubespawner_override:
      cpu_limit: 4
      cpu_guarantee: 3
      mem_limit: 16G
      mem_guarantee: 10G
  dask_worker:
    Small Worker:
      worker_cores_limit: 2
      worker_cores: 1.5
      worker_memory_limit: 8G
      worker_memory: 5G
      worker_threads: 2
    Medium Worker:
      worker_cores_limit: 4
      worker_cores: 3
      worker_memory_limit: 16G
      worker_memory: 10G
      worker_threads: 4
environments:
  environment-dask.yaml:
    name: dask
    channels:
    - conda-forge
    dependencies:
    - nbgitpuller
    - python
    - ipykernel
    - ipywidgets
    - qhub-dask ==0.4.3
    - python-graphviz
    - numpy
    - numba
    - pandas
    - pip:
      - kbatch
  environment-dashboard.yaml:
    name: dashboard
    channels:
    - conda-forge
    dependencies:
    - nbgitpuller
    - python==3.9.7
    - ipykernel==6.4.1
    - ipywidgets==7.6.5
    - qhub-dask==0.4.3
    - param==1.11.1
    - python-graphviz==0.17
    - matplotlib==3.4.3
    - panel==0.12.7
    - voila==0.3.5
    - streamlit==1.0.0
    - dash==2.0.0
    - cdsdashboards-singleuser==0.6.1

So, with GITHUB_TOKEN and AWS credentials set, I run qhub deploy -c aimee-config.yaml
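
A typical invocation with those credentials exported as environment variables looks roughly like the following; the variable names are the standard GitHub/AWS ones and the values are placeholders, not details from this report:

$ export GITHUB_TOKEN=XXX
$ export AWS_ACCESS_KEY_ID=XXX
$ export AWS_SECRET_ACCESS_KEY=XXX
$ export AWS_DEFAULT_REGION=us-west-2
$ qhub deploy -c aimee-config.yaml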

Command output

No response

Versions and dependencies used.

$ conda --version
conda 4.13.0

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.3", GitCommit:"1e11e4a2108024935ecfcb2912226cedeafd99df", GitTreeState:"clean", BuildDate:"2020-10-14T12:50:19Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"darwin/amd64"}

$ qhub --version
0.4.3

Others that might be relevant:

Compute environment

AWS

Integrations

Keycloak

Anything else?

No response

viniciusdc commented 1 year ago

@abarciauskas-bgse if you have kubectl installed, could you run the following:
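
(Hypothetical example of such a check, assuming the dev namespace from the config above; not the command originally posted:)

$ kubectl get pods -n dev
$ kubectl get events -n dev --sort-by=.lastTimestamp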

viniciusdc commented 1 year ago

Also, have you tried re-running the deploy command a few minutes after the first timeout error?

iameskild commented 1 year ago

Hi @abarciauskas-bgse, we released v0.4.5 on Friday. This latest release should resolve the issues you're experiencing above. Thank you for your interest in QHub, and let us know if your deployment was successful :)

abarciauskas-bgse commented 1 year ago

Adding @tracetechnical to this thread - he has been troubleshooting our deployment and found and fixed an issue with the EBS CSI driver and the updated Kubernetes version, so we believe the 0.4.5 upgrade should fix this problem. Thanks @iameskild

iameskild commented 1 year ago

@abarciauskas-bgse @tracetechnical wonderful! We also fixed the EBS CSI driver issue in this latest release, so it sounds like we're in good shape. Unless there is anything else, I think we can close this issue at this point.