nebari-dev / nebari

🪴 Nebari - your open source data science platform
https://nebari.dev
BSD 3-Clause "New" or "Revised" License

Unable to Connect to Kubernetes Ingress Host After Local Deployment #2555

Open arjxn-py opened 4 months ago

arjxn-py commented 4 months ago

Context

I am encountering issues connecting to the Kubernetes ingress host (172.18.1.100:80) after successfully deploying Nebari locally using the "kind" Kubernetes setup. This issue occurs despite the Terraform apply completing without errors.

Error Messages

After deployment, I attempted to connect to the ingress host and received the following error messages:

[terraform]: 
[terraform]: Outputs:
[terraform]: 
[terraform]: load_balancer_address = {
[terraform]:   "hostname" = ""
[terraform]:   "ip" = "172.18.1.100"
[terraform]: }
Attempt 1 failed to connect to tcp tcp://172.18.1.100:80
Attempt 2 failed to connect to tcp tcp://172.18.1.100:80
Attempt 3 failed to connect to tcp tcp://172.18.1.100:80
Attempt 4 failed to connect to tcp tcp://172.18.1.100:80
Attempt 5 failed to connect to tcp tcp://172.18.1.100:80
Attempt 6 failed to connect to tcp tcp://172.18.1.100:80
Attempt 7 failed to connect to tcp tcp://172.18.1.100:80
Attempt 8 failed to connect to tcp tcp://172.18.1.100:80
Attempt 9 failed to connect to tcp tcp://172.18.1.100:80
Attempt 10 failed to connect to tcp tcp://172.18.1.100:80
ERROR: After stage=04-kubernetes-ingress unable to connect to ingress host=172.18.1.100 port=80

Steps to Reproduce

Initialize Nebari in local mode with the following command:

nebari init local  --project localnebari  --domain localhost  --auth-provider password  --terraform-state=local

Deploy Nebari using the generated nebari-config.yaml file:

nebari deploy -c nebari-config.yaml --disable-prompt

Value and/or benefit

Streamlined local deployment process.

Anything else?

Environment Details

Nebari Version: built from source
Kubernetes Version: v0.18.0
Operating System: Ubuntu 22.04 on WSL2
Docker Engine Version: 26.1.4

arjxn-py commented 4 months ago

After the community meeting on Tuesday, I've been trying to deploy Nebari locally, following https://www.nebari.dev/docs/how-tos/nebari-local/. Below is the configuration file, nebari-config.yaml:

provider: local
namespace: dev
nebari_version: 0.1.dev1567+g7396df8.d20240624
project_name: localnebari
domain: localhost
ci_cd:
  type: none
terraform_state:
  type: local
security:
  keycloak:
    initial_root_password: b5n9wa51a7o9hzfkuc4zg81pw6pd28im
  authentication:
    type: password
theme:
  jupyterhub:
    hub_title: Nebari - localnebari
    welcome: Welcome! Learn about Nebari's features and configurations in <a href="https://www.nebari.dev/docs/welcome">the
      documentation</a>. If you have any questions or feedback, reach the team on
      <a href="https://www.nebari.dev/docs/community#getting-support">Nebari's support
      forums</a>.
    hub_subtitle: Your open source data science platform, hosted
local:
  kube_context:
  node_selectors:
    general:
      key: kubernetes.io/os
      value: linux
    user:
      key: kubernetes.io/os
      value: linux
    worker:
      key: kubernetes.io/os
      value: linux

There is a strong possibility that I'm doing something wrong, so I'm happy to get your help here. Thanks 💐

marcelovilla commented 4 months ago

Hey @arjxn-py

One reason you might be seeing this is because:

Docker containers cannot be executed natively on macOS and Windows, therefore Docker Desktop runs them in a Linux VM. As a consequence, the container networks are not exposed to the host and you cannot reach the kind nodes via IP.

See: https://kind.sigs.k8s.io/docs/user/known-issues/#docker-desktop-for-macos-and-windows

I'm not sure whether this would happen on WSL2, but it's the same error we were getting before on macOS because the nodes were not being exposed. For macOS, we solved this using https://github.com/chipmk/docker-mac-net-connect

Are you using Docker Engine or Docker Desktop on WSL2? I haven't seen anyone deploy Nebari locally on WSL2, but in theory it should be possible as long as the nodes are exposed.
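If it helps with debugging, here is a rough way to check whether the kind nodes are actually exposed to the host (a sketch; the cluster name test-cluster, the default "kind" Docker network, and the load balancer IP from your Terraform output are assumptions):

```bash
# List kind clusters and the node containers behind them
kind get clusters
kind get nodes --name test-cluster

# Show the container IPs Docker assigned on the default "kind" network
docker network inspect kind --format '{{range .Containers}}{{.Name}} -> {{.IPv4Address}}{{println}}{{end}}'

# Try to reach the load balancer address Terraform reported
ping -c 3 172.18.1.100
curl -v --connect-timeout 5 http://172.18.1.100/ || echo "ingress not reachable from the host"
```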

viniciusdc commented 4 months ago

Also, how was your Docker installed? Are you using Docker Desktop?

arjxn-py commented 4 months ago

Thanks for your responses @marcelovilla @viniciusdc, and apologies for the delay; I've been looking into the same thing locally so I could report back.

Yes, I'm using Docker Desktop with WSL2 on Windows.

arjxn-py commented 4 months ago

Hi, I was able to get past the above error by uninstalling Docker Desktop and switching to Docker Engine installed natively on WSL. Should we add a note in the docs about this so that other users or contributors don't run into the same issue?
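For the docs note, a quick way to confirm which backend the CLI is talking to (output strings vary by setup, so treat this as a sketch):

```bash
# Which Docker endpoint is the CLI using?
docker context ls

# Docker Desktop typically reports "Docker Desktop" here,
# while a native engine on WSL2 reports the distro, e.g. an Ubuntu version string
docker info --format '{{.OperatingSystem}}'
```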

But the deploy still isn't able to complete; it gets stuck and times out with these logs:

Please open this dropdown for logs.
```python [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 5m0s elapsed] [terraform]: module.argo-workflows[0].helm_release.argo-workflows: Still modifying... [id=argo-workflows, 5m0s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 5m10s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 5m20s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 5m30s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 5m40s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 5m50s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 6m0s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 6m10s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 6m20s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 6m30s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 6m40s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 6m50s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 7m0s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 7m10s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 7m20s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 7m30s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 7m40s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 7m50s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 8m0s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 8m10s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... 
[id=dev/nebari-conda-store-storage, 8m20s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 8m30s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 8m40s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 8m50s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 9m0s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 9m10s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 9m20s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 9m31s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 9m41s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 9m51s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 10m1s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 10m11s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 10m21s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 10m31s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 10m41s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 10m51s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 11m1s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 11m11s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 11m21s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 11m31s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 11m41s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... 
[id=dev/nebari-conda-store-storage, 11m51s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 12m1s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 12m11s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 12m21s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 12m31s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 12m41s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 12m51s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 13m1s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 13m11s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 13m21s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 13m31s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 13m41s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 13m51s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 14m1s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 14m11s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 14m21s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 14m31s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 14m41s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 14m51s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 15m1s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 15m11s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... 
[id=dev/nebari-conda-store-storage, 15m21s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 15m31s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 15m41s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 15m51s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 16m1s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 16m11s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 16m21s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 16m31s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 16m41s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 16m51s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 17m1s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 17m11s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 17m21s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 17m31s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 17m41s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 17m51s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 18m1s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 18m11s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 18m21s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 18m31s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 18m41s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... 
[id=dev/nebari-conda-store-storage, 18m51s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 19m1s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 19m11s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 19m21s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 19m31s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 19m41s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main: Still destroying... [id=dev/nebari-conda-store-storage, 19m51s elapsed] [terraform]: ╷ [terraform]: │ Warning: "default_secret_name" is no longer applicable for Kubernetes v1.24.0 and above [terraform]: │ [terraform]: │ with module.argo-workflows[0].kubernetes_service_account_v1.argo-admin-sa, [terraform]: │ on modules/kubernetes/services/argo-workflows/main.tf line 188, in resource "kubernetes_service_account_v1" "argo-admin-sa": [terraform]: │ 188: resource "kubernetes_service_account_v1" "argo-admin-sa" { [terraform]: │ [terraform]: │ Starting from version 1.24.0 Kubernetes does not automatically generate a [terraform]: │ token for service accounts, in this case, "default_secret_name" will be [terraform]: │ empty [terraform]: │ [terraform]: │ (and 5 more similar warnings elsewhere) [terraform]: â•ĩ [terraform]: ╷ [terraform]: │ Error: Persistent volume claim nebari-conda-store-storage still exists with finalizers: [kubernetes.io/pvc-protection] [terraform]: │ [terraform]: │ [terraform]: â•ĩ [terraform]: ╷ [terraform]: │ Error: timed out waiting for the condition [terraform]: │ [terraform]: │ with module.argo-workflows[0].helm_release.argo-workflows, [terraform]: │ on modules/kubernetes/services/argo-workflows/main.tf line 10, in resource "helm_release" "argo-workflows": [terraform]: │ 10: resource "helm_release" "argo-workflows" { [terraform]: │ [terraform]: â•ĩ ╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────â•Ū │ /mnt/c/Users/Arjun/Desktop/Arjun/nebari/src/_nebari/subcommands/deploy.py:87 in deploy │ │ │ │ 84 │ │ │ │ │ stages.remove(stage) │ │ 85 │ │ │ rich.print("Skipping remote state provision") │ │ 86 │ │ │ │ ❱ 87 │ │ deploy_configuration( │ │ 88 │ │ │ config, │ │ 89 │ │ │ stages, │ │ 90 │ │ │ disable_prompt=disable_prompt, │ │ │ │ /mnt/c/Users/Arjun/Desktop/Arjun/nebari/src/_nebari/deploy.py:53 in deploy_configuration │ │ │ │ 50 │ │ with contextlib.ExitStack() as stack: │ │ 51 │ │ │ for stage in stages: │ │ 52 │ │ │ │ s = stage(output_directory=pathlib.Path.cwd(), config=config) │ │ ❱ 53 │ │ │ │ stack.enter_context(s.deploy(stage_outputs, disable_prompt)) │ │ 54 │ │ │ │ │ │ 55 │ │ │ │ if not disable_checks: │ │ 56 │ │ │ │ │ s.check(stage_outputs, disable_prompt) │ │ │ │ /home/arjxnpy/miniconda3/envs/nebari/lib/python3.10/contextlib.py:492 in enter_context │ │ │ │ 489 │ │ # statement. 
│ │ 490 │ │ _cm_type = type(cm) │ │ 491 │ │ _exit = _cm_type.__exit__ │ │ ❱ 492 │ │ result = _cm_type.__enter__(cm) │ │ 493 │ │ self._push_cm_exit(cm, _exit) │ │ 494 │ │ return result │ │ 495 │ │ │ │ /home/arjxnpy/miniconda3/envs/nebari/lib/python3.10/contextlib.py:135 in __enter__ │ │ │ │ 132 │ │ # they are only needed for recreation, which is not possible anymore │ │ 133 │ │ del self.args, self.kwds, self.func │ │ 134 │ │ try: │ │ ❱ 135 │ │ │ return next(self.gen) │ │ 136 │ │ except StopIteration: │ │ 137 │ │ │ raise RuntimeError("generator didn't yield") from None │ │ 138 │ │ │ │ /mnt/c/Users/Arjun/Desktop/Arjun/nebari/src/_nebari/stages/base.py:72 in deploy │ │ │ │ 69 │ │ │ deploy_config["terraform_import"] = True │ │ 70 │ │ │ deploy_config["state_imports"] = state_imports │ │ 71 │ │ │ │ ❱ 72 │ │ self.set_outputs(stage_outputs, terraform.deploy(**deploy_config)) │ │ 73 │ │ self.post_deploy(stage_outputs, disable_prompt) │ │ 74 │ │ yield │ │ 75 │ │ │ │ /mnt/c/Users/Arjun/Desktop/Arjun/nebari/src/_nebari/provider/terraform.py:71 in deploy │ │ │ │ 68 │ │ │ │ ) │ │ 69 │ │ │ │ 70 │ │ if terraform_apply: │ │ ❱ 71 │ │ │ apply(directory, var_files=[f.name]) │ │ 72 │ │ │ │ 73 │ │ if terraform_destroy: │ │ 74 │ │ │ destroy(directory, var_files=[f.name]) │ │ │ │ /mnt/c/Users/Arjun/Desktop/Arjun/nebari/src/_nebari/provider/terraform.py:151 in apply │ │ │ │ 148 │ │ + ["-var-file=" + _ for _ in var_files] │ │ 149 │ ) │ │ 150 │ with timer(logger, "terraform apply"): │ │ ❱ 151 │ │ run_terraform_subprocess(command, cwd=directory, prefix="terraform") │ │ 152 │ │ 153 │ │ 154 def output(directory=None): │ │ │ │ /mnt/c/Users/Arjun/Desktop/Arjun/nebari/src/_nebari/provider/terraform.py:118 in │ │ run_terraform_subprocess │ │ │ │ 115 │ terraform_path = download_terraform_binary() │ │ 116 │ logger.info(f" terraform at {terraform_path}") │ │ 117 │ if run_subprocess_cmd([terraform_path] + processargs, **kwargs): │ │ ❱ 118 │ │ raise TerraformException("Terraform returned an error") │ │ 119 │ │ 120 │ │ 121 def version(): │ ╰──────────────────────────────────────────────────────────────────────────────────────────────────â•Ŋ TerraformException: Terraform returned an error ```

I'd be more than happy to try things you'd suggest further. Thanks 💐

Adam-D-Lewis commented 4 months ago

@arjxn-py Try deleting and recreating the test cluster.

Something like kind delete cluster -n test-cluster, and then kind create cluster -n test-cluster.
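Spelled out as commands (a sketch; test-cluster is the cluster name the Nebari local setup uses):

```bash
# Delete the existing kind cluster, then recreate it before redeploying
kind delete cluster --name test-cluster
kind create cluster --name test-cluster
```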

arjxn-py commented 4 months ago

Thanks Adam, will try that & get back :)

arjxn-py commented 4 months ago

But the deploy still isn't able to complete; it gets stuck and times out with these logs:

I confirmed this error was happening because resources from the previous deploy attempt were not deleted. nebari destroy does the job; should we also include this as a note in the local deployment docs?
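For reference, this is roughly what worked for me:

```bash
# Tear down the leftover resources from the previous (partial) deployment
nebari destroy -c nebari-config.yaml

# ...then redeploy from a clean state
nebari deploy -c nebari-config.yaml --disable-prompt
```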

However, after this error was resolved, I got another error, which comes every time just after nebari-conda-store-server finishes creating:

[terraform]: │ Error: client rate limiter Wait returned an error: context deadline exceeded
[terraform]: │
[terraform]: │   with module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main,
[terraform]: │   on modules/kubernetes/services/conda-store/worker.tf line 30, in resource "kubernetes_persistent_volume_claim" "main":
[terraform]: │   30: resource "kubernetes_persistent_volume_claim" "main" {
Full Logs
```python [terraform]: module.kubernetes-conda-store-server.kubernetes_deployment.server: Still creating... [18m6s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_deployment.server: Still creating... [18m16s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_deployment.server: Still creating... [18m26s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_deployment.server: Still creating... [18m36s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_deployment.server: Still creating... [18m46s elapsed] [terraform]: module.kubernetes-conda-store-server.kubernetes_deployment.server: Creation complete after 18m52s [id=dev/nebari-conda-store-server] [terraform]: ╷ [terraform]: │ Warning: "default_secret_name" is no longer applicable for Kubernetes v1.24.0 and above [terraform]: │ [terraform]: │ with module.argo-workflows[0].kubernetes_service_account_v1.argo-admin-sa, [terraform]: │ on modules/kubernetes/services/argo-workflows/main.tf line 188, in resource "kubernetes_service_account_v1" "argo-admin-sa": [terraform]: │ 188: resource "kubernetes_service_account_v1" "argo-admin-sa" { [terraform]: │ [terraform]: │ Starting from version 1.24.0 Kubernetes does not automatically generate a [terraform]: │ token for service accounts, in this case, "default_secret_name" will be [terraform]: │ empty [terraform]: │ [terraform]: │ (and 5 more similar warnings elsewhere) [terraform]: â•ĩ [terraform]: ╷ [terraform]: │ Error: client rate limiter Wait returned an error: context deadline exceeded [terraform]: │ [terraform]: │ with module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main, [terraform]: │ on modules/kubernetes/services/conda-store/worker.tf line 30, in resource "kubernetes_persistent_volume_claim" "main": [terraform]: │ 30: resource "kubernetes_persistent_volume_claim" "main" { [terraform]: │ [terraform]: â•ĩ ╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────â•Ū │ /mnt/c/Users/Arjun/Desktop/Arjun/nebari/src/_nebari/subcommands/deploy.py:87 in deploy │ │ │ │ 84 │ │ │ │ │ stages.remove(stage) │ │ 85 │ │ │ rich.print("Skipping remote state provision") │ │ 86 │ │ │ │ ❱ 87 │ │ deploy_configuration( │ │ 88 │ │ │ config, │ │ 89 │ │ │ stages, │ │ 90 │ │ │ disable_prompt=disable_prompt, │ │ │ │ /mnt/c/Users/Arjun/Desktop/Arjun/nebari/src/_nebari/deploy.py:53 in deploy_configuration │ │ │ │ 50 │ │ with contextlib.ExitStack() as stack: │ │ 51 │ │ │ for stage in stages: │ │ 52 │ │ │ │ s = stage(output_directory=pathlib.Path.cwd(), config=config) │ │ ❱ 53 │ │ │ │ stack.enter_context(s.deploy(stage_outputs, disable_prompt)) │ │ 54 │ │ │ │ │ │ 55 │ │ │ │ if not disable_checks: │ │ 56 │ │ │ │ │ s.check(stage_outputs, disable_prompt) │ │ │ │ /home/arjxnpy/miniconda3/envs/nebari/lib/python3.10/contextlib.py:492 in enter_context │ │ │ │ 489 │ │ # statement. 
│ │ 490 │ │ _cm_type = type(cm) │ │ 491 │ │ _exit = _cm_type.__exit__ │ │ ❱ 492 │ │ result = _cm_type.__enter__(cm) │ │ 493 │ │ self._push_cm_exit(cm, _exit) │ │ 494 │ │ return result │ │ 495 │ │ │ │ /home/arjxnpy/miniconda3/envs/nebari/lib/python3.10/contextlib.py:135 in __enter__ │ │ │ │ 132 │ │ # they are only needed for recreation, which is not possible anymore │ │ 133 │ │ del self.args, self.kwds, self.func │ │ 134 │ │ try: │ │ ❱ 135 │ │ │ return next(self.gen) │ │ 136 │ │ except StopIteration: │ │ 137 │ │ │ raise RuntimeError("generator didn't yield") from None │ │ 138 │ │ │ │ /mnt/c/Users/Arjun/Desktop/Arjun/nebari/src/_nebari/stages/base.py:72 in deploy │ │ │ │ 69 │ │ │ deploy_config["terraform_import"] = True │ │ 70 │ │ │ deploy_config["state_imports"] = state_imports │ │ 71 │ │ │ │ ❱ 72 │ │ self.set_outputs(stage_outputs, terraform.deploy(**deploy_config)) │ │ 73 │ │ self.post_deploy(stage_outputs, disable_prompt) │ │ 74 │ │ yield │ │ 75 │ │ │ │ /mnt/c/Users/Arjun/Desktop/Arjun/nebari/src/_nebari/provider/terraform.py:71 in deploy │ │ │ │ 68 │ │ │ │ ) │ │ 69 │ │ │ │ 70 │ │ if terraform_apply: │ │ ❱ 71 │ │ │ apply(directory, var_files=[f.name]) │ │ 72 │ │ │ │ 73 │ │ if terraform_destroy: │ │ 74 │ │ │ destroy(directory, var_files=[f.name]) │ │ │ │ /mnt/c/Users/Arjun/Desktop/Arjun/nebari/src/_nebari/provider/terraform.py:151 in apply │ │ │ │ 148 │ │ + ["-var-file=" + _ for _ in var_files] │ │ 149 │ ) │ │ 150 │ with timer(logger, "terraform apply"): │ │ ❱ 151 │ │ run_terraform_subprocess(command, cwd=directory, prefix="terraform") │ │ 152 │ │ 153 │ │ 154 def output(directory=None): │ │ │ │ /mnt/c/Users/Arjun/Desktop/Arjun/nebari/src/_nebari/provider/terraform.py:118 in │ │ run_terraform_subprocess │ │ │ │ 115 │ terraform_path = download_terraform_binary() │ │ 116 │ logger.info(f" terraform at {terraform_path}") │ │ 117 │ if run_subprocess_cmd([terraform_path] + processargs, **kwargs): │ │ ❱ 118 │ │ raise TerraformException("Terraform returned an error") │ │ 119 │ │ 120 │ │ 121 def version(): │ ╰──────────────────────────────────────────────────────────────────────────────────────────────────â•Ŋ TerraformException: Terraform returned an error ```
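If it helps, I can also run checks like these next time it happens (namespace and resource names taken from the logs above):

```bash
# Is the conda-store PVC bound, and does a matching PV exist?
kubectl get pvc,pv -n dev
kubectl describe pvc nebari-conda-store-storage -n dev

# Recent events often show why a claim is stuck (e.g. waiting on the provisioner)
kubectl get events -n dev --sort-by=.lastTimestamp | tail -n 20
```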

I'm really sorry to be bothering you again; local deployment is something not many people from the community have been doing. Thanks a lot for your help 💐

asmacdo commented 3 months ago

Similar issue on Linux (I used the same commands as above for easy comparison):

nebari init local  --project localnebari  --domain localhost  --auth-provider password  --terraform-state=local
nebari deploy -c nebari-config.yaml --disable-prompt
uname -a
Linux fancy 6.8.9-100.fc38.x86_64 #1 SMP PREEMPT_DYNAMIC Thu May  2 18:50:49 UTC 2024 x86_64 GNU/Linux
docker --version 
Docker version 26.1.4, build 5650f9b
kind --version
kind version 0.14.0
nebari --version
2024.6.1

Terraform error:

[terraform]: │ Warning: "default_secret_name" is no longer applicable for Kubernetes v1.24.0 and above
[terraform]: │ 
[terraform]: │   with module.argo-workflows[0].kubernetes_service_account_v1.argo-admin-sa,
[terraform]: │   on modules/kubernetes/services/argo-workflows/main.tf line 188, in resource "kubernetes_service_account_v1" "argo-admin-sa":
[terraform]: │  188: resource "kubernetes_service_account_v1" "argo-admin-sa" {
[terraform]: │ 
[terraform]: │ Starting from version 1.24.0 Kubernetes does not automatically generate a
[terraform]: │ token for service accounts, in this case, "default_secret_name" will be
[terraform]: │ empty
[terraform]: │ 
[terraform]: │ (and 5 more similar warnings elsewhere)
[terraform]: ╵
[terraform]: ╷
[terraform]: │ Error: timed out waiting for the condition
[terraform]: │ 
[terraform]: │   with module.argo-workflows[0].helm_release.argo-workflows,
[terraform]: │   on modules/kubernetes/services/argo-workflows/main.tf line 10, in resource "helm_release" "argo-workflows":
[terraform]: │   10: resource "helm_release" "argo-workflows" {
[terraform]: │ 
[terraform]: ╵
[terraform]: ╷
[terraform]: │ Error: timed out waiting for the condition
[terraform]: │ 
[terraform]: │   with module.jupyterhub.helm_release.jupyterhub,
[terraform]: │   on modules/kubernetes/services/jupyterhub/main.tf line 54, in resource "helm_release" "jupyterhub":
[terraform]: │   54: resource "helm_release" "jupyterhub" {
[terraform]: │ 

nebari-config.yaml

```
provider: local
namespace: dev
nebari_version: 2024.6.1
project_name: localnebari
domain: localhost
ci_cd:
  type: none
terraform_state:
  type: local
security:
  keycloak:
    initial_root_password:
  authentication:
    type: password
theme:
  jupyterhub:
    hub_title: Nebari - localnebari
    welcome: Welcome! Learn about Nebari's features and configurations in the documentation.
      If you have any questions or feedback, reach the team on Nebari's support forums.
    hub_subtitle: Your open source data science platform, hosted
local:
  kube_context:
  node_selectors:
    general:
      key: kubernetes.io/os
      value: linux
    user:
      key: kubernetes.io/os
      value: linux
    worker:
      key: kubernetes.io/os
      value: linux
```

I have 2 pods failing

NAME                                                     READY   STATUS             RESTARTS      AGE
alertmanager-nebari-kube-prometheus-sta-alertmanager-0   2/2     Running            0             9m12s
argo-workflows-server-585dd7f586-gcrxm                   0/1     Error              3 (30s ago)   49s
argo-workflows-workflow-controller-586dcfd8f7-8rggr      1/1     Running            0             10m
continuous-image-puller-bd4bc                            1/1     Running            0             7m53s
forwardauth-deployment-7c454d6758-zdmxm                  1/1     Running            0             10m
hub-6c4f884494-lvq59                                     0/1     CrashLoopBackOff   6 (81s ago)   7m53s
keycloak-0                                               1/1     Running            0             11m
keycloak-postgresql-0                                    1/1     Running            0             11m
loki-backend-0                                           2/2     Running            0             9m15s
loki-canary-gs754                                        1/1     Running            0             9m15s
loki-gateway-bf4d7b485-nbqgn                             1/1     Running            0             9m15s
loki-read-f4c8cc665-cdmfw                                1/1     Running            0             9m15s
loki-write-0                                             1/1     Running            0             9m15s
nebari-conda-store-minio-66688b88d8-pvdlz                1/1     Running            0             9m38s
nebari-conda-store-postgresql-postgresql-0               1/1     Running            0             9m46s
nebari-conda-store-redis-master-0                        1/1     Running            0             9m33s
nebari-conda-store-server-78db858d79-lc9wm               1/1     Running            0             8m57s
nebari-conda-store-worker-d6bc68778-mqqk4                2/2     Running            0             8m57s
nebari-daskgateway-controller-74db8c5df-6zfcn            1/1     Running            0             8m15s
nebari-daskgateway-gateway-57bb7cd7d4-rbl87              1/1     Running            0             8m15s
nebari-grafana-7759cd44bc-tvnvv                          3/3     Running            0             9m15s
nebari-jupyterhub-sftp-68d8999fd7-xcq28                  1/1     Running            0             9m51s
nebari-jupyterhub-ssh-f47499886-spsfx                    1/1     Running            0             10m
nebari-kube-prometheus-sta-operator-77cbbffb7d-7m7ng     1/1     Running            0             9m15s
nebari-kube-state-metrics-65b8c8fd48-krh4c               1/1     Running            0             9m15s
nebari-loki-minio-647979b7c5-8j4hb                       1/1     Running            0             9m50s
nebari-prometheus-node-exporter-b9846                    1/1     Running            0             9m15s
nebari-promtail-c6zbv                                    1/1     Running            0             7m55s
nebari-traefik-ingress-75f6d994dd-b6dkx                  1/1     Running            0             18m
nebari-workflow-controller-5d98dbf8bd-8jhxv              1/1     Running            0             10m
nfs-server-nfs-6b8c9cd476-nk8v7                          1/1     Running            0             10m
prometheus-nebari-kube-prometheus-sta-prometheus-0       2/2     Running            0             9m12s
proxy-f9ffcfd97-zpglq                                    1/1     Running            0             7m53s
user-scheduler-849d4f8d95-59slv                          1/1     Running            0             7m53s
user-scheduler-849d4f8d95-pkn6w                          1/1     Running            0             7m53s

The Argo Workflows server is getting connection refused:

k logs argo-workflows-server-585dd7f586-gcrxm -n dev
time="2024-08-06T18:23:12.618Z" level=info msg="not enabling pprof debug endpoints"
time="2024-08-06T18:23:12.618Z" level=info authModes="[sso client]" baseHRef=/argo/ managedNamespace=dev namespace=dev secure=false ssoNamespace=dev
time="2024-08-06T18:23:12.618Z" level=warning msg="You are running in insecure mode. Learn how to enable transport layer security: https://argoproj.github.io/argo-workflows/tls/"
Error: Get "https://localhost/auth/realms/nebari/.well-known/openid-configuration": dial tcp [::1]:443: connect: connection refused
Usage:
  argo server [flags]

Examples:
<helptext snip>
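To see where that request ought to be going, one check (a sketch, using the 172.18.1.100 ingress address from my /etc/hosts below) is to hit the Keycloak discovery endpoint through the ingress IP instead of localhost, from both the host and inside the cluster:

```bash
# From the host: does the OIDC discovery document resolve via the ingress IP?
curl -vk https://172.18.1.100/auth/realms/nebari/.well-known/openid-configuration

# Same request from inside the cluster, via a throwaway pod
kubectl run curl-test -n dev --rm -it --restart=Never \
  --image=curlimages/curl -- \
  curl -vk https://172.18.1.100/auth/realms/nebari/.well-known/openid-configuration
```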

The JupyterHub pod is failing similarly:

k logs hub-6c4f884494-lvq59 -n dev
Loading /usr/local/etc/jupyterhub/secret/values.yaml
No config at /usr/local/etc/jupyterhub/existing-secret/values.yaml
Loading extra config: 01-theme.py
Loading extra config: 02-spawner.py
Loading extra config: 03-profiles.py
Loading extra config: 04-auth.py
[W 2024-08-06 18:21:11.668 JupyterHub app:1697] JupyterHub.extra_handlers is deprecated in JupyterHub 3.1. Please use JupyterHub services to register additional HTTP endpoints.
[I 2024-08-06 18:21:11.668 JupyterHub app:3286] Running JupyterHub version 5.0.0
[I 2024-08-06 18:21:11.668 JupyterHub app:3316] Using Authenticator: builtins.KeyCloakOAuthenticator
[I 2024-08-06 18:21:11.668 JupyterHub app:3316] Using Spawner: kubespawner.spawner.KubeSpawner-4.2.0
[I 2024-08-06 18:21:11.668 JupyterHub app:3316] Using Proxy: jupyterhub.proxy.ConfigurableHTTPProxy-5.0.0
[I 2024-08-06 18:21:11.741 JupyterHub <string>:109] Loading managed roles
[E 2024-08-06 18:21:11.742 JupyterHub app:3852]
    Traceback (most recent call last):
      File "/opt/conda/lib/python3.9/site-packages/jupyterhub/app.py", line 3849, in launch_instance_async
        await self.initialize(argv)
      File "/opt/conda/lib/python3.9/site-packages/jupyterhub/app.py", line 3332, in initialize
        await self.init_role_creation()
      File "/opt/conda/lib/python3.9/site-packages/jupyterhub/app.py", line 2286, in init_role_creation
        managed_roles = await self.authenticator.load_managed_roles()
      File "<string>", line 114, in load_managed_roles
      File "<string>", line 221, in _get_token
    tornado.curl_httpclient.CurlError: HTTP 599: Failed to connect to localhost port 443 after 0 ms: Connection refused

/etc/hosts: 172.18.1.100 localhost

FWIW, on my first attempt I used demo.com rather than localhost and encountered similar issues. I switched to localhost to match the original bug report. (I did make sure the kind cluster was deleted before trying again.)

I even tried messing with CoreDNS:

 kubectl get configmap coredns -n kube-system -o yaml
apiVersion: v1
data:
  Corefile: |
    .:53 {
        errors
        health {
           lameduck 5s
        }
        ready
        kubernetes localhost in-addr.arpa ip6.arpa {
           pods insecure
           fallthrough in-addr.arpa ip6.arpa
           ttl 30
        }
        prometheus :9153
        forward . /etc/resolv.conf {
           max_concurrent 1000
        }
        cache 30
        loop
        reload
        loadbalance
    }
kind: ConfigMap
metadata:
  creationTimestamp: "2024-08-06T18:02:48Z"
  name: coredns
  namespace: kube-system
  resourceVersion: "5010"
  uid: c45ab300-4877-468d-8926-989a5fcf953e

If anyone can help me get this set up, I'm happy to update the docs!

viniciusdc commented 3 months ago

Hi @asmacdo @arjxn-py, sorry for the delay in following up; let's go through this in parts.

This does not directly address your problem, but it helps as general service-dependency context: if any of these resources gets into a crash loop, the usual dependency order between them is:

```mermaid
graph TB
   A[keycloak-postgresql]
   B[keycloak]
   C[jupyterhub]
   D[argo]
   E[conda-store]
   F[conda-store-worker]
   G[jupyterlab-user-pod]
   H(shared-efs-pvc)
   I(conda-store-pvc)
   A --> B
   B --> E
   E --> C
   E --> D
   E --> F
   F -->|volumeMount|I
   C --> G
   G -->|volumeMount|H
   G -->|volumeMount| I
```
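In practice that means working up the chain from whichever pod is failing: if jupyterhub or argo are crash-looping, check keycloak and conda-store first. Something along these lines (namespace dev as in your config; the hub deployment name is assumed from the pod list above):

```bash
# Work up the dependency chain: keycloak -> conda-store -> jupyterhub/argo
kubectl get pods -n dev
kubectl logs -n dev keycloak-0 --tail=50
kubectl logs -n dev deploy/hub --tail=50   # JupyterHub hub pod (deployment name assumed)
```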

But the deploy still isn't able to complete; it gets stuck and times out with these logs:

I confirmed this error was happening because resources from the previous deploy attempt were not deleted. nebari destroy does the job; should we also include this as a note in the local deployment docs?

However, after this error was resolved, I got another error, which comes every time just after nebari-conda-store-server finishes creating:

[terraform]: │ Error: client rate limiter Wait returned an error: context deadline exceeded
[terraform]: │
[terraform]: │   with module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main,
[terraform]: │   on modules/kubernetes/services/conda-store/worker.tf line 30, in resource "kubernetes_persistent_volume_claim" "main":
[terraform]: │   30: resource "kubernetes_persistent_volume_claim" "main" {

Full Logs

I'm really sorry to be bothering you again; local deployment is something not many people from the community have been doing. Thanks a lot for your help 💐

There are two things here that could lead to this. Next time it happens, could you check for:

viniciusdc commented 3 months ago

@asmacdo regarding this:

[terraform]: │ Error: timed out waiting for the condition
[terraform]: │
[terraform]: │   with module.argo-workflows[0].helm_release.argo-workflows,
[terraform]: │   on modules/kubernetes/services/argo-workflows/main.tf line 10, in resource "helm_release" "argo-workflows":
[terraform]: │   10: resource "helm_release" "argo-workflows" {

[terraform]: │ Error: timed out waiting for the condition
[terraform]: │
[terraform]: │   with module.jupyterhub.helm_release.jupyterhub,
[terraform]: │   on modules/kubernetes/services/jupyterhub/main.tf line 54, in resource "helm_release" "jupyterhub":
[terraform]: │   54: resource "helm_release" "jupyterhub" {

The whole problem comes from this:

Is localhost an acceptable domain?

Unfortunately it is not. The reason is that while you can expose Traefik on it, the other pods, when given that domain name (because each process/service runs in its own Linux container), will try to reach their own internal localhost endpoints, which causes confusion, as you can see from the Argo logs for example:

tornado.curl_httpclient.CurlError: HTTP 599: Failed to connect to localhost port 443 after 0 ms: Connection refused

It's trying to request localhost:443 internally within the pod network.

The domain (which needs to be unique) should act as a mask for an external IP, and it's Traefik's job to forward/proxy to it. That said, for a local deployment, if the purpose is only to test things out, I would recommend leaving the domain empty, i.e. removing it from your nebari-config.yaml file. You will end up with an address such as:

https://172.18.1.100/hub

which works just fine (except for the Dask dashboard and the notebook scheduler extension). If you really need a working DNS, you can check https://www.nebari.dev/docs/how-tos/domain-registry#what-is-a-dns for examples of how to set that up with Cloudflare. It might also work if you add a subdomain to localhost, e.g. nebari.localhost, but that can also end up being caught by each pod's internal hosts handling; it's worth trying out though.
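If you do want a hostname rather than the bare IP, one rough option (untested here, and the hostname below is just a placeholder) is to point an /etc/hosts entry at the load balancer address and set that as the Nebari domain:

```bash
# Map a placeholder hostname to the kind load balancer IP reported by Terraform
echo "172.18.1.100 nebari.localdev.test" | sudo tee -a /etc/hosts

# Then set `domain: nebari.localdev.test` in nebari-config.yaml, redeploy, and check:
curl -vk https://nebari.localdev.test/hub
```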

I would love it if this could be added to our docs: https://github.com/nebari-dev/nebari-docs

asmacdo commented 3 months ago

@viniciusdc thank you, dropping the domain worked; I've updated the docs based on your comments :)

@arjxn-py I suspect using localhost may have contributed to your problem. I marked the docs update as related to this issue (rather than as fixing it) since I'm not sure it will fix things for you.