arjxn-py opened 4 months ago
After the community meeting on Tuesday, I've been trying to locally deploy nebari while following https://www.nebari.dev/docs/how-tos/nebari-local/
Below is the configuration file, i.e. nebari-config.yaml:
provider: local
namespace: dev
nebari_version: 0.1.dev1567+g7396df8.d20240624
project_name: localnebari
domain: localhost
ci_cd:
  type: none
terraform_state:
  type: local
security:
  keycloak:
    initial_root_password: b5n9wa51a7o9hzfkuc4zg81pw6pd28im
  authentication:
    type: password
theme:
  jupyterhub:
    hub_title: Nebari - localnebari
    welcome: Welcome! Learn about Nebari's features and configurations in <a href="https://www.nebari.dev/docs/welcome">the
      documentation</a>. If you have any questions or feedback, reach the team on
      <a href="https://www.nebari.dev/docs/community#getting-support">Nebari's support
      forums</a>.
    hub_subtitle: Your open source data science platform, hosted
local:
  kube_context:
  node_selectors:
    general:
      key: kubernetes.io/os
      value: linux
    user:
      key: kubernetes.io/os
      value: linux
    worker:
      key: kubernetes.io/os
      value: linux
There is a strong possibility that I might have been doing something wrong, so I'm happy to get your help here. Thanks!
Hey @arjxn-py
One reason you might be seeing this is because:
Docker containers cannot be executed natively on macOS and Windows, therefore Docker Desktop runs them in a Linux VM. As a consequence, the container networks are not exposed to the host and you cannot reach the kind nodes via IP.
See: https://kind.sigs.k8s.io/docs/user/known-issues/#docker-desktop-for-macos-and-windows
I'm not sure whether this would happen on WSL2, but it's the same error we were getting before on macOS because the nodes were not being exposed. For Mac, we solved this using https://github.com/chipmk/docker-mac-net-connect
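(From memory of that project's README, the macOS setup is roughly the following; treat the tap name as an assumption to double-check against the repo:)
brew install chipmk/tap/docker-mac-net-connect          # assumed tap name from the project README
sudo brew services start chipmk/tap/docker-mac-net-connect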
Are you using Docker or Docker Desktop on WSL2? I haven't seen anyone deploy Nebari locally on WSL2, but in theory it should be possible as long as the nodes are being exposed.
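(One hedged way to check whether the kind nodes are actually exposed to the host; the container name below assumes kind's default <cluster>-control-plane naming for the test-cluster used here:)
docker network inspect kind                       # lists the container IPs on the kind network
docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' test-cluster-control-plane
ping -c 3 <node-ip>                               # from the host; if this fails, the nodes are not reachable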
Also, how was your Docker installed? Are you using Docker Desktop?
Thanks for your response @marcelovilla @viniciusdc, and apologies for the delay, as I have also been looking into this to fix it locally and then report back.
Yes, I'm using Docker Desktop with WSL2 on Windows.
Hi, I was able to get past the above error by uninstalling Docker Desktop and switching to Docker Engine installed natively on WSL. Should we also add a note in the docs about this so that other users or contributors do not run into the same issue?
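(For reference, a hedged sketch of installing the native engine on a WSL Ubuntu distro via Docker's convenience script:)
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
sudo usermod -aG docker $USER      # log out/in (or restart WSL) for the group change to apply
docker info                        # should now report the native engine rather than Docker Desktop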
But the deploy is still not able to complete, as it gets stuck and times out with these logs:
I'd be more than happy to try anything you'd suggest. Thanks!
@arjxn-py Try deleting and recreating the test cluster. Something like kind delete cluster --name test-cluster, and then kind create cluster --name test-cluster.
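(A quick hedged check that the delete/recreate actually went through:)
kind get clusters                  # test-cluster should disappear after the delete and reappear after the create
kubectl get nodes                  # the kind node(s) should eventually report Ready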
Thanks Adam, will try that & get back :)
But the deploy is still not able to complete, as it gets stuck and times out with these logs:
I confirmed this error was happening because resources from the previous deploy attempt had not been deleted. nebari destroy does the trick; should we also include this as a note in the local deployment docs?
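(A minimal sketch of that cleanup-then-redeploy cycle, using the same flags that appear elsewhere in this thread:)
nebari destroy -c nebari-config.yaml                 # remove resources left over from the failed attempt
nebari deploy -c nebari-config.yaml --disable-prompt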
However, after that error was resolved, I got another error that shows up every time right after nebari-conda-store-server finishes creating, saying:
[terraform]: │ Error: client rate limiter Wait returned an error: context deadline exceeded
[terraform]: │
[terraform]: │ with module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main,
[terraform]: │ on modules/kubernetes/services/conda-store/worker.tf line 30, in resource "kubernetes_persistent_volume_claim" "main":
[terraform]: │ 30: resource "kubernetes_persistent_volume_claim" "main" {
I'm really sorry to be bothering you again, as local deployment is something that not many people from the community have been doing. Thanks a lot for your help!
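(For a timeout like this on the conda-store PVC, a hedged sketch of how to check whether the claim ever gets bound and what is blocking it:)
kubectl get pvc -n dev                                          # is the conda-store claim stuck in Pending?
kubectl describe pvc -n dev                                     # events usually name the StorageClass/provisioner problem
kubectl get events -n dev --sort-by=.lastTimestamp | tail -20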
Similar issue on Linux (I used the same commands as above for easy comparison):
nebari init local --project localnebari --domain localhost --auth-provider password --terraform-state=local
nebari deploy -c nebari-config.yaml --disable-prompt
uname -a
Linux fancy 6.8.9-100.fc38.x86_64 #1 SMP PREEMPT_DYNAMIC Thu May 2 18:50:49 UTC 2024 x86_64 GNU/Linux
docker --version
Docker version 26.1.4, build 5650f9b
kind --version
kind version 0.14.0
nebari --version
2024.6.1
Terraform error:
[terraform]: │ Warning: "default_secret_name" is no longer applicable for Kubernetes v1.24.0 and above
[terraform]: │
[terraform]: │ with module.argo-workflows[0].kubernetes_service_account_v1.argo-admin-sa,
[terraform]: │ on modules/kubernetes/services/argo-workflows/main.tf line 188, in resource "kubernetes_service_account_v1" "argo-admin-sa":
[terraform]: │ 188: resource "kubernetes_service_account_v1" "argo-admin-sa" {
[terraform]: │
[terraform]: │ Starting from version 1.24.0 Kubernetes does not automatically generate a
[terraform]: │ token for service accounts, in this case, "default_secret_name" will be
[terraform]: │ empty
[terraform]: │
[terraform]: │ (and 5 more similar warnings elsewhere)
[terraform]: ╵
[terraform]: ╷
[terraform]: │ Error: timed out waiting for the condition
[terraform]: │
[terraform]: │ with module.argo-workflows[0].helm_release.argo-workflows,
[terraform]: │ on modules/kubernetes/services/argo-workflows/main.tf line 10, in resource "helm_release" "argo-workflows":
[terraform]: │ 10: resource "helm_release" "argo-workflows" {
[terraform]: │
[terraform]: ╵
[terraform]: ╷
[terraform]: │ Error: timed out waiting for the condition
[terraform]: │
[terraform]: │ with module.jupyterhub.helm_release.jupyterhub,
[terraform]: │ on modules/kubernetes/services/jupyterhub/main.tf line 54, in resource "helm_release" "jupyterhub":
[terraform]: │ 54: resource "helm_release" "jupyterhub" {
[terraform]: │
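(The Helm "timed out waiting for the condition" errors usually just mean the release's pods never became Ready; a hedged sketch of how to find them, which lines up with the pod listing below:)
kubectl get pods -n dev                             # look for pods that are not Running/Ready
kubectl describe pod <failing-pod> -n dev           # events often explain why
kubectl logs <failing-pod> -n dev --previous        # logs from the last crashed container
helm list -n dev                                    # releases stuck in a failed/pending state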
nebari-config.yaml
I have 2 pods failing
NAME READY STATUS RESTARTS AGE
alertmanager-nebari-kube-prometheus-sta-alertmanager-0 2/2 Running 0 9m12s
argo-workflows-server-585dd7f586-gcrxm 0/1 Error 3 (30s ago) 49s
argo-workflows-workflow-controller-586dcfd8f7-8rggr 1/1 Running 0 10m
continuous-image-puller-bd4bc 1/1 Running 0 7m53s
forwardauth-deployment-7c454d6758-zdmxm 1/1 Running 0 10m
hub-6c4f884494-lvq59 0/1 CrashLoopBackOff 6 (81s ago) 7m53s
keycloak-0 1/1 Running 0 11m
keycloak-postgresql-0 1/1 Running 0 11m
loki-backend-0 2/2 Running 0 9m15s
loki-canary-gs754 1/1 Running 0 9m15s
loki-gateway-bf4d7b485-nbqgn 1/1 Running 0 9m15s
loki-read-f4c8cc665-cdmfw 1/1 Running 0 9m15s
loki-write-0 1/1 Running 0 9m15s
nebari-conda-store-minio-66688b88d8-pvdlz 1/1 Running 0 9m38s
nebari-conda-store-postgresql-postgresql-0 1/1 Running 0 9m46s
nebari-conda-store-redis-master-0 1/1 Running 0 9m33s
nebari-conda-store-server-78db858d79-lc9wm 1/1 Running 0 8m57s
nebari-conda-store-worker-d6bc68778-mqqk4 2/2 Running 0 8m57s
nebari-daskgateway-controller-74db8c5df-6zfcn 1/1 Running 0 8m15s
nebari-daskgateway-gateway-57bb7cd7d4-rbl87 1/1 Running 0 8m15s
nebari-grafana-7759cd44bc-tvnvv 3/3 Running 0 9m15s
nebari-jupyterhub-sftp-68d8999fd7-xcq28 1/1 Running 0 9m51s
nebari-jupyterhub-ssh-f47499886-spsfx 1/1 Running 0 10m
nebari-kube-prometheus-sta-operator-77cbbffb7d-7m7ng 1/1 Running 0 9m15s
nebari-kube-state-metrics-65b8c8fd48-krh4c 1/1 Running 0 9m15s
nebari-loki-minio-647979b7c5-8j4hb 1/1 Running 0 9m50s
nebari-prometheus-node-exporter-b9846 1/1 Running 0 9m15s
nebari-promtail-c6zbv 1/1 Running 0 7m55s
nebari-traefik-ingress-75f6d994dd-b6dkx 1/1 Running 0 18m
nebari-workflow-controller-5d98dbf8bd-8jhxv 1/1 Running 0 10m
nfs-server-nfs-6b8c9cd476-nk8v7 1/1 Running 0 10m
prometheus-nebari-kube-prometheus-sta-prometheus-0 2/2 Running 0 9m12s
proxy-f9ffcfd97-zpglq 1/1 Running 0 7m53s
user-scheduler-849d4f8d95-59slv 1/1 Running 0 7m53s
user-scheduler-849d4f8d95-pkn6w 1/1 Running 0 7m53s
The Argo Workflows server is getting connection refused:
k logs argo-workflows-server-585dd7f586-gcrxm -n dev
time="2024-08-06T18:23:12.618Z" level=info msg="not enabling pprof debug endpoints"
time="2024-08-06T18:23:12.618Z" level=info authModes="[sso client]" baseHRef=/argo/ managedNamespace=dev namespace=dev secure=false ssoNamespace=dev
time="2024-08-06T18:23:12.618Z" level=warning msg="You are running in insecure mode. Learn how to enable transport layer security: https://argoproj.github.io/argo-workflows/tls/"
Error: Get "https://localhost/auth/realms/nebari/.well-known/openid-configuration": dial tcp [::1]:443: connect: connection refused
Usage:
argo server [flags]
Examples:
<helptext snip>
The JupyterHub pod is failing similarly:
k logs hub-6c4f884494-lvq59 -n dev
Loading /usr/local/etc/jupyterhub/secret/values.yaml
No config at /usr/local/etc/jupyterhub/existing-secret/values.yaml
Loading extra config: 01-theme.py
Loading extra config: 02-spawner.py
Loading extra config: 03-profiles.py
Loading extra config: 04-auth.py
[W 2024-08-06 18:21:11.668 JupyterHub app:1697] JupyterHub.extra_handlers is deprecated in JupyterHub 3.1. Please use JupyterHub services to register additional HTTP endpoints.
[I 2024-08-06 18:21:11.668 JupyterHub app:3286] Running JupyterHub version 5.0.0
[I 2024-08-06 18:21:11.668 JupyterHub app:3316] Using Authenticator: builtins.KeyCloakOAuthenticator
[I 2024-08-06 18:21:11.668 JupyterHub app:3316] Using Spawner: kubespawner.spawner.KubeSpawner-4.2.0
[I 2024-08-06 18:21:11.668 JupyterHub app:3316] Using Proxy: jupyterhub.proxy.ConfigurableHTTPProxy-5.0.0
[I 2024-08-06 18:21:11.741 JupyterHub <string>:109] Loading managed roles
[E 2024-08-06 18:21:11.742 JupyterHub app:3852]
Traceback (most recent call last):
File "/opt/conda/lib/python3.9/site-packages/jupyterhub/app.py", line 3849, in launch_instance_async
await self.initialize(argv)
File "/opt/conda/lib/python3.9/site-packages/jupyterhub/app.py", line 3332, in initialize
await self.init_role_creation()
File "/opt/conda/lib/python3.9/site-packages/jupyterhub/app.py", line 2286, in init_role_creation
managed_roles = await self.authenticator.load_managed_roles()
File "<string>", line 114, in load_managed_roles
File "<string>", line 221, in _get_token
tornado.curl_httpclient.CurlError: HTTP 599: Failed to connect to localhost port 443 after 0 ms: Connection refused
/etc/hosts:
172.18.1.100 localhost
FWIW, on my first attempt I used demo.com rather than localhost and encountered similar issues; I used localhost here to match the original bug report. (I did make sure the kind cluster was deleted before trying again.)
I even tried messing with coredns
kubectl get configmap coredns -n kube-system -o yaml
apiVersion: v1
data:
  Corefile: |
    .:53 {
        errors
        health {
           lameduck 5s
        }
        ready
        kubernetes localhost in-addr.arpa ip6.arpa {
           pods insecure
           fallthrough in-addr.arpa ip6.arpa
           ttl 30
        }
        prometheus :9153
        forward . /etc/resolv.conf {
           max_concurrent 1000
        }
        cache 30
        loop
        reload
        loadbalance
    }
kind: ConfigMap
metadata:
  creationTimestamp: "2024-08-06T18:02:48Z"
  name: coredns
  namespace: kube-system
  resourceVersion: "5010"
  uid: c45ab300-4877-468d-8926-989a5fcf953e
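(If you do edit that ConfigMap, CoreDNS has to pick the change up; a hedged sketch:)
kubectl -n kube-system edit configmap coredns
kubectl -n kube-system rollout restart deployment coredns
kubectl -n kube-system get pods -l k8s-app=kube-dns     # wait for the restarted pods to become Ready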
Is localhost an acceptable domain? If anyone can help me get this set up, I'm happy to update the docs!
Hi @asmacdo @arjxn-py, sorry for the delay in following up. Let's go through this in parts:
This does not directly address your problem, but it helps as general service-dependency context: if any of these resources gets into a crash loop, the usual order of internal dependencies between the resources is:
graph TB
A[keycloak-postgresql]
B[keycloak]
C[jupyterhub]
D[argo]
E[conda-store]
F[conda-store-worker]
G[jupyterlab-user-pod]
H(shared-efs-pvc)
I(conda-store-pvc)
A --> B
B --> E
E --> C
E --> D
E --> F
F -->|volumeMount|I
C --> G
G -->|volumeMount|H
G -->|volumeMount| I
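(In practice that means debugging from the top of the graph down; a hedged sketch using the pod/deployment names from the listing above:)
kubectl logs keycloak-postgresql-0 -n dev
kubectl logs keycloak-0 -n dev
kubectl logs deploy/nebari-conda-store-server -n dev
kubectl logs deploy/hub -n dev                      # JupyterHub
# a failure higher in the chain usually explains crash loops further down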
But the deploy is still not able to complete, as it gets stuck and times out with these logs:
I confirmed this error was happening because resources from the previous deploy attempt had not been deleted. nebari destroy does the trick; should we also include this as a note in the local deployment docs? However, after this error was resolved, I got another error that shows up every time right after nebari-conda-store-server finishes creating, saying:
[terraform]: │ Error: client rate limiter Wait returned an error: context deadline exceeded
[terraform]: │ with module.kubernetes-conda-store-server.kubernetes_persistent_volume_claim.main,
[terraform]: │ on modules/kubernetes/services/conda-store/worker.tf line 30, in resource "kubernetes_persistent_volume_claim" "main":
[terraform]: │ 30: resource "kubernetes_persistent_volume_claim" "main" {
Full Logs
I'm really sorry to be bothering you again, as local deployment is something that not many people from the community have been doing. Thanks a lot for your help!
There are two things here that could lead to this. Next time this happens, could you check for:
@asmacdo, regarding this:
terraform: │ Error: timed out waiting for the condition
terraform: │ with module.argo-workflows[0].helm_release.argo-workflows,
terraform: │ on modules/kubernetes/services/argo-workflows/main.tf line 10, in resource "helm_release" "argo-workflows":
terraform: │ 10: resource "helm_release" "argo-workflows" {
terraform: │ Error: timed out waiting for the condition
terraform: │ with module.jupyterhub.helm_release.jupyterhub,
terraform: │ on modules/kubernetes/services/jupyterhub/main.tf line 54, in resource "helm_release" "jupyterhub":
terraform: │ 54: resource "helm_release" "jupyterhub" {
The whole problem comes from this:
Is localhost an acceptable domain?
Unfortunately it is not, and the reason is that while you can expose Traefik on it, the other pods, when they receive that domain name (each process/service runs in its own Linux container), will try to reach their own internal localhost endpoints, which causes confusion, as you can see in the Argo logs for example:
tornado.curl_httpclient.CurlError: HTTP 599: Failed to connect to localhost port 443 after 0 ms: Connection refused
It's trying to request localhost:443 internally, within the pod network.
The domain (which needs to be unique*) should act as a mask for an external IP, and it's Traefik's job to forward/proxy it. That said, for a local deployment, if the purpose is only testing things out, I would recommend just leaving the domain empty, i.e. removing it from your nebari-config.yaml file. You will then end up with an address like https://172.18.1.100/hub, which works just fine (except for the Dask dashboard and the notebook scheduler extension).
If you really need a working DNS name, check https://www.nebari.dev/docs/how-tos/domain-registry#what-is-a-dns for some examples of how to set that up with Cloudflare. It might also work if you add a subdomain to localhost, e.g. nebari.localhost, but that can also end up being caught by each pod's internal hosts resolution; still, it's worth trying out.
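(A minimal sketch of that change against the config at the top of this thread; sed is just one way to drop the top-level domain: line before redeploying:)
sed -i '/^domain:/d' nebari-config.yaml
nebari deploy -c nebari-config.yaml --disable-prompt
# afterwards the services are reachable on the load-balancer IP, e.g. https://172.18.1.100/hub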
I would love it if this could be added to our docs, https://github.com/nebari-dev/nebari-docs
@viniciusdc thank you, dropping the domain worked, I've updated the docs from your comments :)
@arjxn-py I suspect using localhost may have contributed to your problem. I marked this issue as related to that one (not as fixing it), since I'm not sure whether it will resolve things for you.
Context
I am encountering issues connecting to the Kubernetes ingress host (172.18.1.100:80) after successfully deploying Nebari locally using the "kind" Kubernetes setup. This issue occurs despite the Terraform apply completing without errors.
Error Messages
After deployment, I attempted to connect to the ingress host and received the following error messages:
Steps to Reproduce
Initialize Nebari in local mode with the following command:
Deploy Nebari using the generated nebari-config.yaml file:
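(The commands for these two steps are presumably the same ones shown earlier in this thread:)
nebari init local --project localnebari --domain localhost --auth-provider password --terraform-state=local
nebari deploy -c nebari-config.yaml --disable-prompt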
Value and/or benefit
Streamlined local deployment process.
Anything else?
Environment Details
Nebari Version: built from source
Kubernetes Version: v0.18.0
Operating System: Ubuntu 22.04 on WSL2
Docker Engine Version: 26.1.4