Closed asg0451 closed 5 months ago
Yay for getting past the previous error!
@asg0451 , is the shutdown happening before you even connect, or only when you try pinging the server? If you're able to ssh into the host machine, are you able to ping it over localhost?
Do you mind also sharing your machine specs?
I haven't seen this error before, let me search around a bit.
Can you also ensure that the machine has access to the internet? By the error, it seems to be a DNS error, but I'm not sure what exactly it could be.
Also try setting the env variable KHOJ_DEBUG=True to possibly get more verbose errors.
Thanks @sabaimran
The shutdown happens as soon as I connect to the machine, either by port-forwarding to localhost and connecting as localhost:..., or by connecting to its public-facing site (specified in the KHOJ_DOMAIN env var above).
This is all on k8s, so the pod definitely has access to DNS and the internet, and it all should work lol. Other pods work just fine.
K8s on x86_64 (Linux).
With KHOJ_DEBUG=True, the output is basically the same:
[15:58:18.693772] INFO 🚒 Initializing Khoj v1.9.0 main.py:108
[15:58:18.696562] INFO 📦 Initializing DB: main.py:109
Operations to perform:
Apply all migrations: admin, auth,
contenttypes, database, sessions
Running migrations:
No migrations to apply.
[15:58:18.697729] DEBUG 🌍 Initializing Web Client: main.py:110
180 static files copied to
'/app/src/khoj/static'.
[15:58:18.701942] INFO 🌘 Starting Khoj main.py:122
[15:58:18.914279] INFO 🚨 Khoj is not configured. configure.py:197
Initializing it with a default
config.
/usr/local/lib/python3.10/dist-packages/torch/_utils.py:776: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This sho
return self.fget.__get__(instance, owner)()
[15:58:34.780223] INFO No default conversation config found, __init__.py:447
skipping default agent creation
[15:58:34.788484] INFO 📬 No-op... configure.py:248
[15:58:34.789352] INFO 🌖 Khoj is ready to use main.py:159
[15:58:34.796329] INFO Started server process [1] server.py:75
[15:58:34.797219] INFO Waiting for application startup. on.py:45
[15:58:34.798039] INFO Application startup complete. on.py:59
[15:58:34.799050] ERROR [Errno -2] Name or service not known server.py:156
[15:58:34.799859] INFO Waiting for application shutdown. on.py:64
[15:58:34.800547] INFO Application shutdown complete. on.py:75
(You can see it start up at :18; then I opened it via localhost in my browser at :34 and it dies.)
One other thing: while it initially doesn't crash until I connect to it, afterwards it does seem to crash-loop on its own. Possibly due to other traffic reaching it from elsewhere, but I'm not sure. The logs are the same.
Actually, it does seem like the connecting thing may have been a coincidence. It's definitely just crash-looping on its own :/
One additional clue is that for the first crash only, I saw in the logs: Use offline chat model? (y/n):
and some other stuff. So maybe the fact that it was started in a non-interactive way (no stdin) created an invalid config for all future starts?
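If a non-interactive start really were the culprit, the usual guard (a generic sketch, not Khoj's actual code) is to prompt only when stdin is a real terminal and otherwise fall back to a default:

```python
import sys

def ask_yes_no(prompt: str, default: bool = False) -> bool:
    """Prompt only when stdin is an interactive TTY.

    In a container with no stdin attached, input() would hit EOF
    (or block forever), so fall back to a sane default instead.
    """
    if not sys.stdin.isatty():
        return default
    return input(prompt).strip().lower().startswith("y")
```

That way a headless start never persists an interactively half-chosen value into the config.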
Thanks for the debugging info!
For reference, this is what it looks like when it successfully starts up. Just did this on a clean install:
khoj-test-server-1 | [05:37:53.191273] INFO 🌖 Khoj is ready to use main.py:159
khoj-test-server-1 | [05:37:53.197530] INFO Started server process [1] server.py:75
khoj-test-server-1 | [05:37:53.198166] INFO Waiting for application startup. on.py:45
khoj-test-server-1 | [05:37:53.198730] INFO Application startup complete. on.py:59
khoj-test-server-1 | [05:37:53.199286] INFO Uvicorn running on http://0.0.0.0:42110 server.py:206
Interestingly, it's never getting to a state where the uvicorn server even starts running on your machine. Given the logs, it's getting right up to that last step before it dies.
Non-interactive startup shouldn't be an issue. As per the instructions, you should have set up KHOJ_ADMIN_EMAIL and KHOJ_ADMIN_PASSWORD in your environment for the admin credentials to work.
Would you mind sending me your full docker-compose.yml (removing any personally identifying information or secrets)? You can send it to saba@khoj.dev. I'll check it for any possible formatting issues and test it locally. If I'm not able to reproduce it, it may be something particular to the k8s environment.
If you can share the specs (RAM, CPU, GPU, etc) of the machine, this can also rule out any issues related to machine constraints.
More specifically, the error is being thrown here in the uvicorn library:
server = await loop.create_server(
    create_protocol,
    host=config.host,
    port=config.port,
    ssl=config.ssl,
    backlog=config.backlog,
)
@asg0451, I have a repro! Are you providing the CLI args in your docker-compose.yml in this format? It must be like this; it can't be in the list-style format you shared in the first message.
command: --host="0.0.0.0" --port=42110 -v --anonymous-mode
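For anyone hitting this later, the failure mode can be reproduced in isolation (a standalone asyncio sketch, not Khoj's actual code): if shell-style quote characters survive into the host string, getaddrinfo treats the whole thing, quotes included, as a hostname and fails instead of binding:

```python
import asyncio
import socket

async def main():
    loop = asyncio.get_running_loop()
    try:
        # The literal quote characters become part of the hostname, so
        # name resolution fails instead of binding to 0.0.0.0.
        await loop.create_server(asyncio.Protocol, host='"0.0.0.0"', port=42110)
    except socket.gaierror as err:
        print(err)  # e.g. [Errno -2] Name or service not known

asyncio.run(main())
```

That matches the [Errno -2] Name or service not known line in the crash logs above.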
Thanks @sabaimran. Here is the k8s manifest I'm using:
apiVersion: v1
kind: Namespace
metadata:
  name: khoj
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: database
  namespace: khoj
spec:
  selector:
    matchLabels:
      app: database
  serviceName: database
  replicas: 1
  template:
    metadata:
      labels:
        app: database
    spec:
      containers:
        - name: database
          image: pgvector/pgvector:pg16
          env:
            - name: POSTGRES_USER
              value: postgres
            - name: POSTGRES_PASSWORD
              value: postgres
            - name: POSTGRES_DB
              value: postgres
          ports:
            - containerPort: 5432
              name: psql
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data/
          resources:
            requests:
              memory: "128Mi"
              cpu: "500m"
            limits:
              memory: "2Gi"
              cpu: "1"
          readinessProbe:
            exec:
              command:
                - pg_isready
                - -U
                - postgres
            initialDelaySeconds: 10
            periodSeconds: 30
            timeoutSeconds: 10
            successThreshold: 1
            failureThreshold: 3
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi
# headless svc
---
apiVersion: v1
kind: Service
metadata:
  name: database
  namespace: khoj
spec:
  clusterIP: None
  selector:
    app: database
  ports:
    - port: 5432
      targetPort: 5432
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: khoj
  namespace: khoj
spec:
  selector:
    matchLabels:
      app: khoj
  serviceName: khoj
  replicas: 1
  template:
    metadata:
      labels:
        app: khoj
    spec:
      containers:
        - name: khoj
          image: ghcr.io/khoj-ai/khoj:latest
          ports:
            - containerPort: 42110
              name: web
          volumeMounts:
            - name: config
              mountPath: /root/.khoj
            - name: models
              mountPath: /root/.cache/torch/sentence_transformers
          env:
            - name: POSTGRES_DB
              value: postgres
            - name: POSTGRES_USER
              value: postgres
            - name: POSTGRES_PASSWORD
              value: postgres
            - name: POSTGRES_HOST
              value: database-0.database.khoj.svc.cluster.local
            - name: POSTGRES_PORT
              value: "5432"
            - name: KHOJ_DJANGO_SECRET_KEY
              value: secret
            - name: KHOJ_ADMIN_EMAIL
              value: miles.frankel@gmail.com
            - name: KHOJ_ADMIN_PASSWORD
              value: password
            - name: KHOJ_DOMAIN
              value: "khoj.beagle-chickadee.ts.net"
            - name: KHOJ_DEBUG
              value: "True"
          resources:
            requests:
              memory: "1Gi"
              cpu: "100m"
            limits:
              memory: "2Gi"
              cpu: "1"
          args:
            - --host="0.0.0.0"
            - --port=42110
            - -v
            - --anonymous-mode
  volumeClaimTemplates:
    - metadata:
        name: models
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 30Gi
        storageClassName: local-path
    - metadata:
        name: config
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 2Gi
---
apiVersion: v1
kind: Service
metadata:
  name: khoj
  namespace: khoj
spec:
  selector:
    app: khoj
  ports:
    - port: 42110
      targetPort: 42110
You should be able to reproduce the issue by starting a kind or minikube local cluster and applying the above manifest with kubectl --context kind-kind apply -f <file.yaml>, then waiting for the pods to start (the image pull takes a bit) with kubectl --context kind-kind -n khoj get pod/khoj-0, then watching the logs with kubectl --context kind-kind -n khoj logs -f pods/khoj-0
Ahh, okay @asg0451. I have very limited experience with k8s, so take this with a grain of salt, but can you update your syntax to this format and try again? The quotes might be messing it up. I'm certain the error is in the way the args are being passed, if you want to experiment with this.
args:
  - --host=0.0.0.0
  - --port=42110
  - -v
  - --anonymous-mode
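For background on why the quotes matter: Kubernetes passes args to the container verbatim via exec, with no shell in between, so quotes written in the YAML arrive as literal characters in argv. A quick way to see the difference (plain Python, nothing Khoj-specific):

```python
import subprocess

# Through a shell, the quotes are stripped before the program sees argv:
via_shell = subprocess.run(
    'printf "%s" --host="0.0.0.0"',
    shell=True, capture_output=True, text=True,
).stdout
print(via_shell)  # --host=0.0.0.0

# An argv list (effectively what Kubernetes does) passes each argument
# verbatim, so the quotes arrive as part of the argument:
via_exec = subprocess.run(
    ["printf", "%s", '--host="0.0.0.0"'],
    capture_output=True, text=True,
).stdout
print(via_exec)  # --host="0.0.0.0"
```

The same applies to docker-compose: a string command goes through a shell-style parser, while a YAML list does not.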
Ugh, that's definitely the issue. Thanks for the help lol, working fine now 😅
One last little thing -- I was unable to open the Django settings via my custom domain. After being presented with a login screen, I get this error.
Workaround: port-forward from localhost.
Yeah, a bunch of other folks have also complained about this. We need to check why, how to resolve this for custom domains. Just haven't gotten around to it yet. Glad the port-forward from localhost provided a workaround.
I'll close this issue for now given your original issue of using k8s to self-host Khoj got resolved. But feel free to reopen it if I missed anything
Trying again after my previous issue (https://github.com/khoj-ai/khoj/issues/684) -- now I encounter another problem. Khoj starts up properly, but the container dies as soon as I try to connect to it.
I'm running it in k8s with
env (part of it):
(though changing or removing this doesn't seem to make a difference)
flags:
self-hosted, k8s, docker image, latest, linux x86-64