Hi @aaronlewisblenchal, just to double check, which version of Hatchet docker image are you running?
Hey @abelanger5, I was using v0.32.0 earlier but realised your changes were merged into main in v0.40.0. I tried installing v0.40.0, but I'm getting the below error:
Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: "/hatchet/hatchet-engine": stat /hatchet/hatchet-engine: no such file or directory: unknown
Ah yeah, we introduced this recently. Could you share how you're installing? I've confirmed that /hatchet/hatchet-engine should exist in the containers.
Yeah sure, I have followed the steps mentioned in the Kubernetes quickstart. Pasting the contents of the YAML file below:
api:
  enabled: true
  image:
    repository: ghcr.io/hatchet-dev/hatchet/hatchet-api
    tag: v0.40.0
    pullPolicy: Always
  env:
    SERVER_AUTH_COOKIE_SECRETS: "<secret>"
    SERVER_ENCRYPTION_MASTER_KEYSET: "<keyset>"
    SERVER_ENCRYPTION_JWT_PRIVATE_KEYSET: "<keyset>"
    SERVER_ENCRYPTION_JWT_PUBLIC_KEYSET: "<keyset>"
engine:
  enabled: true
  image:
    repository: ghcr.io/hatchet-dev/hatchet/hatchet-engine
    tag: v0.40.0
    pullPolicy: Always
  env:
    SERVER_AUTH_COOKIE_SECRETS: "<secret>"
    SERVER_ENCRYPTION_MASTER_KEYSET: "<keyset>"
    SERVER_ENCRYPTION_JWT_PRIVATE_KEYSET: "<keyset>"
    SERVER_ENCRYPTION_JWT_PUBLIC_KEYSET: "<keyset>"
I purged the above volumes and created new ones; now I am getting the below error.
hatchet-engine
k logs hatchet-engine-7d4648dbbc-6dvzs
2024/08/07 15:33:27 Loading server config from
2024/08/07 15:33:27 Shared file path: server.yaml
2024-08-07T15:33:28.071Z DBG Fetching security alerts for version v0.40.0 service=server
2024-08-07T15:33:28.079Z DBG Error fetching security alerts: ERROR: relation "SecurityCheckIdent" does not exist (SQLSTATE 42P01) service=server
2024-08-07T15:33:28.08Z DBG subscribing to queue: event_processing_queue_v2 service=events-controller
2024/08/07 15:33:28 engine failure: could not run with config: could not create rebalance controller partitions job: could not create engine partition: ERROR: relation "ControllerPartition" does not exist (SQLSTATE 42P01)
hatchet-postgres-db
2024-08-07 15:52:46.901 GMT [1] LOG: database system is ready to accept connections
2024-08-07 15:53:05.215 GMT [174] ERROR: relation "_prisma_migrations" does not exist at character 28
2024-08-07 15:53:05.215 GMT [174] STATEMENT: SELECT migration_name FROM _prisma_migrations ORDER BY started_at DESC LIMIT 1;
2024-08-07 15:55:29.814 GMT [384] ERROR: relation "_prisma_migrations" does not exist at character 28
2024-08-07 15:55:29.814 GMT [384] STATEMENT: SELECT migration_name FROM _prisma_migrations ORDER BY started_at DESC LIMIT 1;
2024-08-07 15:55:52.901 GMT [422] ERROR: relation "SecurityCheckIdent" does not exist at character 52
2024-08-07 15:55:52.901 GMT [422] STATEMENT: -- name: GetSecurityCheckIdent :one
SELECT id FROM "SecurityCheckIdent" LIMIT 1
2024-08-07 15:55:52.914 GMT [445] ERROR: relation "SecurityCheckIdent" does not exist at character 52
2024-08-07 15:55:52.914 GMT [445] STATEMENT: -- name: GetSecurityCheckIdent :one
SELECT id FROM "SecurityCheckIdent" LIMIT 1
2024-08-07 15:55:54.157 GMT [468] ERROR: relation "SecurityCheckIdent" does not exist at character 52
2024-08-07 15:55:54.157 GMT [468] STATEMENT: -- name: GetSecurityCheckIdent :one
SELECT id FROM "SecurityCheckIdent" LIMIT 1
2024-08-07 15:55:54.160 GMT [468] ERROR: relation "ControllerPartition" does not exist at character 53
2024-08-07 15:55:54.160 GMT [468] STATEMENT: -- name: CreateControllerPartition :one
INSERT INTO "ControllerPartition" ("id", "createdAt", "lastHeartbeat")
VALUES ($1::text, NOW(), NOW())
ON CONFLICT DO NOTHING
RETURNING id, "createdAt", "updatedAt", "lastHeartbeat"
2024-08-07 15:55:56.117 GMT [512] ERROR: relation "SecurityCheckIdent" does not exist at character 52
2024-08-07 15:55:56.117 GMT [512] STATEMENT: -- name: GetSecurityCheckIdent :one
SELECT id FROM "SecurityCheckIdent" LIMIT 1
2024-08-07 15:55:56.121 GMT [512] ERROR: relation "ControllerPartition" does not exist at character 53
2024-08-07 15:55:56.121 GMT [512] STATEMENT: -- name: CreateControllerPartition :one
INSERT INTO "ControllerPartition" ("id", "createdAt", "lastHeartbeat")
VALUES ($1::text, NOW(), NOW())
ON CONFLICT DO NOTHING
RETURNING id, "createdAt", "updatedAt", "lastHeartbeat"
2024-08-07 15:55:56.141 GMT [514] LOG: could not receive data from client: Connection reset by peer
@abelanger5, please provide a solution for the hatchet-postgres-db issue.
Hey @aaronlewisblenchal and @chaitanyakoodoo, the issue is that the migration seems to have failed. There is a migration process that runs as part of the Helm upgrade. This can be tricky to catch because the default delete policy on the Helm hook removes the job almost immediately. If you set the following value in values.yaml, you'll be able to catch the migration failure:
debug: true
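To make that concrete, here is a minimal sketch of the relevant values.yaml keys, assuming a top-level debug flag as described above; the migrationJob block mirrors the one used later in this thread:

# values.yaml (sketch): keep the migration job around so its logs can be inspected
debug: true

engine:
  migrationJob:
    enabled: true

Once the job is retained, its logs should show why the schema (e.g. the ControllerPartition table) was never created.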
Once you're able to see this container, could you share the logs from the migration container? It will be called something like hatchet-migration-xxxxx
Hey @abelanger5, thanks for your help. We have now installed v0.40.0, details of which are listed below:
NAME                                     IMAGE
caddy-5bdcc8d6f6-vcfht                   caddy:2.7.6-alpine
hatchet-engine-c94bc6bfb-kcjms           ghcr.io/hatchet-dev/hatchet/hatchet-engine:v0.40.0
hatchet-stack-api-7799dd9748-4zmml       ghcr.io/hatchet-dev/hatchet/hatchet-api:v0.40.0
hatchet-stack-api-7799dd9748-9nfnd       ghcr.io/hatchet-dev/hatchet/hatchet-api:v0.40.0
hatchet-stack-frontend-f5f98fcf8-dpv8x   ghcr.io/hatchet-dev/hatchet/hatchet-frontend:v0.40.0
hatchet-stack-postgres-0                 docker.io/bitnami/postgresql:16.2.0-debian-12-r8
hatchet-stack-rabbitmq-0                 docker.io/bitnami/rabbitmq:3.12.13-debian-12-r2
Also, we have added SERVER_GRPC_MAX_MSG_SIZE=2147483648, but we are still receiving the below error upon running npm run worker:
🪓 31866 | 08/09/24, 03:46:50 PM [ERROR/Admin] Error: /WorkflowService/PutWorkflow RESOURCE_EXHAUSTED: Received message larger than max (1752460652 vs 4194304)
🪓 31866 | 08/09/24, 03:46:55 PM [ERROR/Admin] Error: /WorkflowService/PutWorkflow RESOURCE_EXHAUSTED: Received message larger than max (1752460652 vs 4194304)
🪓 31866 | 08/09/24, 03:47:00 PM [ERROR/Admin] Error: /WorkflowService/PutWorkflow RESOURCE_EXHAUSTED: Received message larger than max (1752460652 vs 4194304)
🪓 31866 | 08/09/24, 03:47:05 PM [ERROR/Admin] Error: /WorkflowService/PutWorkflow RESOURCE_EXHAUSTED: Received message larger than max (1752460652 vs 4194304)
🪓 31866 | 08/09/24, 03:47:10 PM [ERROR/Admin] Error: /WorkflowService/PutWorkflow RESOURCE_EXHAUSTED: Received message larger than max (1752460652 vs 4194304)
Can you assist us with this?
Hey @aaronlewisblenchal, apologies for that - I've just released v0.41.2 as latest, which has the env var. This was only available as a YAML config option in v0.40.0. Hopefully that fixes it!
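For anyone else hitting this, a minimal sketch of where that env var can go in the Helm values, mirroring the engine env block shown later in this thread; 2147483648 bytes (2 GiB) is just the value used here, and 4194304 (4 MB) is the gRPC default that appears in the error:

engine:
  env:
    # raise the engine's maximum gRPC message size, in bytes
    SERVER_GRPC_MAX_MSG_SIZE: "2147483648"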
Hey @abelanger5, no worries. We tried upgrading to v0.41.2 as you suggested, but we're still receiving the error when running npm run worker:
version: 1.0.0","time":"2024-08-09T13:48:32.305Z","v":0}
🪓 71136 | 08/09/24, 07:18:32 PM [ERROR/Admin] Error: /WorkflowService/PutWorkflow RESOURCE_EXHAUSTED: Received message larger than max (1752460652 vs 4194304)
🪓 71136 | 08/09/24, 07:18:37 PM [ERROR/Admin] Error: /WorkflowService/PutWorkflow RESOURCE_EXHAUSTED: Received message larger than max (1752460652 vs 4194304)
🪓 71136 | 08/09/24, 07:18:42 PM [ERROR/Admin] Error: /WorkflowService/PutWorkflow RESOURCE_EXHAUSTED: Received message larger than max (1752460652 vs 4194304)
🪓 71136 | 08/09/24, 07:18:47 PM [ERROR/Admin] Error: /WorkflowService/PutWorkflow RESOURCE_EXHAUSTED: Received message larger than max (1752460652 vs 4194304)
🪓 71136 | 08/09/24, 07:18:52 PM [ERROR/Admin] Error: /WorkflowService/PutWorkflow RESOURCE_EXHAUSTED: Received message larger than max (1752460652 vs 4194304)
Anything we might be missing here?
I'm unable to reproduce this; I've just tested against larger payloads. Two things to confirm:
1. Are you running the engine container with v0.41.2?
2. Is there an ingress or proxy in front of the engine? gRPC connects directly to the engine, not the API container.

Hi @abelanger5, we have deployed all services on v0.41.2, and yes, we have an ingress in front of the engine. Please let me know if you need anything else to review.
engine:
  enabled: true
  nameOverride: hatchet-engine
  fullnameOverride: hatchet-engine
  replicaCount: 1
  image:
    repository: "ghcr.io/hatchet-dev/hatchet/hatchet-engine"
    tag: "v0.41.2"
    pullPolicy: "Always"
  migrationJob:
    enabled: true
  service:
    externalPort: 7070
    internalPort: 7070
  commandline:
    command: ["/hatchet/hatchet-engine"]
  deployment:
    annotations:
      app.kubernetes.io/name: hatchet-engine
  serviceAccount:
    create: true
    name: hatchet-engine
  env:
    SERVER_AUTH_COOKIE_SECRETS: "secretvalue"
    SERVER_ENCRYPTION_MASTER_KEYSET: "secretvalue"
    SERVER_ENCRYPTION_JWT_PRIVATE_KEYSET: "secretvalue"
    SERVER_ENCRYPTION_JWT_PUBLIC_KEYSET: "secretvalue"
    SERVER_AUTH_COOKIE_INSECURE: "t"
    SERVER_AUTH_SET_EMAIL_VERIFIED: "t"
    SERVER_LOGGER_LEVEL: "debug"
    SERVER_LOGGER_FORMAT: "console"
    DATABASE_LOGGER_LEVEL: "debug"
    DATABASE_LOGGER_FORMAT: "console"
    SERVER_AUTH_GOOGLE_ENABLED: "f"
    SERVER_AUTH_BASIC_AUTH_ENABLED: "t"
    DATABASE_URL: "postgres://secretvalue:secretvalue@hatchet-stack-postgres:5432/hatchet?sslmode=disable"
    DATABASE_POSTGRES_HOST: "hatchet-stack-postgres"
    DATABASE_POSTGRES_PORT: "5432"
    DATABASE_POSTGRES_USERNAME: "secretvalue"
    DATABASE_POSTGRES_PASSWORD: "secretvalue"
    DATABASE_POSTGRES_DB_NAME: "hatchet"
    DATABASE_POSTGRES_SSL_MODE: "disable"
    SERVER_TASKQUEUE_RABBITMQ_URL: "amqp://hatchet:hatchet@hatchet-stack-rabbitmq:5672/"
    SERVER_AUTH_COOKIE_DOMAIN: "secretvalue.secretvalue.io"
    SERVER_URL: "https://secretvalue.secretvalue.io"
    SERVER_GRPC_BIND_ADDRESS: "0.0.0.0"
    SERVER_GRPC_INSECURE: "false"
    SERVER_GRPC_BROADCAST_ADDRESS: "secretvalue.secretvalue.io:443"
    SERVER_GRPC_MAX_MSG_SIZE: "2147483648"
  ingress:
    enabled: true
    ingressClassName: nginx
    annotations:
      nginx.ingress.kubernetes.io/ssl-redirect: "true"
      nginx.ingress.kubernetes.io/backend-protocol: "GRPC"
      nginx.ingress.kubernetes.io/grpc-backend: "true"
    hosts:
      - host: secretvalue.secretvalue.io
        paths:
          - path: /
            backend:
              serviceName: hatchet-engine
              servicePort: 7070
    tls:
      - hosts:
          - secretvalue.secretvalue.io
        secretName: testcertificate
@abelanger5, please provide a solution for this issue
The config looks correct. A few follow-up questions:
1. Could you share the output of kubectl describe <hatchet-engine-pod>? (with sensitive values redacted)
...
3. The client is calling PutWorkflow with a nearly 2GB workflow. What is the use-case for why the workflow is such a large size?

@abelanger5 I also noticed that the endpoint https://engine.
Name: hatchet-engine-69dc85c95d-dwdp9
Namespace: hatchet
Priority: 0
Service Account: hatchet-engine
Node: <secretvalue>
Start Time: Tue, 13 Aug 2024 10:38:24 +0530
Labels: app.kubernetes.io/instance=hatchet-stack
app.kubernetes.io/name=hatchet-engine
pod-template-hash=69dc85c95d
Annotations: app.kubernetes.io/name: hatchet-engine
cni.projectcalico.org/containerID: 7133ad58076748106af8dcce7318413e00235ab1f6f191022fd7b08840fe7720
cni.projectcalico.org/podIP: 10.0.130.86/32
cni.projectcalico.org/podIPs: 10.0.130.86/32
Status: Running
IP: 10.0.130.86
IPs:
IP: 10.0.130.86
Controlled By: ReplicaSet/hatchet-engine-69dc85c95d
Containers:
engine:
Container ID: containerd://54c7178b0a665754621d4e6c4bbb27e68a215d5a103c2c70ec955d7c83e6e143
Image: ghcr.io/hatchet-dev/hatchet/hatchet-engine:latest
Image ID: ghcr.io/hatchet-dev/hatchet/hatchet-engine@sha256:3bd98ea205d730b7435ed4426a7369a16548bc36d6df9bb758f745baa2281b52
Port: 7070/TCP
Host Port: 0/TCP
Command:
/hatchet/hatchet-engine
State: Running
Started: Tue, 13 Aug 2024 10:38:25 +0530
Ready: True
Restart Count: 0
Limits:
memory: 1Gi
Requests:
cpu: 250m
memory: 1Gi
Liveness: http-get http://:8733/live delay=60s timeout=1s period=5s #success=1 #failure=3
Readiness: http-get http://:8733/ready delay=20s timeout=1s period=5s #success=1 #failure=3
Environment:
DATABASE_LOGGER_FORMAT: console
DATABASE_LOGGER_LEVEL: debug
DATABASE_POSTGRES_DB_NAME: hatchet
DATABASE_POSTGRES_HOST: <secretvalue>
DATABASE_POSTGRES_PASSWORD: <secretvalue>
DATABASE_POSTGRES_PORT: 5432
DATABASE_POSTGRES_SSL_MODE: disable
DATABASE_POSTGRES_USERNAME: <secretvalue>
DATABASE_URL: postgres://<secretvalue>:<secretvalue>@hatchet-stack-postgres:5432/hatchet?sslmode=disable
SERVER_AUTH_BASIC_AUTH_ENABLED: t
SERVER_AUTH_COOKIE_DOMAIN: sandbox-hatchet.<secretvalue>.io
SERVER_AUTH_COOKIE_INSECURE: t
SERVER_AUTH_COOKIE_SECRETS: <secretvalue>
SERVER_AUTH_GOOGLE_ENABLED: f
SERVER_AUTH_SET_EMAIL_VERIFIED: t
SERVER_ENCRYPTION_JWT_PRIVATE_KEYSET: <secretvalue>
SERVER_ENCRYPTION_JWT_PUBLIC_KEYSET: <secretvalue>
SERVER_ENCRYPTION_MASTER_KEYSET: <secretvalue>
SERVER_GRPC_BIND_ADDRESS: 0.0.0.0
SERVER_GRPC_BROADCAST_ADDRESS: engine.<secretvalue>.io:443
SERVER_GRPC_INSECURE: false
SERVER_GRPC_MAX_MSG_SIZE: 2147483648
SERVER_LOGGER_FORMAT: console
SERVER_LOGGER_LEVEL: debug
SERVER_TASKQUEUE_RABBITMQ_URL: amqp://<secretvalue>:<secretvalue>@hatchet-stack-rabbitmq:5672/
SERVER_URL: https://sandbox-hatchet.<secretvalue>.io
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-nj562 (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
kube-api-access-nj562:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 86s default-scheduler Successfully assigned hatchet/hatchet-engine-69dc85c95d-dwdp9 to <secretvalue>
Normal Pulling 86s kubelet Pulling image "ghcr.io/hatchet-dev/hatchet/hatchet-engine:latest"
Normal Pulled 85s kubelet Successfully pulled image "ghcr.io/hatchet-dev/hatchet/hatchet-engine:latest" in 426.680289ms (426.700126ms including waiting)
Normal Created 85s kubelet Created container engine
Normal Started 85s kubelet Started container engine
Hey @abelanger5, regarding point 3: the 2 GB PutWorkflow payload is not something we are passing ourselves; it seems to happen when initialising Hatchet. However, the same setup works fine when we test against the SaaS version of Hatchet.
Thanks, I think I know what's happening - I'm pretty sure this limit is being set on the client side, not the server (I've been trying to recreate it with the Go client, but gRPC clients are not consistent with this type of configuration). I'll test the Typescript SDK later today and make the value configurable there as well if it turns out to be the problem.
The very large payload is an issue in itself; it would be good to track down why this is happening. Could you share a stubbed-out version of the workflow that you're defining?
Hi @abelanger5, I have tested the quickstart steps and I am able to create a workflow using port forwarding, but when I go through the ingress I get the error. Can you please send the correct configuration I need to add? https://docs.hatchet.run/self-hosting/kubernetes-quickstart
If possible, can we connect on a call to resolve the issue? We are still seeing the same repeated RESOURCE_EXHAUSTED errors shown above.
Absolutely, you can grab a time with one of us here: https://cal.com/team/hatchet/founders
Closing this issue as I believe we've tracked down all causes of this error. For reference, this error shows up for the following reasons:
1. Payloads are larger than 4MB, or a step depends on parent step outputs whose combined payload size is larger than 4MB. To avoid this, you can set the env var SERVER_GRPC_MAX_MSG_SIZE on the server. Depending on the client, a corresponding limit may need to be set in the SDKs. In the Python SDK, this corresponds to HATCHET_GRPC_MAX_RECV_MESSAGE_LENGTH and HATCHET_GRPC_MAX_SEND_MESSAGE_LENGTH. I've created https://github.com/hatchet-dev/hatchet-typescript/issues/367 for the Typescript SDK.
2. There is an issue with the SSL configuration between the client and server which can sometimes manifest as a RESOURCE_EXHAUSTED error. The most common cause of this seems to be a Cloudflare proxy which requires SSL, and most users have had luck turning off the CF proxy to avoid this issue. Other causes seem to be setting HATCHET_CLIENT_TLS_STRATEGY=none when TLS is actually required.
3. A connection reset occurs from a different proxy, such as nginx-ingress, caused by an idle timeout when the worker does not receive messages within the proxy's client read timeout. All clients automatically reconnect, so this shouldn't be an issue beyond a warning log in the console.
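As a hedged sketch of point 1, the server-side limit and the matching Python SDK client limits could be wired up as below. The worker block is only a placeholder for however your worker deployment sets its environment, and the 2 GiB value is illustrative:

# Server side: Helm values for the engine
engine:
  env:
    SERVER_GRPC_MAX_MSG_SIZE: "2147483648"

# Client side (hypothetical worker deployment): env vars read by the Python SDK
worker:
  env:
    HATCHET_GRPC_MAX_RECV_MESSAGE_LENGTH: "2147483648"
    HATCHET_GRPC_MAX_SEND_MESSAGE_LENGTH: "2147483648"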
Currently, I am using the following documentation for self-hosting Hatchet: https://docs.hatchet.run/self-hosting
I have followed the steps listed in the documentation, but I'm not able to proceed due to the following error:
[ERROR/Admin] Error: /WorkflowService/PutWorkflow RESOURCE_EXHAUSTED: Received message larger than max (1752460652 vs 4194304)
I have tried using the config that was pushed to the main branch last week, SERVER_GRPC_MAX_MSG_SIZE=2147483648, but the same error still persists.
@abelanger5 @grutt @steebchen