hatchet-dev / hatchet

A distributed, fault-tolerant task queue
https://hatchet.run
MIT License

[ERROR/Admin] Error: /WorkflowService/PutWorkflow RESOURCE_EXHAUSTED: Received message larger than max (1752460652 vs 4194304) #768

Closed · aaronlewisblenchal closed this issue 1 month ago

aaronlewisblenchal commented 3 months ago

Currently, I am using the following documentation for self-hosting Hatchet: https://docs.hatchet.run/self-hosting. I have followed the steps listed there, but I'm not able to proceed due to the following error:

[ERROR/Admin] Error: /WorkflowService/PutWorkflow RESOURCE_EXHAUSTED: Received message larger than max (1752460652 vs 4194304)

I have tried using the config option that was pushed to the main branch last week:

SERVER_GRPC_MAX_MSG_SIZE=2147483648

But the same error still persists.

@abelanger5 @grutt @steebchen
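
An aside on the numbers in that error: 4194304 is gRPC's default 4 MiB message size limit, and 1752460652 happens to be exactly the ASCII bytes "html" read as a big-endian 32-bit length prefix. One plausible reading (not confirmed at this point in the thread) is that the client received an HTML page, such as a proxy error page, where it expected a gRPC frame. A quick way to check the decoding:

printf '%08x\n' 1752460652    # prints 68746d6c
printf '\x68\x74\x6d\x6c\n'   # prints html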

abelanger5 commented 3 months ago

Hi @aaronlewisblenchal, just to double-check, which version of the Hatchet Docker image are you running?

aaronlewisblenchal commented 3 months ago

Hey @abelanger5, I was using v0.32.0 earlier, but realised your changes were merged into main in v0.40.0. I tried installing v0.40.0, but I'm getting the error below:

Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: "/hatchet/hatchet-engine": stat /hatchet/hatchet-engine: no such file or directory: unknown

abelanger5 commented 3 months ago

Ah yeah, we introduced this recently. Could you share how you're installing? I've confirmed that /hatchet/hatchet-engine should exist in the containers.

aaronlewisblenchal commented 3 months ago

Yeah, sure. I have followed the steps mentioned in the Kubernetes quickstart. Pasting the contents of the YAML file below:


api:
  enabled: true
  image:
    repository: ghcr.io/hatchet-dev/hatchet/hatchet-api
    tag: v0.40.0
    pullPolicy: Always
  env:
    SERVER_AUTH_COOKIE_SECRETS: "<secret>"
    SERVER_ENCRYPTION_MASTER_KEYSET: "<keyset>"
    SERVER_ENCRYPTION_JWT_PRIVATE_KEYSET: "<keyset>"
    SERVER_ENCRYPTION_JWT_PUBLIC_KEYSET: "<keyset>"
engine:
  enabled: true
  image:
    repository: ghcr.io/hatchet-dev/hatchet/hatchet-engine
    tag: v0.40.0
    pullPolicy: Always
  env:
    SERVER_AUTH_COOKIE_SECRETS: "<secret>"
    SERVER_ENCRYPTION_MASTER_KEYSET: "<keyset>"
    SERVER_ENCRYPTION_JWT_PRIVATE_KEYSET: "<keyset>"
    SERVER_ENCRYPTION_JWT_PUBLIC_KEYSET: "<keyset>"
aaronlewisblenchal commented 3 months ago

I purged the above volumes and created new ones; now I am getting the below error.

hatchet-engine

k logs hatchet-engine-7d4648dbbc-6dvzs
2024/08/07 15:33:27 Loading server config from
2024/08/07 15:33:27 Shared file path: server.yaml
2024-08-07T15:33:28.071Z DBG Fetching security alerts for version v0.40.0 service=server
2024-08-07T15:33:28.079Z DBG Error fetching security alerts: ERROR: relation "SecurityCheckIdent" does not exist (SQLSTATE 42P01) service=server
2024-08-07T15:33:28.08Z DBG subscribing to queue: event_processing_queue_v2 service=events-controller
2024/08/07 15:33:28 engine failure: could not run with config: could not create rebalance controller partitions job: could not create engine partition: ERROR: relation "ControllerPartition" does not exist (SQLSTATE 42P01)

hatchet-postgres-db

2024-08-07 15:52:46.901 GMT [1] LOG:  database system is ready to accept connections
2024-08-07 15:53:05.215 GMT [174] ERROR:  relation "_prisma_migrations" does not exist at character 28
2024-08-07 15:53:05.215 GMT [174] STATEMENT:  SELECT migration_name FROM _prisma_migrations ORDER BY started_at DESC LIMIT 1;
2024-08-07 15:55:29.814 GMT [384] ERROR:  relation "_prisma_migrations" does not exist at character 28
2024-08-07 15:55:29.814 GMT [384] STATEMENT:  SELECT migration_name FROM _prisma_migrations ORDER BY started_at DESC LIMIT 1;
2024-08-07 15:55:52.901 GMT [422] ERROR:  relation "SecurityCheckIdent" does not exist at character 52
2024-08-07 15:55:52.901 GMT [422] STATEMENT:  -- name: GetSecurityCheckIdent :one
        SELECT id FROM "SecurityCheckIdent" LIMIT 1
2024-08-07 15:55:52.914 GMT [445] ERROR:  relation "SecurityCheckIdent" does not exist at character 52
2024-08-07 15:55:52.914 GMT [445] STATEMENT:  -- name: GetSecurityCheckIdent :one
        SELECT id FROM "SecurityCheckIdent" LIMIT 1
2024-08-07 15:55:54.157 GMT [468] ERROR:  relation "SecurityCheckIdent" does not exist at character 52
2024-08-07 15:55:54.157 GMT [468] STATEMENT:  -- name: GetSecurityCheckIdent :one
        SELECT id FROM "SecurityCheckIdent" LIMIT 1
2024-08-07 15:55:54.160 GMT [468] ERROR:  relation "ControllerPartition" does not exist at character 53
2024-08-07 15:55:54.160 GMT [468] STATEMENT:  -- name: CreateControllerPartition :one
        INSERT INTO "ControllerPartition" ("id", "createdAt", "lastHeartbeat")
        VALUES ($1::text, NOW(), NOW())
        ON CONFLICT DO NOTHING
        RETURNING id, "createdAt", "updatedAt", "lastHeartbeat"
2024-08-07 15:55:56.117 GMT [512] ERROR:  relation "SecurityCheckIdent" does not exist at character 52
2024-08-07 15:55:56.117 GMT [512] STATEMENT:  -- name: GetSecurityCheckIdent :one
        SELECT id FROM "SecurityCheckIdent" LIMIT 1
2024-08-07 15:55:56.121 GMT [512] ERROR:  relation "ControllerPartition" does not exist at character 53
2024-08-07 15:55:56.121 GMT [512] STATEMENT:  -- name: CreateControllerPartition :one
        INSERT INTO "ControllerPartition" ("id", "createdAt", "lastHeartbeat")
        VALUES ($1::text, NOW(), NOW())
        ON CONFLICT DO NOTHING
        RETURNING id, "createdAt", "updatedAt", "lastHeartbeat"
2024-08-07 15:55:56.141 GMT [514] LOG:  could not receive data from client: Connection reset by peer
chaitanyakoodoo commented 3 months ago

@abelanger5, please provide a solution for the hatchet-postgres-db issue.

abelanger5 commented 3 months ago

Hey @aaronlewisblenchal and @chaitanyakoodoo, the issue is that the migration seems to have failed. There is a migration process that runs as part of the Helm upgrade. This can be tricky to catch because the default delete policy on the Helm hook removes the job almost immediately. If you set the following value in values.yaml, you'll be able to catch the migration failure:

debug: true

Once you're able to see this container, could you share the logs from the migration container? It will be called something like hatchet-migration-xxxxx
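
A minimal sketch of catching those logs once debug is enabled (the job name and namespace here are placeholders; use whatever kubectl get jobs reports):

kubectl get jobs -n <namespace>
kubectl logs job/hatchet-migration-xxxxx -n <namespace>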

aaronlewisblenchal commented 3 months ago

Hey @abelanger5, thanks for your help. We have now installed v0.40.0, details of which are listed below:

NAME                                     IMAGE
caddy-5bdcc8d6f6-vcfht                   caddy:2.7.6-alpine
hatchet-engine-c94bc6bfb-kcjms           ghcr.io/hatchet-dev/hatchet/hatchet-engine:v0.40.0
hatchet-stack-api-7799dd9748-4zmml       ghcr.io/hatchet-dev/hatchet/hatchet-api:v0.40.0
hatchet-stack-api-7799dd9748-9nfnd       ghcr.io/hatchet-dev/hatchet/hatchet-api:v0.40.0
hatchet-stack-frontend-f5f98fcf8-dpv8x   ghcr.io/hatchet-dev/hatchet/hatchet-frontend:v0.40.0
hatchet-stack-postgres-0                 docker.io/bitnami/postgresql:16.2.0-debian-12-r8
hatchet-stack-rabbitmq-0                 docker.io/bitnami/rabbitmq:3.12.13-debian-12-r2

Also, we have added SERVER_GRPC_MAX_MSG_SIZE=2147483648, but we are still receiving the below error upon running npm run worker:

🪓 31866 | 08/09/24, 03:46:50 PM  [ERROR/Admin] Error: /WorkflowService/PutWorkflow RESOURCE_EXHAUSTED: Received message larger than max (1752460652 vs 4194304)
🪓 31866 | 08/09/24, 03:46:55 PM  [ERROR/Admin] Error: /WorkflowService/PutWorkflow RESOURCE_EXHAUSTED: Received message larger than max (1752460652 vs 4194304)
🪓 31866 | 08/09/24, 03:47:00 PM  [ERROR/Admin] Error: /WorkflowService/PutWorkflow RESOURCE_EXHAUSTED: Received message larger than max (1752460652 vs 4194304)
🪓 31866 | 08/09/24, 03:47:05 PM  [ERROR/Admin] Error: /WorkflowService/PutWorkflow RESOURCE_EXHAUSTED: Received message larger than max (1752460652 vs 4194304)
🪓 31866 | 08/09/24, 03:47:10 PM  [ERROR/Admin] Error: /WorkflowService/PutWorkflow RESOURCE_EXHAUSTED: Received message larger than max (1752460652 vs 4194304)

Can you assist us with this?

abelanger5 commented 3 months ago

Hey @aaronlewisblenchal, apologies for that - I've just released v0.41.2 as latest, which has the env var. This was only available as a YAML config option in v0.40.0. Hopefully that fixes it!

aaronlewisblenchal commented 3 months ago

Hey @abelanger5, no worries. We tried upgrading to v0.41.2 as you suggested, but we're still receiving the error when running npm run worker:

version: 1.0.0","time":"2024-08-09T13:48:32.305Z","v":0}
🪓 71136 | 08/09/24, 07:18:32 PM  [ERROR/Admin] Error: /WorkflowService/PutWorkflow RESOURCE_EXHAUSTED: Received message larger than max (1752460652 vs 4194304)
🪓 71136 | 08/09/24, 07:18:37 PM  [ERROR/Admin] Error: /WorkflowService/PutWorkflow RESOURCE_EXHAUSTED: Received message larger than max (1752460652 vs 4194304)
🪓 71136 | 08/09/24, 07:18:42 PM  [ERROR/Admin] Error: /WorkflowService/PutWorkflow RESOURCE_EXHAUSTED: Received message larger than max (1752460652 vs 4194304)
🪓 71136 | 08/09/24, 07:18:47 PM  [ERROR/Admin] Error: /WorkflowService/PutWorkflow RESOURCE_EXHAUSTED: Received message larger than max (1752460652 vs 4194304)
🪓 71136 | 08/09/24, 07:18:52 PM  [ERROR/Admin] Error: /WorkflowService/PutWorkflow RESOURCE_EXHAUSTED: Received message larger than max (1752460652 vs 4194304)

Anything we might be missing here?

abelanger5 commented 3 months ago

I'm unable to reproduce this; I've just tested against larger payloads. Two things to confirm:

  1. Are you running the engine container with v0.41.2? gRPC connects directly to the engine, not the API container.
  2. Do you have a proxy or ingress sitting in front of the engine service? (A quick way to check what's answering at that address is sketched below.)
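
A quick check for question 2 (a sketch; substitute the host from SERVER_GRPC_BROADCAST_ADDRESS): request the engine's public address over plain HTTPS and see what answers. A proxy returning an HTML error page instead of passing gRPC through would be consistent with the "html" decoding noted earlier in the thread.

curl -sSi https://engine.example.com/ | head -n 20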
chaitanyakoodoo commented 2 months ago

Hi @abelanger5, we have deployed all services on v0.41.2, and yes, we have an ingress in front of the engine. Please let me know if you need anything else to review.

engine:
  enabled: true
  nameOverride: hatchet-engine
  fullnameOverride: hatchet-engine
  replicaCount: 1
  image:
    repository: "ghcr.io/hatchet-dev/hatchet/hatchet-engine"
    tag: "v0.41.2"
    pullPolicy: "Always"
  migrationJob:
    enabled: true
  service:
    externalPort: 7070
    internalPort: 7070
  commandline:
    command: ["/hatchet/hatchet-engine"]
  deployment:
    annotations:
      app.kubernetes.io/name: hatchet-engine
  serviceAccount:
    create: true
    name: hatchet-engine
  env:
    SERVER_AUTH_COOKIE_SECRETS: "secretvalue"
    SERVER_ENCRYPTION_MASTER_KEYSET: "secretvalue"
    SERVER_ENCRYPTION_JWT_PRIVATE_KEYSET: "secretvalue"
    SERVER_ENCRYPTION_JWT_PUBLIC_KEYSET: "secretvalue"
    SERVER_AUTH_COOKIE_INSECURE: "t"
    SERVER_AUTH_SET_EMAIL_VERIFIED: "t"
    SERVER_LOGGER_LEVEL: "debug"
    SERVER_LOGGER_FORMAT: "console"
    DATABASE_LOGGER_LEVEL: "debug"
    DATABASE_LOGGER_FORMAT: "console"
    SERVER_AUTH_GOOGLE_ENABLED: "f"
    SERVER_AUTH_BASIC_AUTH_ENABLED: "t"
    DATABASE_URL: "postgres://secretvalue:secretvalue@hatchet-stack-postgres:5432/hatchet?sslmode=disable"
    DATABASE_POSTGRES_HOST: "hatchet-stack-postgres"
    DATABASE_POSTGRES_PORT: "5432"
    DATABASE_POSTGRES_USERNAME: "secretvalue"
    DATABASE_POSTGRES_PASSWORD: "secretvalue"
    DATABASE_POSTGRES_DB_NAME: "hatchet"
    DATABASE_POSTGRES_SSL_MODE: "disable"
    SERVER_TASKQUEUE_RABBITMQ_URL: "amqp://hatchet:hatchet@hatchet-stack-rabbitmq:5672/"
    SERVER_AUTH_COOKIE_DOMAIN: "secretvalue.secretvalue.io"
    SERVER_URL: "https://secretvalue.secretvalue.io"
    SERVER_GRPC_BIND_ADDRESS: "0.0.0.0"
    SERVER_GRPC_INSECURE: "false"
    SERVER_GRPC_BROADCAST_ADDRESS: "secretvalue.secretvalue.io:443"
    SERVER_GRPC_MAX_MSG_SIZE: "2147483648"
  ingress:
    enabled: true
    ingressClassName: nginx
    annotations:
      nginx.ingress.kubernetes.io/ssl-redirect: "true"
      nginx.ingress.kubernetes.io/backend-protocol: "GRPC"
      nginx.ingress.kubernetes.io/grpc-backend: "true"

    hosts:
      - host: secretvalue.secretvalue.io
        paths:
          - path: /
            backend:
              serviceName: hatchet-engine
              servicePort: 7070
    tls:
      - hosts:
          - secretvalue.secretvalue.io
        secretName: testcertificate
chaitanyakoodoo commented 2 months ago

@abelanger5, please provide a solution for this issue

version: 1.0.0","time":"2024-08-09T13:48:32.305Z","v":0}
🪓 71136 | 08/09/24, 07:18:32 PM  [ERROR/Admin] Error: /WorkflowService/PutWorkflow RESOURCE_EXHAUSTED: Received message larger than max (1752460652 vs 4194304)
🪓 71136 | 08/09/24, 07:18:37 PM  [ERROR/Admin] Error: /WorkflowService/PutWorkflow RESOURCE_EXHAUSTED: Received message larger than max (1752460652 vs 4194304)
🪓 71136 | 08/09/24, 07:18:42 PM  [ERROR/Admin] Error: /WorkflowService/PutWorkflow RESOURCE_EXHAUSTED: Received message larger than max (1752460652 vs 4194304)
🪓 71136 | 08/09/24, 07:18:47 PM  [ERROR/Admin] Error: /WorkflowService/PutWorkflow RESOURCE_EXHAUSTED: Received message larger than max (1752460652 vs 4194304)
🪓 71136 | 08/09/24, 07:18:52 PM  [ERROR/Admin] Error: /WorkflowService/PutWorkflow RESOURCE_EXHAUSTED: Received message larger than max (1752460652 vs 4194304)

abelanger5 commented 2 months ago

The config looks correct. A few follow-up questions:

  1. Could you share the output of kubectl describe <hatchet-engine-pod>? (with sensitive values redacted)
  2. I am also wondering whether this is due to a max body size constraint on the NGINX gRPC proxy. Perhaps you can try increasing the NGINX max body size? (One way to do this is sketched after this list.)
  3. The logs indicate that you are calling PutWorkflow with a nearly 2 GB workflow. What is the use case that makes the workflow so large?
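
On point 2: for ingress-nginx, the request body cap is commonly raised per-Ingress with the proxy-body-size annotation. A minimal sketch, assuming the ingress is named hatchet-engine as in the values above ("0" disables the limit entirely):

kubectl annotate ingress hatchet-engine \
  nginx.ingress.kubernetes.io/proxy-body-size="0" --overwrite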
chaitanyakoodoo commented 2 months ago

@abelanger5 I also noticed that the endpoint https://engine..io/ returns '403 Forbidden' when I try to access it from a browser.

  1. output of kubectl describe
    Name:             hatchet-engine-69dc85c95d-dwdp9
    Namespace:        hatchet
    Priority:         0
    Service Account:  hatchet-engine
    Node:             <secretvalue>
    Start Time:       Tue, 13 Aug 2024 10:38:24 +0530
    Labels:           app.kubernetes.io/instance=hatchet-stack
                  app.kubernetes.io/name=hatchet-engine
                  pod-template-hash=69dc85c95d
    Annotations:      app.kubernetes.io/name: hatchet-engine
                  cni.projectcalico.org/containerID: 7133ad58076748106af8dcce7318413e00235ab1f6f191022fd7b08840fe7720
                  cni.projectcalico.org/podIP: 10.0.130.86/32
                  cni.projectcalico.org/podIPs: 10.0.130.86/32
    Status:           Running
    IP:               10.0.130.86
    IPs:
    IP:           10.0.130.86
    Controlled By:  ReplicaSet/hatchet-engine-69dc85c95d
    Containers:
    engine:
    Container ID:  containerd://54c7178b0a665754621d4e6c4bbb27e68a215d5a103c2c70ec955d7c83e6e143
    Image:         ghcr.io/hatchet-dev/hatchet/hatchet-engine:latest
    Image ID:      ghcr.io/hatchet-dev/hatchet/hatchet-engine@sha256:3bd98ea205d730b7435ed4426a7369a16548bc36d6df9bb758f745baa2281b52
    Port:          7070/TCP
    Host Port:     0/TCP
    Command:
      /hatchet/hatchet-engine
    State:          Running
      Started:      Tue, 13 Aug 2024 10:38:25 +0530
    Ready:          True
    Restart Count:  0
    Limits:
      memory:  1Gi
    Requests:
      cpu:      250m
      memory:   1Gi
    Liveness:   http-get http://:8733/live delay=60s timeout=1s period=5s #success=1 #failure=3
    Readiness:  http-get http://:8733/ready delay=20s timeout=1s period=5s #success=1 #failure=3
    Environment:
      DATABASE_LOGGER_FORMAT:                console
      DATABASE_LOGGER_LEVEL:                 debug
      DATABASE_POSTGRES_DB_NAME:             hatchet
      DATABASE_POSTGRES_HOST:                <secretvalue>
      DATABASE_POSTGRES_PASSWORD:            <secretvalue>
      DATABASE_POSTGRES_PORT:                5432
      DATABASE_POSTGRES_SSL_MODE:            disable
      DATABASE_POSTGRES_USERNAME:            <secretvalue>
      DATABASE_URL:                          postgres://<secretvalue>:<secretvalue>@hatchet-stack-postgres:5432/hatchet?sslmode=disable
      SERVER_AUTH_BASIC_AUTH_ENABLED:        t
      SERVER_AUTH_COOKIE_DOMAIN:             sandbox-hatchet.<secretvalue>.io
      SERVER_AUTH_COOKIE_INSECURE:           t
      SERVER_AUTH_COOKIE_SECRETS:            <secretvalue>
      SERVER_AUTH_GOOGLE_ENABLED:            f
      SERVER_AUTH_SET_EMAIL_VERIFIED:        t
      SERVER_ENCRYPTION_JWT_PRIVATE_KEYSET:  <secretvalue>
      SERVER_ENCRYPTION_JWT_PUBLIC_KEYSET:   <secretvalue>
      SERVER_ENCRYPTION_MASTER_KEYSET:       <secretvalue>
      SERVER_GRPC_BIND_ADDRESS:              0.0.0.0
      SERVER_GRPC_BROADCAST_ADDRESS:         engine.<secretvalue>.io:443
      SERVER_GRPC_INSECURE:                  false
      SERVER_GRPC_MAX_MSG_SIZE:              2147483648
      SERVER_LOGGER_FORMAT:                  console
      SERVER_LOGGER_LEVEL:                   debug
      SERVER_TASKQUEUE_RABBITMQ_URL:         amqp://<secretvalue>:<secretvalue>@hatchet-stack-rabbitmq:5672/
      SERVER_URL:                            https://sandbox-hatchet.<secretvalue>.io
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-nj562 (ro)
    Conditions:
    Type              Status
    Initialized       True
    Ready             True
    ContainersReady   True
    PodScheduled      True
    Volumes:
    kube-api-access-nj562:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    QoS Class:                   Burstable
    Node-Selectors:              <none>
    Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
    Events:
    Type    Reason     Age   From               Message
    ----    ------     ----  ----               -------
    Normal  Scheduled  86s   default-scheduler  Successfully assigned hatchet/hatchet-engine-69dc85c95d-dwdp9 to <secretvalue>
    Normal  Pulling    86s   kubelet            Pulling image "ghcr.io/hatchet-dev/hatchet/hatchet-engine:latest"
    Normal  Pulled     85s   kubelet            Successfully pulled image "ghcr.io/hatchet-dev/hatchet/hatchet-engine:latest" in 426.680289ms (426.700126ms including waiting)
    Normal  Created    85s   kubelet            Created container engine
    Normal  Started    85s   kubelet            Started container engine
aaronlewisblenchal commented 2 months ago

Hey @abelanger5, regarding point 3: the ~2 GB PutWorkflow payload is not something we are passing; it seems to happen when initialising Hatchet. However, the same setup seems to work fine when we test it against the SaaS version of Hatchet.

abelanger5 commented 2 months ago

Thanks, I think I know what's happening - I'm pretty sure this limit is being set on the client side, not the server (I've been trying to recreate this with the Go client, but gRPC clients are not consistent with this type of configuration). I'll test out the Typescript SDK later today and make the value configurable there as well if it turns out to be the problem.

The very large payload is an issue; it would be good to track down why it is happening. Could you share a stubbed-out version of the workflow that you're defining?
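
One way to separate a client-side cap from proxy behaviour (a sketch, using the service name from the values above): port-forward straight to the engine, point the worker at localhost, and re-run. If the error disappears, the problem sits between the client and the engine rather than in the engine itself.

kubectl port-forward svc/hatchet-engine 7070:7070
# in another shell, with the client configured to talk to localhost:7070:
npm run worker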

chaitanyakoodoo commented 2 months ago

Hi @abelanger5, I have tested the quick-setup steps (https://docs.hatchet.run/self-hosting/kubernetes-quickstart). I am able to create a workflow using port forwarding, but when I try using the ingress, I get the error below. Can you please send the correct configuration I need to add?

If possible, can we connect on a call to resolve the issue?

🪓 71136 | 08/09/24, 07:18:32 PM  [ERROR/Admin] Error: /WorkflowService/PutWorkflow RESOURCE_EXHAUSTED: Received message larger than max (1752460652 vs 4194304)
🪓 71136 | 08/09/24, 07:18:37 PM  [ERROR/Admin] Error: /WorkflowService/PutWorkflow RESOURCE_EXHAUSTED: Received message larger than max (1752460652 vs 4194304)
🪓 71136 | 08/09/24, 07:18:42 PM  [ERROR/Admin] Error: /WorkflowService/PutWorkflow RESOURCE_EXHAUSTED: Received message larger than max (1752460652 vs 4194304)
🪓 71136 | 08/09/24, 07:18:47 PM  [ERROR/Admin] Error: /WorkflowService/PutWorkflow RESOURCE_EXHAUSTED: Received message larger than max (1752460652 vs 4194304)
🪓 71136 | 08/09/24, 07:18:52 PM  [ERROR/Admin] Error: /WorkflowService/PutWorkflow RESOURCE_EXHAUSTED: Received message larger than max (1752460652 vs 4194304)

abelanger5 commented 2 months ago

Absolutely, you can grab a time with one of us here: https://cal.com/team/hatchet/founders

abelanger5 commented 1 month ago

Closing this issue as I believe we've tracked down all causes of this error. For reference, this error shows up for the following reasons:

  1. Payloads are larger than 4MB, or a step depends on parent step outputs whose combined payload size is larger than 4MB. To avoid this, you can set the env var SERVER_GRPC_MAX_MSG_SIZE on the server. Depending on the client, a corresponding limit may need to be set in the SDKs: in the Python SDK, this corresponds to HATCHET_GRPC_MAX_RECV_MESSAGE_LENGTH and HATCHET_GRPC_MAX_SEND_MESSAGE_LENGTH (see the combined sketch after this list). I've created https://github.com/hatchet-dev/hatchet-typescript/issues/367 for the Typescript SDK.

  2. There is an issue with the SSL configuration between the client and server which can sometimes manifest as a RESOURCE_EXHAUSTED error. The most common cause of this seems to be a Cloudflare proxy which requires SSL, and most users have had luck turning off the CF proxy to avoid this issue. Other causes seem to be setting HATCHET_CLIENT_TLS_STRATEGY=none when TLS is actually required.

  3. A connection reset occurs from a different proxy, such as nginx-ingress, caused by an idle timeout when the worker does not receive messages within the proxy's client read timeout. All clients will automatically reconnect, so this shouldn't be an issue, beyond a warning log in the console.
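
Putting point 1 together as a sketch (the byte values are illustrative, not recommendations; SERVER_GRPC_MAX_MSG_SIZE goes on the engine, and the HATCHET_GRPC_* variables on a Python SDK client):

# server side (hatchet-engine):
export SERVER_GRPC_MAX_MSG_SIZE=104857600              # 100 MiB
# client side (Python SDK):
export HATCHET_GRPC_MAX_RECV_MESSAGE_LENGTH=104857600
export HATCHET_GRPC_MAX_SEND_MESSAGE_LENGTH=104857600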