hasura / graphql-engine

Blazing fast, instant realtime GraphQL APIs on your DB with fine grained access control, also trigger webhooks on database events.
https://hasura.io
Apache License 2.0

Hasura Engine RAM consumption grows indefinitely #10056

Open abseht opened 10 months ago

abseht commented 10 months ago

Version Information

Server Version: v2.33.3

Environment

Kubernetes deployment. Nothing special in charts or values...

What is the current behaviour?

While CPU consumption tracks the actual load, RAM consumption grows indefinitely.

[image: Kibana graph of Hasura pod CPU and memory usage over time]

This behaviour became apparent after the HPA limits were raised; before that, K8s simply killed the pods very frequently. The same behaviour also drives the managed Postgres instance crazy.

[image: graph from the Crunchy database console]

What is the expected behaviour?

Service resource consumption should rise and fall together with the load.

How to reproduce the issue?

Deploy a Hasura instance in K8s, apply intermittent load, and wait for a couple of days.
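If it helps with reproduction, a minimal load-generation sketch might look like the following; the endpoint URL, admin secret, and query are placeholders for whatever your own deployment exposes, not the exact traffic we run.

```python
import os
import random
import time

import requests

# Hypothetical values: point these at your own Hasura deployment.
HASURA_URL = os.environ.get("HASURA_URL", "http://hasura.default.svc:8080/v1/graphql")
ADMIN_SECRET = os.environ.get("HASURA_ADMIN_SECRET", "")

# Placeholder query; replace with something representative of your schema.
QUERY = """
query ProbeLoad {
  __typename
}
"""


def fire_burst(n_requests: int) -> None:
    """Send a burst of identical GraphQL requests."""
    headers = {"x-hasura-admin-secret": ADMIN_SECRET} if ADMIN_SECRET else {}
    for _ in range(n_requests):
        resp = requests.post(HASURA_URL, json={"query": QUERY}, headers=headers, timeout=30)
        resp.raise_for_status()


if __name__ == "__main__":
    # Intermittent load: short bursts separated by idle periods,
    # left running for a couple of days while memory is observed.
    while True:
        fire_burst(random.randint(50, 200))
        time.sleep(random.randint(60, 600))
```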

Any possible solutions/workarounds you're aware of?

We have been struggling with Hasura stability issues for some time. We tried adjusting queries and parameters in our services, but it did not change the picture: pods kept dying. After we significantly raised the HPA limits for Hasura, the pods stopped dying, but our managed database started going into 'failover mode' out of the blue, and the graphs started showing the steady consumption growth pictured above. As of now, the workaround is to kill pods as soon as they become 3 days old.
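In case it helps anyone stuck on the same workaround, here is a rough sketch of the pod-recycling step using the official Kubernetes Python client; the namespace and the app=hasura label selector are assumptions about the deployment, so adjust them to match yours.

```python
from datetime import datetime, timedelta, timezone

from kubernetes import client, config

# Assumptions: the Hasura pods carry an "app=hasura" label and live in the
# "default" namespace; adjust both to match your deployment.
NAMESPACE = "default"
LABEL_SELECTOR = "app=hasura"
MAX_AGE = timedelta(days=3)


def recycle_old_pods() -> None:
    """Delete Hasura pods older than MAX_AGE; the Deployment recreates them."""
    config.load_kube_config()  # use load_incluster_config() when run inside the cluster
    v1 = client.CoreV1Api()
    now = datetime.now(timezone.utc)
    for pod in v1.list_namespaced_pod(NAMESPACE, label_selector=LABEL_SELECTOR).items:
        age = now - pod.metadata.creation_timestamp
        if age > MAX_AGE:
            print(f"deleting {pod.metadata.name} (age {age})")
            v1.delete_namespaced_pod(name=pod.metadata.name, namespace=NAMESPACE)


if __name__ == "__main__":
    recycle_old_pods()
```

Run on a schedule (for example from a CronJob), this keeps any single Hasura pod from living long enough for the memory growth to matter.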

Please consider fixing this. Also, are there reference numbers for average Hasura resource consumption and other metrics?

Thank you!

Keywords

RAM, resource consumption

tirumaraiselvan commented 10 months ago

Hi @abseht

  1. Do you see the same runtime characteristics with 2.32 or earlier?
  2. What kind of workload are you running on your Hasura instance? The more detail you can share, the better for diagnosis.
abseht commented 9 months ago

Hello @tirumaraiselvan !

  1. Yes. We have been using Hasura for a while and saw similar behavior on previous versions; for example, two versions we deployed in the past are 2.12.1 and 2.32.1.
  2. On one end we have 10-ish PostgreSQL databases. Most of them have the PostGIS extension and are managed by Crunchy, and one has dblink. We have a lot of geometry data in our tables. On the other end we have 15-ish services. The regular workload is reading and writing rows with Point, Polygon, and MultiPolygon data types. We also use subscriptions in some of our web applications. Please let me know if I can help you further. Thank you!
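To make the workload shape more concrete, a representative read might look roughly like the sketch below; the parcels table, boundary column, and polygon are invented for illustration, and the exact PostGIS comparison operators exposed depend on the Hasura version and column types.

```python
import requests

# Hypothetical endpoint and schema; substitute your own table and fields.
HASURA_URL = "http://hasura.default.svc:8080/v1/graphql"

# A read that filters rows by geometry intersection with a polygon,
# roughly the shape of query our services issue throughout the day.
QUERY = """
query ParcelsInArea($area: geometry!) {
  parcels(where: {boundary: {_st_intersects: $area}}) {
    id
    boundary
  }
}
"""

# Geometry variables are passed as GeoJSON.
AREA = {
    "type": "Polygon",
    "coordinates": [[[0, 0], [0, 1], [1, 1], [1, 0], [0, 0]]],
}

resp = requests.post(
    HASURA_URL,
    json={"query": QUERY, "variables": {"area": AREA}},
    timeout=30,
)
print(resp.json())
```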
delaurentis commented 5 months ago

Hi @abseht, how did you produce those graphs, and how did you raise the HPA limit for the Hasura container? I wonder if this might help me debug an issue my team is experiencing.

I'm debugging an issue where, when Hasura runs inside Kubernetes on a local development machine (with minikube), the Hasura Admin UI becomes exceedingly slow.

https://github.com/hasura/graphql-engine/discussions/10214

abseht commented 5 months ago

Hello @delaurentis! The top graph is a visualization of basic K8s metrics, built in Kibana. The bottom graph is copied from the console of our database provider, Crunchy. Not sure how to reproduce these metrics in minikube.
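One way to get comparable memory numbers in minikube is to enable the metrics-server addon (minikube addons enable metrics-server) and poll the metrics API directly; the sketch below uses the Kubernetes Python client and assumes the Hasura pods are labelled app=hasura in the default namespace.

```python
from kubernetes import client, config

# Assumptions: the local kubeconfig points at minikube and the Hasura pods
# carry an "app=hasura" label; metrics-server must be enabled for the
# metrics.k8s.io API to be available.
NAMESPACE = "default"
LABEL_SELECTOR = "app=hasura"

config.load_kube_config()
metrics_api = client.CustomObjectsApi()

# The metrics.k8s.io API returns per-container CPU and memory usage samples.
pod_metrics = metrics_api.list_namespaced_custom_object(
    group="metrics.k8s.io",
    version="v1beta1",
    namespace=NAMESPACE,
    plural="pods",
    label_selector=LABEL_SELECTOR,
)

for item in pod_metrics["items"]:
    for container in item["containers"]:
        print(item["metadata"]["name"], container["name"], container["usage"]["memory"])
```

Logging that output over time (or scraping it with Prometheus) should give you something close to the top graph above.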