hasura / graphql-engine

Blazing fast, instant realtime GraphQL APIs on your DB with fine grained access control, also trigger webhooks on database events.
https://hasura.io
Apache License 2.0

Memory leak #9592

Open sergeimonakhov opened 1 year ago

sergeimonakhov commented 1 year ago

Version Information

Server Version: v2.22.1
CLI Version (for CLI related issue):

Environment

On-premises

What is the current behaviour?

We've been seeing memory leaks on v2.19.x and higher.

Upgrading to the latest version did not solve the problem; version v2.8.4 works correctly.

What is the expected behaviour?

To work without memory leaks

How to reproduce the issue?

  1. Upgrade to v2.19.x or higher.

Screenshots or Screencast

(screenshot: memory usage graph, 2023-04-17)

OOM:

(screenshot: OOM event)

Please provide any traces or logs that could help here.

Any possible solutions/workarounds you're aware of?

Keywords

memory leak

tirumaraiselvan commented 1 year ago

Hi @sergeimonakhov,

What kind of workloads were you running in Hasura during the timeframe shown in your screenshot?

NikPaushkin commented 1 year ago

Hi @tirumaraiselvan. Mutations, queries, and subscriptions against PostgreSQL are the only workloads we run in the background.

NikPaushkin commented 1 year ago

@tirumaraiselvan The only thing we added compared to the previous version is remote schema permissions. I've just noticed that Hasura stores them all as one huge string in the database, and the Hasura UI freezes more and more with each new permission rule added. Given that architecture, there could well be related memory issues on the backend.

martykan commented 1 year ago

Something is happening for sure. This is a Hasura instance deployed on Railway.app:

(screenshot: memory usage graph, 2023-05-01)

After restarting the memory usage dropped under 300 MB.

tirumaraiselvan commented 1 year ago

> @tirumaraiselvan The only thing we added compared to the previous version is remote schema permissions. I've just noticed that Hasura stores them all as one huge string in the database, and the Hasura UI freezes more and more with each new permission rule added. Given that architecture, there could well be related memory issues on the backend.

@NikPaushkin Hi, thanks for the additional info. Can you confirm that without remote schema permissions you do not see a constant increase in memory (on v2.22)? The UI issue might be unrelated and purely a console issue.

NikPaushkin commented 1 year ago

@tirumaraiselvan No, I can't confirm that now. We're on 2.24.1 now, without those remote schema permission changes, and it's still leaking about 500 MB every day.

cheets commented 1 year ago

Hi, we have the same issue. It's affecting multiple clusters where we have Hasura deployed, and it's been happening since the 2.2x.x versions. Our latest memory leak test was with a new cluster on v2.24.1: the cluster was created from scratch (using Terraform scripts), there was no traffic whatsoever, and it still leaks memory continuously. We have a 512 MB limit on the pod and it restarts when the limit is reached, as you can see from the attached graph. We don't have any remote schemas.

(screenshot: memory usage with 512 MB pod limit)

tjenkinson commented 1 year ago

We're on 2.25.0 and also seeing a memory leak and then OOM causing restarts. We have some remote schemas.

tirumaraiselvan commented 1 year ago

@cheets @tjenkinson Hey folks, just to confirm once again: you had the exact same metadata in previous versions, and it didn't cause memory growth like it does in the newer versions?

tjenkinson commented 1 year ago

Hey @tirumaraiselvan, the metadata may have changed slightly. We were on 2.16.0 before and that appeared to have the same issue.

cheets commented 1 year ago

We are actively developing our API, so some changes occur every week to the metadata. I don't recall anything major though. We have some actions and a couple of event triggers, but these have been in the metadata for a long time.

What is weird is that we see this behavior on our K3s on-premises clusters. However, we are also running the exact same Hasura in Azure AKS and we haven't observed the memory leak there.

tirumaraiselvan commented 1 year ago

@cheets We are unable to reproduce this on idle schema like you mentioned here: https://github.com/hasura/graphql-engine/issues/9592#issuecomment-1551596064

You are saying you can't reproduce this on AKS. Do you want to test with different versions of Hasura on K3s to see if this might be a K3s issue? (It seems like this is being reported on newer versions, so trying something like 2.11 might be a good start.) Also, maybe you can share your scripts so we can reproduce this on our end?
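
For reference, one quick way to A/B test a specific version against an existing database is something like the sketch below; the connection string and tag are placeholders only, so adjust them to your environment:

# Run a specific Hasura version against an existing Postgres to compare memory
# behaviour across versions. The database URL below is a placeholder.
docker run -d --name hasura-v2-11 \
  -p 8080:8080 \
  -e HASURA_GRAPHQL_DATABASE_URL='postgres://user:password@db-host:5432/hasura' \
  -e HASURA_GRAPHQL_ENABLE_CONSOLE=true \
  hasura/graphql-engine:v2.11.0

# Watch the container's memory usage over time.
docker stats hasura-v2-11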

tirumaraiselvan commented 1 year ago

@NikPaushkin @tjenkinson Do you see some kind of leak with no traffic as well?

tjenkinson commented 1 year ago

Our instances are always getting some traffic, so I'm not sure about that, sorry.

tjenkinson commented 1 year ago

I've managed to reliably trigger the leak by repeatedly reloading all remote schemas from the console. Every time I do it, memory goes up slightly.

tirumaraiselvan commented 1 year ago

@tjenkinson Do you also have remote schema permissions?

tjenkinson commented 1 year ago

@tirumaraiselvan this is after logging in with the admin secret

tirumaraiselvan commented 1 year ago

@tjenkinson Is it possible for you to send us your metadata? You can email me at tiru@hasura.io.

tjenkinson commented 1 year ago

Hey @tirumaraiselvan, unfortunately we are not able to do that due to the nature of what it contains. I noticed we also have some webhooks set up on event triggers. Not sure if that could also be a factor 🤷

tirumaraiselvan commented 1 year ago

@tjenkinson Could you send us a smaller version by redacting or removing any sensitive info (it need not even work)? This is just to short-circuit the many metadata-related questions that come up when trying to reproduce/triage such issues.

Just FYI, we are not able to reproduce this by constantly reloading remote schemas. That's why I wanted to know if you have Remote Schema permissions configured in the metadata as well. Is that the case?

dostalradim commented 1 year ago

We have several instances of Hasura, we do not use remote schemas anywhere, and we observe memory leaks as well, even on an instance that is used minimally, once or twice a day. We use event triggers on almost all of them, but on the instance mentioned above, for example, they fire only about twice a day and memory still leaks. The oldest version we observe this on is 2.20.1.

tirumaraiselvan commented 1 year ago

@dostalradim Are you able to share your metadata with us? You can email it to me at tiru@hasura.io (feel free to remove any sensitive info)

dostalradim commented 1 year ago

(screenshot: memory usage graph)

I sent you the metadata of an application that is very rarely visited, and its memory usage graph looks like this. The Hasura version is 2.20.1.

Thank you for investigating.

tirumaraiselvan commented 1 year ago

@dostalradim This is very helpful, thank you. Could you also tell us what kind of activity you have on this deployment? Is it queries, mutations, subscriptions, or metadata operations?

dostalradim commented 1 year ago

We use the app very little, and only Monday to Friday from 07:00 to 15:30. In the time window of the graph I sent, we certainly did not do anything with the metadata; at most we ran a few queries, mutations, and triggers. No one uses it at night or outside these hours, but the memory is still growing. The only thing that talks to Hasura constantly is the health check probe on the /healthz endpoint.

cheets commented 1 year ago

FWIW, we also have k8s probes configured against the /healthz endpoint. We are using the CE Docker image. Configuration:

livenessProbe:
  httpGet:
    path: '/healthz'
    port: http
  initialDelaySeconds: 30
  timeoutSeconds: 3
  periodSeconds: 60
  successThreshold: 1
  failureThreshold: 5
readinessProbe:
  httpGet:
    path: '/healthz'
    port: http
  initialDelaySeconds: 30
  timeoutSeconds: 3
  periodSeconds: 30
  successThreshold: 1
  failureThreshold: 5
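
For anyone trying to reproduce the idle growth reported above outside Kubernetes, a rough equivalent of this probe traffic is a simple curl loop against /healthz. A quick sketch, assuming the server is reachable on localhost:8080 and mirroring the 30-second readiness interval:

# Hit /healthz every 30 seconds, roughly matching the readiness probe above,
# and log a timestamp and HTTP status so probe traffic can be correlated
# with memory growth.
while true; do
  code=$(curl -s -o /dev/null -w '%{http_code}' http://localhost:8080/healthz)
  echo "$(date -Is) healthz=$code"
  sleep 30
done
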
dostalradim commented 1 year ago

Three days ago I deployed Hasura against an empty Postgres, with no application and no access from outside, and the used memory is still increasing. The only thing using Hasura is the liveness probe; could that be the problem? I hope this helps you. The Hasura version is 2.23.0.

(screenshot: memory usage graph)

tirumaraiselvan commented 1 year ago

@dostalradim Thank you...this really helps. We are working on this.

dostalradim commented 1 year ago

And the last graph: empty database, empty Hasura, no probe. Used memory is still increasing.

(screenshot: memory usage graph)

yoeven commented 1 year ago

I have the same issue on v2.17.1:

(screenshot: memory usage graph)

I've now downgraded to v2.15.2.

NikPaushkin commented 1 year ago

> Do you see some kind of leak with no traffic as well?

@tirumaraiselvan It's leaking constantly no matter what. It's strange that you still can't reproduce it.

davidpanic commented 1 year ago

We are also seeing this with some mixed subscription and query traffic.

Along with the high memory usage there are also abnormally high CPU usage spikes every 10 seconds. The sudden drop at the end corresponds to when we restarted the hasura container.

(screenshot: memory and CPU usage graph)

This might be related to #2565 where they mention they also had memory leaks.

That was supposedly "fixed" by #2581, but my hunch is that it would just postpone when the leak starts to be a problem instead of fixing it.

cheets commented 1 year ago

Did another test, this time with a fresh cluster and no traffic, using version 2.27.0-beta.1 with the EE trial activated. Memory reaches the 512 MB limit in about 4.5 days and the pod restarts. I set the GHCRTS=-I0 env variable on the pod to disable the Haskell idle GC. It didn't help; memory leaked from the start and CPU usage rose in a similar pattern. The only traffic towards Hasura was Kubernetes probing the healthz endpoint.

(screenshot: memory usage, v2.27.0-beta.1 EE trial)

jberryman commented 1 year ago

@davidpanic If the server is using a lot of memory, and is relatively idle, then those CPU spikes are expected as garbage collections that we force (we need to do this for the side effects; we don't expect there to be much new garbage). See: https://github.com/search?q=repo%3Ahasura%2Fgraphql-engine%20ourIdleGC&type=code

Right now it's set to 10 seconds, which is fairly arbitrary; it might be better to make it every 30 seconds. Another thing you might consider trying is running with the non-moving GC (starting the server with graphql-engine +RTS --nonmoving-gc -RTS …other_opts…). This is experimental for now (you might experience crashes or bad behavior, depending on the version you're using), but is likely to become the default in future versions of Hasura.

Note: this is not to say that the memory usage that you're seeing is normal, just that the CPU spikes are expected
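
For anyone wanting to try this in Docker, two possible ways of passing those RTS flags are sketched below. Whether they apply depends on your image and version, the connection string and tag are placeholders, and --nonmoving-gc remains experimental as noted above:

# Option 1: override the container command so the RTS flags are passed directly
# (assumes the image's entrypoint passes the command through, as the official
# image appears to; verify for your setup).
docker run -d -p 8080:8080 \
  -e HASURA_GRAPHQL_DATABASE_URL='postgres://user:password@db-host:5432/hasura' \
  hasura/graphql-engine:v2.25.0 \
  graphql-engine +RTS --nonmoving-gc -RTS serve

# Option 2: supply the same flag via the GHCRTS environment variable, the same
# mechanism used earlier in this thread with -I0.
docker run -d -p 8080:8080 \
  -e HASURA_GRAPHQL_DATABASE_URL='postgres://user:password@db-host:5432/hasura' \
  -e GHCRTS='--nonmoving-gc' \
  hasura/graphql-engine:v2.25.0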

cheets commented 1 year ago

Just to update: I tested downgrading to v2.18.0 with our metadata. Basically I removed all the Hasura tables from the DB and started "fresh" with an older version. The memory leak is still present. (screenshot: memory usage, custom image v2.18.0 with metadata)

I also tested v2.26.0 with the plain Hasura image, no metadata, just an empty server. It looks like a small memory leak is happening here as well; memory usage has grown by 40 MB. (screenshot: memory usage, v2.26.0 without metadata)

Normally memory usage stays pretty much stable, so I would say that small growth is still a leak.

jberryman commented 1 year ago

@cheets is this how you're measuring "WSS"?: https://www.brendangregg.com/blog/2018-01-17/measure-working-set-size.html

cheets commented 1 year ago

> @cheets is this how you're measuring "WSS"?: https://www.brendangregg.com/blog/2018-01-17/measure-working-set-size.html

We are using Kubernetes, and for the Grafana dashboards the query is kube_pod_container_resource_requests, which uses kube-state-metrics. So basically we fetch information from the pod about how much memory it requests. The limit is currently 512 MB, and once the pod reaches it, Kubernetes spins up another pod, moves traffic to the new one, and gracefully shuts down the old one that is using too much memory.

So far on Azure AKS we have stable memory usage, a "straight line" so to speak. Our newer clusters on CentOS (+K3s) and AWS EKS (which is CentOS/RHEL based) have this memory leak issue. It's difficult to say whether the base OS is at fault or it's just a coincidence. As we use Terraform to build our clusters, all the code and components we work with are the same across the different clusters.
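
As a cross-check on the Grafana numbers, the kubelet's own view can be compared directly with kubectl top. A sketch, assuming metrics-server is installed; the namespace and label below are placeholders for wherever Hasura runs:

# Spot-check the working-set memory reported by the kubelet, independent of the
# Grafana / kube-state-metrics pipeline.
kubectl top pod -n hasura -l app=hasura --containers

# Or log it over time for later comparison with the dashboard graphs.
while true; do
  echo "$(date -Is) $(kubectl top pod -n hasura -l app=hasura --no-headers)"
  sleep 60
done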

its-a-feature commented 1 year ago

I'm also experiencing this issue in an open source project, Mythic. This is a project I sometimes use to help teach security concepts, but recently I noticed that the Hasura container was running out of memory and getting killed by the host.

Everything is running in docker and docker compose. My Hasura configuration for Docker is extremely simple (https://github.com/its-a-feature/Mythic/blob/master/hasura-docker/Dockerfile) and all of my metadata (https://github.com/its-a-feature/Mythic/tree/master/hasura-docker/metadata) is freely available in case that makes things a little easier.

I have a single page open that's doing some streaming subscriptions. When I restart the Hasura container, memory is at about 420 MB. After 12 hours or so of not doing anything other than streaming subscriptions for static data, memory usage is already up to 1 GB.

I'm happy to offer more insight and answer any questions you might have about this too.

jberryman commented 1 year ago

@its-a-feature Thanks, this sounds really promising. When you have a moment, can you provide steps to reproduce that are as exact as possible? For example, a short shell script that produces the kind of streaming subscription workload you mentioned. I want to make sure that I'm reproducing what you're seeing as closely as possible.

In the meantime, a couple of things you can check for me when OS-reported memory usage gets to whatever you consider high:

its-a-feature commented 1 year ago

Yup! To reproduce do the following:

  1. git clone https://github.com/its-a-feature/Mythic
  2. make (this generates a mythic-cli binary that makes manipulating docker-compose easier)
  3. ./mythic-cli start (this generates a docker-compose file, populates a .env file, builds all of the docker images, and brings mythic online)
  4. ./mythic-cli config get mythic_admin_password (this will be your password to log in)
  5. open a browser and go to https://<ip of the host where you installed Mythic>:7443
  6. log in with username mythic_admin and password from step 4
  7. click the phone icon at the top (tooltip will say Active Callbacks) - this will start the subscriptions
  8. wait and monitor memory with docker stats (you're looking for the mythic_graphql container)

With some extra containers installed, this is the memory usage reported by docker when we saw things climbing up more than they should:

CONTAINER ID   NAME                   CPU %     MEM USAGE / LIMIT    MEM %     NET I/O          BLOCK I/O         PIDS
ad0f2c1e5c1d   mythic_server          1.57%     86.18MiB / 3.82GiB   2.20%     244MB / 228MB    23.9MB / 76.8MB   8
b780557de320   mythic_graphql         3.88%     1.66GiB / 3.82GiB    43.46%    9GB / 593MB      76.8MB / 815kB    18
7b57fcb627d1   mythic_react           0.00%     5.422MiB / 3.82GiB   0.14%     292kB / 464kB    3.91MB / 8.19kB   3
2930934f9379   mythic_postgres        14.15%    329.1MiB / 3.82GiB   8.41%     468MB / 9.06GB   70.2MB / 1.03GB   24

Right now (having been up for only a few hours) I'm seeing:

CONTAINER ID   NAME                   CPU %     MEM USAGE / LIMIT    MEM %     NET I/O          BLOCK I/O         PIDS
6ed08ab957cc   mythic_graphql         4.96%     569.7MiB / 2GiB       27.82%    8.24GB / 83.5MB   20.5MB / 4.78MB   46

(it started out at a mem usage of about 340 MiB)

root@6ed08ab957cc:/# curl http://127.0.0.1:8080/dev/rts_stats | jq '.gc.gcdetails_live_bytes / 1000000, .gc.gcdetails_mem_in_use_bytes / 1000000'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1331    0  1331    0     0  38394      0 --:--:-- --:--:-- --:--:-- 39147
90.803952
217.055232

I'll try to re-run it with that third bullet about the LD_PRELOAD now.
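
In case it helps with a longer run, here is a rough sketch of how the two numbers above could be logged side by side, so the Haskell heap (live bytes) can be compared with what Docker reports. It assumes the container is named mythic_graphql, that curl is available inside it, and that jq is installed on the host; treat it as a sketch rather than anything polished:

# Every 5 minutes, record the GHC live/in-use bytes alongside Docker's memory
# figure for the container, to see which number is actually growing.
while true; do
  rts=$(docker exec mythic_graphql curl -s http://127.0.0.1:8080/dev/rts_stats \
    | jq -r '[.gc.gcdetails_live_bytes, .gc.gcdetails_mem_in_use_bytes] | @csv')
  mem=$(docker stats --no-stream --format '{{.MemUsage}}' mythic_graphql)
  echo "$(date -Is),$rts,$mem"
  sleep 300
done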

its-a-feature commented 1 year ago

@jberryman there's a class going right now that's using this open source project where the mem usage is higher, so maybe those stats will be easier for you to use:

CONTAINER ID   NAME                   CPU %     MEM USAGE / LIMIT    MEM %     NET I/O           BLOCK I/O         PIDS
fa47795cf41e   mythic_graphql         2.59%     1.689GiB / 2GiB      84.46%    5.41GB / 2.36GB   20.8MB / 4.78MB   17

and

root@fa47795cf41e:/# curl http://127.0.0.1:8080/dev/rts_stats | jq '.gc.gcdetails_live_bytes / 1000000, .gc.gcdetails_mem_in_use_bytes / 1000000'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1372    0  1372    0     0   6263      0 --:--:-- --:--:-- --:--:--  6264
1304.790648
1631.584256

Same Hasura image, metadata, etc. This is just one that's been up for over a day (massive difference).

chrismaddalena commented 1 year ago

I've recently upgraded to Hasura v2.28.0 on a development branch for a project I maintain called Ghostwriter. I've had the most recent development build running for 14 days, and I checked the stats:

$ docker ps

CONTAINER ID   IMAGE                             COMMAND                  CREATED       STATUS                 PORTS                                                                                  NAMES
d9a95e0fdb10   ghostwriter_production_graphql    "docker-entrypoint.s…"   14 days ago   Up 14 days (healthy)   0.0.0.0:8080->8080/tcp, :::8080->8080/tcp, 0.0.0.0:9691->9691/tcp, :::9691->9691/tcp   ghostwriter_graphql_engine_1
$ docker stats

CONTAINER ID   NAME                           CPU %     MEM USAGE / LIMIT    MEM %     NET I/O           BLOCK I/O         PIDS
d9a95e0fdb10   ghostwriter_graphql_engine_1   0.33%     263.2MiB / 1.93GiB   13.32%    3.73GB / 4.57GB   40.6MB / 8.19kB   10
$ curl http://172.19.0.6:8080/dev/rts_stats | jq '.gc.gcdetails_live_bytes / 1000000, .gc.gcdetails_mem_in_use_bytes / 1000000'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1315    0  1315    0     0   8594      0 --:--:-- --:--:-- --:--:--  8766
65.990904
137.363456

My Hasura container appears to be stable. The development server has not been heavily used, but the other reports suggest I should see memory usage climbing over 14 days. I'll keep an eye on it and report back if I do something that starts to cause memory usage to increase.

jberryman commented 1 year ago

@its-a-feature thanks! In https://github.com/hasura/graphql-engine/issues/9592#issuecomment-1644228995 we can see the memory usage is due to the haskell heap and associated with live data, which is a helpful place to start. We'll update here when we can try out your repro, thanks again.

jberryman commented 1 year ago

@its-a-feature Your instructions worked well, thanks. Following them, I'm not seeing .gc.gcdetails_live_bytes increasing over time after letting it run for a few hours and checking in on it periodically. The OS-reported memory does grow somewhat, and mimalloc might help you here.

Is there an easy way for me to simulate more accurately what your users are doing? Let me know if you're able to try out mimalloc as well.

This was on v2.30.1, which was the latest version when the scripts pulled the container.

KevinColemanInc commented 1 year ago

(screenshot: memory usage graph)

Maybe less noticeable, but when I upgraded to Hasura 2.30.1, my service started having "OOM" crashes. Not sure if feature bloat is requiring more expensive machines or if it's a bug.

jberryman commented 1 year ago

FYI: v2.34 has a fix for a memory leak that some folks here might have been hitting.

LasaleFamine commented 9 months ago

We are running v2.37 and this is still the result: (screenshot: memory usage graph, 2024-02-15)

We just had a crash yesterday and another a few weeks ago. This is critical; has anyone found a solution?

triracle97 commented 4 months ago

I'm still seeing this on v2.40.1. (screenshot: memory usage graph)

its-a-feature commented 3 months ago

Any traction on this? I'm pretty sure this just crashed two more times in my system, causing out-of-memory issues. I'm using v2.38.1. I'm not using Enterprise, but is there some sort of flag I can flip to give you more information about the error when I encounter it?
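
One thing that may help, based on the earlier exchange with jberryman: the /dev/rts_stats endpoint used above can apparently be enabled on CE by adding developer to HASURA_GRAPHQL_ENABLED_APIS. A sketch only; the env wiring, database URL, and tag below are assumptions to verify against the docs for your version:

# Enable the developer endpoints (in addition to the defaults) so the GC
# statistics discussed earlier in the thread can be read without Enterprise.
docker run -d -p 8080:8080 \
  -e HASURA_GRAPHQL_DATABASE_URL='postgres://user:password@db-host:5432/mythic' \
  -e HASURA_GRAPHQL_ENABLED_APIS='metadata,graphql,config,developer' \
  hasura/graphql-engine:v2.38.1

curl -s http://127.0.0.1:8080/dev/rts_stats \
  | jq '.gc.gcdetails_live_bytes, .gc.gcdetails_mem_in_use_bytes'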

KevinColemanInc commented 3 months ago

I bumped my memory up and the issues went away.

(screenshot: memory usage graph after increasing memory)