sergeimonakhov opened this issue 1 year ago
Hi @sergeimonakhov
What kind of workloads are you running in Hasura in the timeframe provided in your screenshot?
Hi @tirumaraiselvan. Mutations, queries, and subscriptions to PostgreSQL are the only workloads we run in the background.
@tirumaraiselvan The only thing we added compared to the previous version is remote schema permissions. I have just noticed that Hasura stores them all as one huge string in the database, and the Hasura UI freezes more and more with each new permission rule added. Given that architecture, there could well be related memory issues on the back end.
Something is happening for sure. This is a Hasura instance deployed at Railway.app
After restarting the memory usage dropped under 300 MB.
@NikPaushkin Hi, thanks for additional info. Can you confirm if without remote schema permissions, you do not see any constant increase in memory (in v2.22)? The UI issue might be unrelated and solely be a console issue.
@tirumaraiselvan No, I can't confirm that now. We are on 2.24.1 now, without those remote schema permission changes, and it's still leaking about 500 MB every day.
Hi, we have the same issue. It's affecting multiple clusters where we have Hasura deployed, and it's been happening since the 2.2x.x versions. Our latest memory leak test was with a new cluster on v2.24.1: the cluster was created from scratch (using Terraform scripts), there was no traffic whatsoever, and it still leaks memory all the time. We have a 512 MB limit on the pod and it restarts when the limit is reached; you can see that in the attached behaviour. We don't have any remote schemas.
We're on 2.25.0 and also seeing a memory leak and then OOM causing restarts. We have some remote schemas.
@cheets @tjenkinson Hey folks, just to confirm once again: you had the exact same metadata in previous versions and it didn't cause memory growth like in the newer versions?
Hey @tirumaraiselvan the metadata may have changed slightly. We were on 2.16.0 before and that appeared to have the same issue
We are actively developing our API, so some changes to the metadata occur every week. I don't recall anything major though. We have some actions and a couple of event triggers, but these have been in the metadata for a long time.
What is weird is that we can see this behavior on our K3S on-premise clusters. However, we are also running the exact same Hasura in Azure AKS and we haven't observed this memory leak there.
@cheets We are unable to reproduce this on idle schema like you mentioned here: https://github.com/hasura/graphql-engine/issues/9592#issuecomment-1551596064
You are saying you can't reproduce this on AKS. Do you want to test with different versions of Hasura on k3s to see if this might be a k3s issue (seems like this is being reported on newer versions so trying something like 2.11 might be a good start)? Also, maybe you can share your scripts so we can reproduce this on our end?
@NikPaushkin @tjenkinson Do you see some kind of leak with no traffic as well?
Our instances are always getting some traffic so not sure on that sorry
I've managed to reliably trigger the leak by repeatedly reloading all remote schemas from the console. Every time I do it, memory goes up slightly.
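For anyone who wants to script this repro instead of clicking through the console, here is a minimal sketch. It assumes the console reload maps roughly onto the v1/metadata reload_metadata call with reload_remote_schemas, and the instance URL and admin secret shown are placeholders:

```bash
# Sketch: repeatedly reload metadata (including remote schemas) to look for
# memory growth. The URL and secret below are placeholders, not real values.
HASURA_URL="http://localhost:8080"
ADMIN_SECRET="myadminsecretkey"

for i in $(seq 1 200); do
  curl -s -o /dev/null -X POST "$HASURA_URL/v1/metadata" \
    -H "Content-Type: application/json" \
    -H "x-hasura-admin-secret: $ADMIN_SECRET" \
    -d '{"type": "reload_metadata", "args": {"reload_remote_schemas": true}}'
  sleep 2
done
```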
@tjenkinson Do you also have remote schema permissions?
@tirumaraiselvan this is after logging in with the admin secret
@tjenkinson Is it possible for you to send us your metadata? You can email me at tiru@hasura.io
Hey @tirumaraiselvan, unfortunately we are not able to do that due to the nature of what it contains. I noticed we also have some webhooks set up on event triggers. Not sure if that could also be a factor 🤷
@tjenkinson Could you send us a smaller version by redacting or removing any sensitive info (it need not even work)? This is just to short-circuit the many metadata-related questions we go through when trying to reproduce/triage such issues.
Just FYI, we are not able to reproduce this by constantly reloading remote schemas. That's why I wanted to know if you have Remote Schema permissions configured in the metadata as well. Is that the case?
We have several instances of Hasura, we do not use remote schemas anywhere, and we observe memory leaks as well, even on an instance that is used minimally, once or twice a day. We use triggers on almost all of them, but on the instance mentioned above they fire only about twice a day and memory still leaks. The oldest version we observe this on is 2.20.1.
@dostalradim Are you able to share your metadata with us? You can email it to me at tiru@hasura.io
(feel free to remove any sensitive info)
I sent you the metadata of an application that is very rarely visited and its memory usage graph looks like this. Hasura version is 2.20.1
Thank you for investigating.
@dostalradim This is very helpful. Thank you. Could you also help us with the kind of activity that you have on this deployment? Is it queries, mutations, subscriptions, metadata related?
We use the app very little and only from Monday to Friday, from 07:00 to 15:30. In the time window of the graph I sent, we certainly did not do anything with the metadata; at most we ran a few queries, mutations and triggers. No one uses it at night or outside those hours, but memory is still growing. The only thing that talks to Hasura constantly is the health check probe on the /healthz endpoint.
FWIW, we also have k8s probes configured against the /healthz endpoint. We are using the CE Docker image. Configuration:
```yaml
livenessProbe:
  httpGet:
    path: '/healthz'
    port: http
  initialDelaySeconds: 30
  timeoutSeconds: 3
  periodSeconds: 60
  successThreshold: 1
  failureThreshold: 5
readinessProbe:
  httpGet:
    path: '/healthz'
    port: http
  initialDelaySeconds: 30
  timeoutSeconds: 3
  periodSeconds: 30
  successThreshold: 1
  failureThreshold: 5
```
Three days ago I deployed Hasura against an empty Postgres, with no application and no access from outside, and used memory is still increasing. The only thing hitting Hasura is the liveness probe; could that be the problem? I hope this helps you. Hasura version is 2.23.0.
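For anyone trying to reproduce this probe-only traffic pattern outside Kubernetes, a minimal sketch (assuming Hasura is reachable on localhost:8080; the 60-second interval is arbitrary):

```bash
# Sketch: mimic a liveness probe hitting /healthz once a minute, so an
# otherwise idle instance receives the same kind of traffic described above.
while true; do
  curl -s -o /dev/null http://localhost:8080/healthz
  sleep 60
done
```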
@dostalradim Thank you...this really helps. We are working on this.
And the last graph, empty database, empty hasura, no probe. Used memory is still increasing.
I have the same issue on v2.17.1.
I've now downgraded to v2.15.2.
Do you see some kind of leak with no traffic as well?
@tirumaraiselvan It's leaking constantly no matter what. It's strange that you still can't reproduce it.
We are also seeing this with some mixed subscription and query traffic.
Along with the high memory usage there are also abnormally high CPU usage spikes every 10 seconds. The sudden drop at the end corresponds to when we restarted the hasura container.
This might be related to #2565 where they mention they also had memory leaks.
That was supposedly "fixed" by #2581, but my guess is that it would just postpone when the leak starts to be a problem instead of fixing it.
Did another test, this time with a fresh cluster and no traffic, using version 2.27.0-beta.1 with the EE trial activated. Memory reaches the 512 MB limit in 4.5 days and the pod restarts. I used the GHCRTS=-I0 env variable on the pod to disable Haskell's idle GC. It didn't help: memory leaked from the start and CPU usage rose in a similar pattern. The only traffic towards Hasura was Kubernetes probing the /healthz endpoint.
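For reference, the idle-GC setting mentioned above is passed to the Haskell runtime through the GHCRTS environment variable; a minimal local sketch (the image tag and database URL are placeholders) looks like:

```bash
# Sketch: run graphql-engine with the Haskell idle GC disabled (-I0),
# mirroring the GHCRTS=-I0 pod env var described above.
docker run -p 8080:8080 \
  -e GHCRTS="-I0" \
  -e HASURA_GRAPHQL_DATABASE_URL="postgres://user:pass@db:5432/postgres" \
  hasura/graphql-engine:v2.27.0
```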
@davidpanic If the server is using a lot of memory and is relatively idle, then those CPU spikes are expected: they are garbage collections that we force (we need to do this for the side effects; we don't expect there to be much new garbage). See: https://github.com/search?q=repo%3Ahasura%2Fgraphql-engine%20ourIdleGC&type=code
Right now it's set to 10 seconds, which is fairly arbitrary. It might be better to make it every 30 seconds. Another thing you might consider trying is running with the non-moving GC (starting the server with graphql-engine +RTS --nonmoving-gc -RTS …other_opts…). This is experimental for now (you might experience crashes or bad behavior, depending on the version you're using), but is likely to become the default in future versions of Hasura.
Note: this is not to say that the memory usage you're seeing is normal, just that the CPU spikes are expected.
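To make the non-moving GC suggestion concrete, one way to pass the flag when running the stock image is to override the container command. A sketch only: the +RTS --nonmoving-gc -RTS form comes from the comment above, while the docker run wrapper, image tag, and database URL are illustrative assumptions:

```bash
# Sketch: start the server under the experimental non-moving GC.
# Only the RTS flags come from the comment above; the rest is illustrative.
docker run -p 8080:8080 \
  -e HASURA_GRAPHQL_DATABASE_URL="postgres://user:pass@db:5432/postgres" \
  hasura/graphql-engine:v2.27.0 \
  graphql-engine +RTS --nonmoving-gc -RTS serve
```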
Just to update: I tested downgrading to v2.18.0 with our metadata. I basically removed all the Hasura tables from the DB and started "fresh" with the older version. The memory leak is still present.
I also tested v2.26.0 with a plain Hasura image, no metadata, just an empty server. It looks like a small memory leak is happening here as well; memory usage has grown by 40 MB:
Normally memory usage stays pretty much stable, so I would say that small growth is still a leak.
@cheets is this how you're measuring "WSS"?: https://www.brendangregg.com/blog/2018-01-17/measure-working-set-size.html
We are using Kubernetes, and for the Grafana dashboards the query is kube_pod_container_resource_requests, which comes from kube-state-metrics. So basically we fetch from the pod how much memory it requests. The limit is currently 512 MB, and once the pod reaches it, Kubernetes spins up another pod, moves traffic to the new one, and gracefully shuts down the old one that is using too much memory.
So far on Azure AKS we have stable memory usage, a "straight line" so to speak. Our newer clusters on CentOS (+K3S) and AWS EKS (which is CentOS/RHEL based) have this memory leak issue. It's difficult to say whether the base OS plays any part in this or it's just a coincidence. As we use Terraform to build our clusters, all the code/components we work on are the same between the different clusters.
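As a cross-check on dashboard numbers like these, the raw cAdvisor working-set metric can be queried from Prometheus directly; a sketch (the Prometheus address and container label value are assumptions):

```bash
# Sketch: ask Prometheus for the container's working-set memory in MiB.
# container_memory_working_set_bytes is the cAdvisor metric; the label
# selector and Prometheus URL are placeholders.
curl -sG 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=container_memory_working_set_bytes{container="graphql-engine"} / 1024 / 1024'
```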
I'm also experiencing this issue in an open-source project, Mythic. This is a project I sometimes use to help teach security concepts, but recently I noticed that the Hasura container was running out of memory and getting killed by the host.
Everything is running in docker and docker compose. My Hasura configuration for Docker is extremely simple (https://github.com/its-a-feature/Mythic/blob/master/hasura-docker/Dockerfile) and all of my metadata (https://github.com/its-a-feature/Mythic/tree/master/hasura-docker/metadata) is freely available in case that makes things a little easier.
I have a single page open that's doing some streaming subscriptions. When I restart the Hasura container, memory is at about 420 MB. After 12 hours or so of not doing anything other than streaming subscriptions for static data, memory usage is already up to 1 GB.
I'm happy to offer more insight and answer any questions you might have about this too.
@its-a-feature thanks, this sounds really promising. When you have a moment can you try to provide as exact as possible steps to reproduce? For example a short shell script that produces the kind of streaming subscription workload you mentioned. I want to make sure that I'm reproducing what you're seeing as closely as possible.
In the meantime, a couple of things you can check for me when OS-reported memory usage gets to whatever you consider high:

```bash
curl http://<your_hasura_instance>/dev/rts_stats | jq '.gc.gcdetails_live_bytes / 1000000, .gc.gcdetails_mem_in_use_bytes / 1000000'
```

Also try running under mimalloc with LD_PRELOAD=mimalloc.so (some v2 release, but less than v2.1.2). The idea here is to see whether memory is ballooning within the malloc implementation itself (which might also be using MADV_FREE; as far as I can tell there is no way to actually monitor memory released to the OS in this way). mimalloc happens to be more keen to free memory to the OS and does not use MADV_FREE until 2.1.2.
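To turn those one-off snapshots into a trend, a small sampling loop can log both numbers over time; a sketch (assumes curl and jq are available wherever the instance is reachable, and the URL is a placeholder):

```bash
# Sketch: record Haskell live bytes and RTS memory held from the OS (both
# in MB) once a minute, to see which of the two is actually growing.
while true; do
  curl -s http://localhost:8080/dev/rts_stats \
    | jq -r '[now, .gc.gcdetails_live_bytes / 1000000, .gc.gcdetails_mem_in_use_bytes / 1000000] | @csv' \
    >> rts_stats.csv
  sleep 60
done
```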
Yup! To reproduce, do the following:

1. git clone https://github.com/its-a-feature/Mythic
2. make (this generates a mythic-cli binary to make manipulating docker-compose easier for people)
3. ./mythic-cli start (this generates a docker-compose file, populates a .env file, builds all of the docker images, and brings Mythic online)
4. ./mythic-cli config get mythic_admin_password (this will be your password to log in)
5. Browse to https://<ip of host where you installed Mythic>:7443 and log in with mythic_admin and the password from step 4
6. Open Active Callbacks - this will start the subscriptions
7. docker stats (you're looking for the mythic_graphql container)

With some extra containers installed, this is the memory usage reported by docker when we saw things climbing up more than they should:
```
CONTAINER ID   NAME              CPU %    MEM USAGE / LIMIT     MEM %    NET I/O          BLOCK I/O         PIDS
ad0f2c1e5c1d   mythic_server     1.57%    86.18MiB / 3.82GiB    2.20%    244MB / 228MB    23.9MB / 76.8MB   8
b780557de320   mythic_graphql    3.88%    1.66GiB / 3.82GiB     43.46%   9GB / 593MB      76.8MB / 815kB    18
7b57fcb627d1   mythic_react      0.00%    5.422MiB / 3.82GiB    0.14%    292kB / 464kB    3.91MB / 8.19kB   3
2930934f9379   mythic_postgres   14.15%   329.1MiB / 3.82GiB    8.41%    468MB / 9.06GB   70.2MB / 1.03GB   24
```
Right now (having had it up for only a few hours) I'm seeing:
```
CONTAINER ID   NAME             CPU %   MEM USAGE / LIMIT   MEM %    NET I/O           BLOCK I/O         PIDS
6ed08ab957cc   mythic_graphql   4.96%   569.7MiB / 2GiB     27.82%   8.24GB / 83.5MB   20.5MB / 4.78MB   46
```
(it started out at a memory usage of about 340 MiB)
```
root@6ed08ab957cc:/# curl http://127.0.0.1:8080/dev/rts_stats | jq '.gc.gcdetails_live_bytes / 1000000, .gc.gcdetails_mem_in_use_bytes / 1000000'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1331    0  1331    0     0  38394      0 --:--:-- --:--:-- --:--:-- 39147
90.803952
217.055232
```
I'll try to re-run it with that third bullet about the LD_PRELOAD now.
@jberryman there's a class going on right now that's using this open source project where the mem usage is higher, so maybe those stats will be easier for you to use:
```
CONTAINER ID   NAME             CPU %   MEM USAGE / LIMIT   MEM %    NET I/O           BLOCK I/O         PIDS
fa47795cf41e   mythic_graphql   2.59%   1.689GiB / 2GiB     84.46%   5.41GB / 2.36GB   20.8MB / 4.78MB   17
```
and
```
root@fa47795cf41e:/# curl http://127.0.0.1:8080/dev/rts_stats | jq '.gc.gcdetails_live_bytes / 1000000, .gc.gcdetails_mem_in_use_bytes / 1000000'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1372    0  1372    0     0   6263      0 --:--:-- --:--:-- --:--:--  6264
1304.790648
1631.584256
```
Same Hasura image, metadata, etc. This is just one that's been up for over a day (massive difference).
I've recently upgraded to Hasura v2.28.0 on a development branch for a project I maintain called Ghostwriter. I've had the most recent development build running for 14 days, and I checked the stats:
```
$ docker ps
CONTAINER ID   IMAGE                            COMMAND                  CREATED       STATUS                 PORTS                                                                                  NAMES
d9a95e0fdb10   ghostwriter_production_graphql   "docker-entrypoint.s…"   14 days ago   Up 14 days (healthy)   0.0.0.0:8080->8080/tcp, :::8080->8080/tcp, 0.0.0.0:9691->9691/tcp, :::9691->9691/tcp   ghostwriter_graphql_engine_1

$ docker stats
CONTAINER ID   NAME                           CPU %   MEM USAGE / LIMIT    MEM %    NET I/O           BLOCK I/O         PIDS
d9a95e0fdb10   ghostwriter_graphql_engine_1   0.33%   263.2MiB / 1.93GiB   13.32%   3.73GB / 4.57GB   40.6MB / 8.19kB   10

$ curl http://172.19.0.6:8080/dev/rts_stats | jq '.gc.gcdetails_live_bytes / 1000000, .gc.gcdetails_mem_in_use_bytes / 1000000'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1315    0  1315    0     0   8594      0 --:--:-- --:--:-- --:--:--  8766
65.990904
137.363456
```
My Hasura container appears to be stable. The development server has not been heavily used, but the other reports suggest I should see memory usage climbing over 14 days. I'll keep an eye on it and report back if I do something that starts to cause memory usage to increase.
@its-a-feature thanks! In https://github.com/hasura/graphql-engine/issues/9592#issuecomment-1644228995 we can see the memory usage is due to the haskell heap and associated with live data, which is a helpful place to start. We'll update here when we can try out your repro, thanks again.
@its-a-feature Your instructions worked well, thanks. Following them, I'm not seeing .gc.gcdetails_live_bytes increasing over time, after letting it run for a few hours and checking in on it periodically. The OS-reported memory does grow somewhat, and mimalloc might help you here.
Is there an easy way for me to simulate more accurately what your users are doing? Let me know if you're able to try out mimalloc as well.
This was on v2.30.1, which was latest when the scripts pulled the container.
Maybe less noticeable, but when I upgraded to Hasura 2.30.1, my service started to have "OOM" crashes. Not sure if feature bloat is requiring more expensive machines or if it's a bug.
FYI: v2.34 has a memory leak fix which some folks here might have been hitting
We are running on v2.37 and this is still the result:
We just had a crash yesterday and another a few weeks ago. This is critical; has anyone found a solution?
I'm still having this on v2.40.1
Any traction on this? I'm pretty sure this just crashed two more times in my system, causing out-of-memory issues. Using v2.38.1. I'm not using Enterprise, but is there some sort of flag I can flip to help provide more information to you about the error when I encounter it?
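Not an official answer, but one data point worth capturing when this happens is the same RTS stats endpoint used earlier in the thread. A sketch: the URL is a placeholder, the admin secret header is only needed if one is set, and whether the /dev endpoints need the developer API enabled in your configuration is an assumption here:

```bash
# Sketch: grab heap stats (live bytes vs. memory held from the OS, in MB)
# when the container's memory looks high, as suggested earlier in the thread.
curl -s -H "x-hasura-admin-secret: $HASURA_ADMIN_SECRET" \
  http://localhost:8080/dev/rts_stats \
  | jq '.gc.gcdetails_live_bytes / 1000000, .gc.gcdetails_mem_in_use_bytes / 1000000'
```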
I bumped my memory up and the issues went away.
Version Information
Server Version: v2.22.1
CLI Version (for CLI related issue):
Environment
On-premises
What is the current behaviour?
We've been seeing memory leaks since version v2.19.x.
Upgrading to the latest version did not solve the problem; version v2.8.4 works correctly.
What is the expected behaviour?
To work without memory leaks
How to reproduce the issue?
Screenshots or Screencast
OOM:
Please provide any traces or logs that could help here.
Any possible solutions/workarounds you're aware of?
Keywords
memory leak