pganalyze / collector

pganalyze statistics collector for gathering PostgreSQL metrics and log data
https://pganalyze.com

Memory leak on v0.22.0 #37

Closed - beyondbill closed this issue 1 year ago

beyondbill commented 4 years ago

We are running quay.io/pganalyze/collector:v0.22.0 in Kubernetes and experiencing a memory leak. I haven't tried whether an upgrade would fix it, but I did not see any mention of a memory leak in the existing issues on this repo.

image

lfittl commented 4 years ago

@beyondbill Thanks for the report - that indeed looks like a memory leak.

Upgrading may fix it if the issue is related to Go garbage collector problems - newer collector versions have a newer Go version used for builds.

If that doesn't fix it, it would help us narrow the problem down if you could check whether turning off log collection (PGA_DISABLE_LOGS=1) avoids this memory usage pattern.
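For reference, in a Kubernetes setup one way to do this is to set the environment variable on the collector container - a sketch, assuming a Deployment named pganalyze-collector (adjust to your setup):

# Sketch: disable log collection on the collector (the Deployment name is an assumption)
kubectl set env deployment/pganalyze-collector PGA_DISABLE_LOGS=1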

Thanks, Lukas

beyondbill commented 4 years ago

@lfittl We plan to upgrade to v0.25.1. When will that version be published to the Docker registry?

➜  ~ docker pull quay.io/pganalyze/collector:v0.25.1
Error response from daemon: manifest for quay.io/pganalyze/collector:v0.25.1 not found: manifest unknown: manifest unknown

lfittl commented 4 years ago

@beyondbill Published this morning - you should be good to run docker pull now.

beyondbill commented 4 years ago

@lfittl Perfect! Thanks! I'll let you know whether it fixes the memory leak.

beyondbill commented 4 years ago

image

I upgraded the collector to v0.25.1 with PGA_DISABLE_LOGS=1. However, memory usage still trends upward like a leak. I'll leave it running for a longer period before drawing a conclusion.

beyondbill commented 4 years ago

image

@lfittl I confirmed that the memory leak still exists... We currently put a memory limit on the collector pod in k8s so that it gets killed once memory usage reaches the limit, but we hope to see this issue fixed in the collector.
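For anyone applying the same workaround, a rough sketch with kubectl (the Deployment name and the 250Mi value are only examples, not our exact configuration):

# Example only: cap the collector's memory so Kubernetes restarts it when the limit is reached
kubectl set resources deployment/pganalyze-collector --limits=memory=250Mi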

JordanP commented 4 years ago

Same here. It's definitely leaking memory; we (Gorgias, Scale plan) also work around the leak with a memory limit in Kubernetes.

image

lfittl commented 4 years ago

@JordanP Thanks for the report - we are still struggling to find the root cause here, but are certainly tracking this as an active issue that we keep investigating.

One thing that would be helpful for us to validate: When you set PGA_DISABLE_LOGS=1 on the container (which turns off log parsing), does the memory usage keep increasing the same way?

JordanP commented 4 years ago

@lfittl log parsing is not working anyway, because we are running PG and PGAnalyze in Kubernetes. Should I still try to set PGA_DISABLE_LOGS=1?

lfittl commented 4 years ago

@beyondbill @JordanP We've just released collector version 0.30.0, which fixes a memory leak we found that could occur in certain situations. Could you upgrade and check whether that improves the situation for you?

beyondbill commented 4 years ago

@lfittl Awesome! I deployed v0.30.0 and will let you know after a few days!

beyondbill commented 4 years ago

@lfittl I believe the memory is still leaking on v0.30.0:

image

JordanP commented 4 years ago

Same here, upgraded to 0.30 and memory usage is still growing:

image

Note that we have 2 pganalyze collectors (we have 2 PG databases) and one is not suffering from that leak. Both are deployed and configured the same, but the PG instances serve very different workloads.

msakrejda commented 4 years ago

@JordanP can you tell us more about the nature of the different workloads (and log output)? Are the two lines in the graph you pasted above for the two different databases?

JordanP commented 4 years ago

@uhoh-itsmaciek I don't know exactly what to say. The instance for which pganalyze is not leaking is mostly used for some sort of logging, so it's mostly an append-only workload. The other PG instance is our main one, which serves our application/website.

Yes, the 2 lines in the graph are the memory consumption of the 2 pganalyze containers; each is monitoring a different PG database.

lfittl commented 4 years ago

@JordanP Thanks for the update - I have a suspicion that this might be related to the amount of query text data the collector is handling.

Could you try enabling the reset setting in the Server Settings of the pganalyze console, setting it to every 6 hours:

Screen Shot 2020-06-23 at 12 55 36 AM

Additionally, when using the separate monitoring user, you'll need to create the following helper function (create it as a superuser):

CREATE OR REPLACE FUNCTION pganalyze.reset_stat_statements() RETURNS SETOF void AS
$$
  /* pganalyze-collector */ SELECT * FROM public.pg_stat_statements_reset();
$$ LANGUAGE sql VOLATILE SECURITY DEFINER;
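Once that's in place, you can verify the monitoring role is able to call it - for example via psql (a sketch; the user and database names are placeholders for your monitoring setup):

# Placeholder user/database names - confirms the wrapper resets pg_stat_statements
psql -U your_monitoring_user -d your_database -c "SELECT pganalyze.reset_stat_statements();"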

JordanP commented 4 years ago

@lfittl so I tried this, but no luck. I'm not so sure it's a memory leak anymore.

This is the memory used by the pganalyze collector (v0.30):

image

We have 2 Kubernetes clusters; each exhibits the same behavior. The pganalyze collector restarts roughly every day. The memory limit is set to 250 Mi, which is never actually reached, and Kubernetes just reports the container as having exited:

image

Exit code 139 means, if I'm not mistaken, a segmentation fault.
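For reference, this is how the recorded exit code can be pulled out of the pod status (the pod name is a placeholder):

# Placeholder pod name; 139 = 128 + 11, i.e. the container was killed by SIGSEGV
kubectl get pod <collector-pod> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'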

So it might not be just a memory leak. What's weird is that there's nothing in my container logs besides a lot of Submitted compact activity snapshot successfully and a couple of Could not collect activity for server: failed to upload/send activity snapshot: Post "https://api.pganalyze.com/v2/snapshots/compact": EOF

Thoughts? Thanks!

lfittl commented 4 years ago

@JordanP Thanks for the update!

I'm wondering if you're hitting the bug we fixed in https://github.com/pganalyze/collector/commit/c3d2c3cb49c833d34f9482a9e7ea91071d2eb7ed (released with 0.32.0)

That bug would surface as the collector crashing when it encountered a network error. The backtrace for the error unfortunately doesn't show up reliably, which might explain why you are just seeing the container exit without any explicit message.

Could you attempt updating to the v0.32.0 tag and see whether that fixes it?

JordanP commented 4 years ago

@lfittl I upgraded to 0.32.0, I'll keep you posted.

JordanP commented 4 years ago

It did not help: same crashes, same frequency. I understand this is a hard problem to debug. It trips my monitoring system because processes are not supposed to crash, so I am going to silence this particular alert on this particular process, which means this issue will move off my radar.

Ping me if I can be of any help here in the future.

dbamohsin commented 3 years ago

Hi @lfittl,

I was going to email you directly but this issue seems to fit!

image

We are also seeing pods restart as they keep hitting the memory limit we have set - based on the trajectory of how memory is eaten up by the pganalyze collector, the limit is arbitrary and I think it would eventually hit any limit.

Like @JordanP above, it trips our monitoring too, and ideally we don't want to create an indefinite silence on it, so any further details would be great.

collector version 0.35

Thanks

Mohsin

msakrejda commented 3 years ago

Hi @dbamohsin -- could you try the 0.36 release we just published, when you have a chance? I don't think any of the fixes there would have impacted memory usage much, but it would be nice to verify the behavior of the latest version.

dbamohsin commented 3 years ago

@uhoh-itsmaciek - I preempted this and upgraded this morning. Currently monitoring, so I will report back in the next couple of days. Thanks

dbamohsin commented 3 years ago

Hi @uhoh-itsmaciek, I've verified the behaviour on 0.36:

image

lfittl commented 3 years ago

I wanted to give everyone an update on this issue and where we currently stand. There are three theories as to the root cause of this memory leak, listed in the order in which we are tackling them:

1) Memory leak in the Postgres parser library that the pganalyze collector utilizes - this has been confirmed in a particular edge case, and a fix is available for testing at https://github.com/pganalyze/collector/pull/151 (the parser library also now has a confirmed clean Valgrind leakcheck run)

2) A problem with log processing in the collector retaining too many old log lines that couldn't be correctly associated (we have some logic to ensure that a LOG line followed by e.g. a DETAIL line gets linked together). Whilst previous audits of this code have not shown an issue, we'll take a closer look if (1) doesn't resolve things.

3) A leak elsewhere in the collector codebase that we haven't discovered or reproduced yet.

Separately from this, it's also likely that the newer Go release (1.16) improves memory usage on Linux specifically (where the collector is deployed most of the time), and we plan to build with this Go version in the next release or the one after. See the details in the Go 1.16 release notes: https://golang.org/doc/go1.16#runtime

JordanP commented 3 years ago

Cool ! Thanks for the update !

I don't think 2. is the root cause. I saw growing memory usage even though I did not enable log collection/processing at all.

dbamohsin commented 3 years ago

  1. Memory leak in the Postgres parser library that the pganalyze collector utilizes - this has been confirmed in a particular edge case, and a fix is available for testing at #151 (the parser library also now has a confirmed clean Valgrind leakcheck run)

The memory leak fix does not seem to make a difference to the leak unfortunately.

We applied on Friday 17th around 4pm:

image

lfittl commented 3 years ago

Thanks for the update - appreciate the quick verification!

Two follow-on things to test:

1) We've just released collector version 0.37.0, which is built using Go 1.16 - can you update to the quay.io/pganalyze/collector:v0.37.0 tag and see if that makes any difference? (see the example sketch right after this list)

2) Per @JordanP's report, at least in their situation, it's unlikely that logs are the cause - which means (unless we see two different problems here), we should look for other leaks not yet known.
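As mentioned in (1), for a Kubernetes deployment the tag update could look roughly like this (a sketch - the Deployment and container names are assumptions about your setup):

# Assumed names: a Deployment "pganalyze-collector" with a container named "collector"
kubectl set image deployment/pganalyze-collector collector=quay.io/pganalyze/collector:v0.37.0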

One way to get more data on the root cause would be to check the size of the collector's internal state structure, which gets written to disk on every full snapshot (once every 10 minutes). Could you run this inside the container, a few hours apart, and share the size of the state file?

du -sm /state/pganalyze-collector.state

(since this is stored in memory as well, it gives us a hint whether the problem is the collector state itself, vs something else, like e.g. a runaway goroutine)
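If you're running the collector in Kubernetes, a quick way to run that without opening a shell (the pod name is a placeholder):

# Placeholder pod name - reports the on-disk size of the collector state file in MB
kubectl exec <collector-pod> -- du -sm /state/pganalyze-collector.state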

dbamohsin commented 3 years ago

Hi @lfittl - we have been running 0.37.0 in prod since Monday and unfortunately the result is similar, although the growth is a lot choppier now (as can be seen in the second graph below).

2 day window

image

last 6 hours

image

I will check the size of the collector's internal state today.

lfittl commented 3 years ago

@dbamohsin @JordanP Whilst not directly an update on the memory leak, it might be worth deploying the new collector release 0.38.1, which upgrades to the new query parser version.

The query parser now uses a different mechanism to access the parse trees (essentially it avoids a whole bunch of JSON parsing), which would have shown up as increased memory usage if you are using log-based EXPLAIN (e.g. on GCP).

There may also be other memory-related improvements associated with a few changes in the query parser, so it would be great to see whether this impacts your environment in any way.

dbamohsin commented 3 years ago

Hi @lfittl - we have been running 0.38.1 for a few days now and memory usage does seem a lot smoother:

image

I'll keep an eye on it for another week and let you know for sure.

dbamohsin commented 3 years ago

Hi @lfittl - although the memory situation seems a little better, it still gradually climbs past 100% utilization eventually, so there still looks to be an underlying memory leak. We are now starting to see containers restarting after 10 days. Here is a 7-day view:

image

All collectors show the same behaviour regardless of instance, albeit some leak faster than others.

JordanP commented 1 year ago

FWIW, I can no longer observe the memory leak I had been seeing since 2021, now that I am running v0.46. It seems fixed to me.

lfittl commented 1 year ago

@JordanP Thank you for the update!

I believe we have made sufficient improvements here that I feel comfortable closing this issue. If anyone runs into this again, please feel free to report back / open a new issue.