invalid memory allocation, out of shared memory

edib commented 6 years ago

More than 8 months we have been using powa for monitoring for our prod clusters, but after some points our # of transactions passed 3k and we see the invalid memory allocation and out of shared memory errors, many times and without any pattern. At finally stage, postmaster put in i a recovery mode every 4-5 minutes and all prod system technically failed. Fortunately, a friend guided us and we removed the powa extension from shared_preloaded_libraries then all errors was gone. What can be the powa's problem with memory and postmaster crushes?

rjuju commented 6 years ago

Argh, that's bad.

Can you show us the versions of installed extensions? (\dx on the powa database should be enough), any important details about the server (32/64 bits, RAM...).

Also, can you share some relevant logs for these errors, including the executed query that triggered the error?

edib commented 6 years ago

Powa version is 3.1.0, the postgres versions are 9.6.x from 9.6.3 to 9.6.8. Because we moved the master many times from a server to server. OSes are Centos 7 (Centos 7.2s, 7.3s core). Some of our machines are vmware virtual machines (Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz) on Cisco UCS, and RAMs are changing. Lastly, we moved the master to "AMD Opteron(tm) Processor 6376" with 512GB RAM one with ssd disks, one with sas disks, faced same issue on both machines and master and read replicas. For example application runs a join with different parameters, that runs 5 times without error, then sixth one produces error. Queries generally come from hibernate ORM and as far as I can see all are select queries.

Clusters are relatively big. One is 4k transactions, the other is 1k-2k transactions per second. *For the privacy issues, I cannot share all the log lines. But related error lines are below.

// lines in the middle of transactions (I have seen many different numbers.) ERROR: invalid memory alloc request size 2251799813685248

// lines at the end of transactions LOG: duration: 0.011 ms parse S_1: COMMIT ERROR: invalid memory alloc request size 2251799813685248 STATEMENT: COMMIT

// lines at the end of transactions LOG: duration: 0.012 ms parse S_1: COMMIT ERROR: out of shared memory STATEMENT: COMMIT

// This messages for postmaster crash cases FATAL: the database system is in recovery mode

Thanks.

rjuju commented 6 years ago

You only have/had powa extension installed, not pg_qualstats or pg_stat_kcache? pg_stat_kcache had a bug leading to exhaustion of shared memory in the previous version.

Also, you missed powa 3.1.1 which fixed a few problematic bugs which could lead to this issue in case of high TPS rate and/or high number of object added/modified/dropped, see https://github.com/powa-team/powa-archivist/releases/tag/REL_3_1_1

Is the O/S 32 or 64 bits?

Do you have still have a coredump of one of the crash?

Can you reproduce the issue, if possible on a non-production server? If yes, it could be useful to try with powa 3.1.1.

edib commented 6 years ago

Yes, all were installed even hypopg with respect to the documentatiton. Yes, but we use centos packages and we have installed the server recently. Interestingly, the package version in the system is 3.1.1 but extension version in the database is 3.1.0. Why is that so? At the time period, we have restarted the system many times. Operating system definitely 64bit and memories are ranges 32GB to 512GB. We dont have coredump. We did not have time to take coredump. I will try but without production I cannot get high tps rates.

ng-pe commented 6 years ago

Hi @edib ! Can you confirm that the extension pg_qualstats is not installed?

If pg_qualstats v1.0.3 is installed... maybe it is this bug (https://github.com/powa-team/pg_qualstats/issues/20) freshly fixed by this commit (https://github.com/powa-team/pg_qualstats/commit/68c1dacbc57de664af4ad9b6e54a4b804f2d729e)

dtseiler commented 6 years ago

I'm worried that my pre-prod database was being hit by this as well. I added POWA over the weekend ahead of pre-prod stress testing that was going to be done today. Shortly after that kicked off we started seeing a flood of "ERROR: invalid memory alloc" errors. I couldn't even run a pg_dump without immediately hitting this error after less than 2 seconds.

Around 15:20 today I removed the POWA extension and the other related extensions. That's when I noticed that the error stopped around 15:06. Perhaps that's when the stress testing wound down, I'm waiting to confirm timelines with my developers.

I can also now run a pg_dump. It's still running after a few minutes but at least it didn't error out after 2 seconds.

dtseiler commented 6 years ago

FYI these are the versions that were installed for me:

                                     List of installed extensions
        Name        | Version |   Schema   |                        Description
--------------------+---------+------------+-----------------------------------------------------------
 btree_gist         | 1.2     | public     | support for indexing common datatypes in GiST
 hypopg             | 1.1.0   | public     | Hypothetical indexes for PostgreSQL
 pg_qualstats       | 1.0.3   | public     | An extension collecting statistics about quals
 pg_stat_kcache     | 2.0.3   | public     | Kernel statistics gathering
 pg_stat_statements | 1.4     | public     | track execution statistics of all SQL statements executed
 plpgsql            | 1.0     | pg_catalog | PL/pgSQL procedural language
 powa               | 3.1.1   | public     | PostgreSQL Workload Analyser-core
(7 rows)

I dropped powa, hypopg, pg_qualstats and pg_stat_kcache. I kept pg_stat_statements because that was there before as well. I removed the others form the shared_preloaded_libraries GUC and reloaded the conf but not sure if that takes effect without a DB bounce.

dtseiler commented 6 years ago

@ng-pe I looked at the pg_qualstats bug you referenced since my version was the same as the one you mentioned. In my logs I don't see an "out of shared memory" error, just the "invalid memory alloc" errors.

Also I didn't have any core dump or instance crash.

rjuju commented 6 years ago

@dtseiler Yes, this is definitely possible. This bug only existed on 9.6+ servers (I see on your report on pgsql-admin that you are using 9.6.6). Also, it only happen on a high TPS rate, I understand that it was the case for you?

Since it's not a production server, can you try with current HEAD to see if it fixes the problem? I can also make a proper release and wait for updated community packages if you prefer, but it'll probably take some time.

BTW, you can disable pg_qualstats without restarting postgres by setting pg_qualstats.enabled = false in your conf file and reload. It can avoid more pain if you hit any issue with pg_qualstats.

dtseiler commented 6 years ago

Re: high TPS rate, yeah it was during some HA stress-testing we were running. That was the reason I installed POWA in the first place was to monitor the system behavior during this testing.

While this isn't production, it's a pre-prod system that is used for client testing just before production and we try to avoid as much disruption as possible. So right now I can't really use it as a sandbox. I do have a separate sandbox if I can find a way to put high TPS load on there with pg_bench or something.

rjuju commented 6 years ago

Ah, yes of course.

I'll make a new release tomorrow in any case, this bug is critical enough I think.

rdunklau commented 6 years ago

@dtseiler the 1.0.4 version has just been released, thank you !

dtseiler commented 6 years ago

OK. What I might try to do is set up 1.0.3 on my sandbox and see if I can duplicate with pgbench. Then do the same with 1.0.4. Unless someone has already done that kind of verification. I don't think I can run the same HA stress test that we ran yesterday since all the pre-prod servers are in a different zone with no routing between them.

dtseiler commented 6 years ago

In the meantime, in my pre-prod database, every 10 seconds I see errors because the function powa_take_snapshot() doesn't exist (presumably after dropping the extension). Is there a way to disable that without having to restart the database for the shared_preload_libraries to take effect?

marco44 commented 6 years ago

Yes, put powa.frequency to -1 and reload

rafaelorafaelo commented 6 years ago

Hi, I can confirm this bug. We upgraded from PostgreSQL 9.5.12 to 9.6.9, using the same extension versions in both PG versions. After upgrading pg_qualstats from 1.0.3 to 1.0.4, the out of shared memory errors seems to be gone away. The first time, I saw this error, was during running pg_dump with parallel jobs.

rjuju commented 6 years ago

@rafaelorafaelo Thanks for confirming.

I'll close this issue now, feel free to reopen it needed.

powa-team / powa

invalid memory allocation, out of shared memory #104