powa-team / pg_qualstats

A PostgreSQL extension for collecting statistics about predicates, helping find what indices are missing
Other
274 stars 26 forks source link

pg_qualstats causing coredump and db crash #3

Closed bbotev01 closed 6 years ago

bbotev01 commented 8 years ago

I have been using pg_qualstats version 0.0.7 for several months without issues, but since we upgraded to Postgres 9.5.0 in late January one of our particularly load intensive databases has experienced 2 crashes related to pg_qualstats. Here is what the postgres log looks like:

2016-03-31 10:09:50.216 EDT [18427]: [10-1] user=,db=,host=,sessionid=56bf3b24.47fb LOG: server process (PID 13930) was terminated by signal 11: Segmentation fault 2016-03-31 10:09:50.216 EDT [18427]: [11-1] user=,db=,host=,sessionid=56bf3b24.47fb DETAIL: Failed process was running: COMMIT 2016-03-31 10:09:50.216 EDT [18427]: [12-1] user=,db=,host=,sessionid=56bf3b24.47fb LOG: terminating any other active server processes

And here is the corresponding message in the var log messages: kernel: postgres[13930]: segfault at 8 ip 00007f9fb41ca664 sp 00007ffe696f7b70 error 4 in pg_qualstats.so[7f9fb41c5000+7000]

The issue occurs even when the extension is not created in any of the databases. The above crash occurred just having the pg_qualstats library loaded with shared_preload_libraries. The crash is infrequent and I am not able to reproduce the condition causing the crash. I hope this is helpful in figuring out the issue causing this behavior. Thanks, Boyan Botev

rjuju commented 8 years ago

Ouch :(

The fact that the extension isn't created in any database should not have any impact, it only add SQL access functions to see the content of pg_qualstats shared memory.

Is it a linux x86 64 bits server (which distribution), and do you have a core dump by any chance? Could you install the postgresql and pg_qualstats debug information package, that'd help to get more useful logs, and know where the crash happened in pg_qualstats.

What's the result of addr2line -e postgres 00007f9fb41ca664 ?

What kind of load is there on the database (read and/write queries, kind of WHERE or JOIN clauses, datatype, operators or any relevant detail).

bbotev01 commented 8 years ago

It's 64bit RHEL. In particular 2.6.32-573.12.1.el6.x86_64. Unfortunately this issue only occurred on the production server. It happened 5 times in total. I had to take the extension out completely yesterday by removing it from the shared_preload_libraries. I'll install the debug package on a lower environment, but don't want to risk more crashes in the prod environment for the moment. I'll do my best to try to reproduce it in another place. I noticed that the process which caused the crash was running a nasty SQL select statement with 1750 bind variables all inside a where clause. It seems to be coming from an ORM tool because I don't think a person could write something like this. I also don't have the coredumps because the dumps could not fit in the location where they were being dumped.

The result of $ addr2line -e postgres 00007f9fb41ca664 ??:0

On Thu, Apr 7, 2016 at 4:34 AM, Julien Rouhaud notifications@github.com wrote:

Ouch :(

The fact that the extension isn't created in any database should not have any impact, it only add SQL access functions to see the content of pg_qualstats shared memory.

Is it a linux x86 64 bits server (which distribution), and do you have a core dump by any chance? Could you install the postgresql and pg_qualstats debug information package, that'd help to get more useful logs, and know where the crash happened in pg_qualstats.

What's the result of addr2line -e postgres 00007f9fb41ca664 ?

What kind of load is there on the database (read and/write queries, kind of WHERE or JOIN clauses, datatype, operators or any relevant detail).

— You are receiving this because you authored the thread. Reply to this email directly or view it on GitHub https://github.com/dalibo/pg_qualstats/issues/3#issuecomment-206759961

rjuju commented 8 years ago

Yes, it's better to avoid crashing production server again.

Could you send the query and definition of involved tables? If you need privacy, you can send it to powa(AT)dalibo.com. If you can't share them, can you specify the datatypes and operators used, and if they are ANDed or Ored? Also if it was a SELECT or other statement kind.

rjuju commented 8 years ago

Hello,

As I couldn't reproduce this issue with the queries you provided off-list, I tried sqlsmith to find issues, and found this one: https://github.com/dalibo/pg_qualstats/issues/5

Do you also use this kind of JOIN on expression?

bbotev01 commented 8 years ago

You may be onto something here. The query preceding each crash has a view with a bunch of subselects and expression joins in it. It is very possible this is the same issue.

On Tue, May 31, 2016 at 4:07 PM, Julien Rouhaud notifications@github.com wrote:

Hello,

As I couldn't reproduce this issue with the queries you provided off-list, I tried sqlsmith to find issues, and found this one: #5 https://github.com/dalibo/pg_qualstats/issues/5

Do you also use this kind of JOIN on expression?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/dalibo/pg_qualstats/issues/3#issuecomment-222804254, or mute the thread https://github.com/notifications/unsubscribe/AP70HfYc73kSO84SAj0evAtZDIDtY4loks5qHJTwgaJpZM4H_VGA .

rdunklau commented 8 years ago

Could you try against the latest release (1.0.0) and let us know if it fixes it ? Thanks.

rjuju commented 7 years ago

Hello @bbotev01, did you have the opportunity to test a newer release (1.0.2 is highly recomended).

For reference, your problem should have been fixed by https://github.com/dalibo/pg_qualstats/commit/e0ddf700f4e3f7e3da0320c62a7d89c2378798fb.