pgpool / pgpool2

This is the official mirror of git://git.postgresql.org/git/pgpool2.git. Note that this is just a *mirror* - we don't work with pull requests on github. Please subscribe to pgpool-hackers mailing list from our website and submit your patch to this mailing list.
https://www.pgpool.net
Other
305 stars 87 forks source link

pgpool2 crashing on armhf #14

Closed cpaelzer closed 6 years ago

cpaelzer commented 6 years ago

Hi, this was found looking at recent pgpool2 3.7.3 (not 4 but also not outdated I'd think). Automated Ubuntu tests hit an error, but only on armhf.

I opened Ubuntu bug 1777418 to track that and did some debugging.

I found that we hit a segfault (only keeping the interesting top of the backtrace):

Program received signal SIGSEGV, Segmentation fault.
#0 strlen () at ../sysdeps/arm/armv6t2/strlen.S:85
#1 0x76b3c754 in _IO_vfprintf_internal (s=s@entry=0x7e974780,
    format=format@entry=0x6769e0 "%s: DB node id: %d backend pid: %d statement: \"%s\" message: \"%s\"", ap=..., ap@entry=...)
    at vfprintf.c:1643
#2 0x76baaba6 in ___vsnprintf_chk (
    s=s@entry=0x676de8 "pool_send_and_wait: Error or notice message from backend: : DB node id: 0 backend pid: 2827 statement: \"CREATE TABLE autocommit (a int);\" message: \"BLE IF EXISTS utocommit;\" message: \"table \"utocommit"...,
    maxlen=<optimized out>, maxlen@entry=1024, flags=flags@entry=1, slen=slen@entry=4294967295,
    format=format@entry=0x6769e0 "%s: DB node id: %d backend pid: %d statement: \"%s\" message: \"%s\"", args=args@entry=...)
    at vsnprintf_chk.c:63
#3 0x0048e7ac in vsnprintf (__ap=..., __fmt=0x6769e0 "%s: DB node id: %d backend pid: %d statement: \"%s\" message: \"%s\"",
    __n=1024,
    __s=0x676de8 "pool_send_and_wait: Error or notice message from backend: : DB node id: 0 backend pid: 2827 statement: \"CREATE TABLE autocommit (a int);\" message: \"BLE IF EXISTS utocommit;\" message: \"table \"utocommit"...)
    at /usr/include/arm-linux-gnueabihf/bits/stdio2.h:77
#4 pvsnprintf (
    buf=0x676de8 "pool_send_and_wait: Error or notice message from backend: : DB node id: 0 backend pid: 2827 statement: \"CREATE TABLE autocommit (a int);\" message: \"BLE IF EXISTS utocommit;\" message: \"table \"utocommit"..., len=len@entry=1024,
    fmt=fmt@entry=0x6769e0 "%s: DB node id: %d backend pid: %d statement: \"%s\" message: \"%s\"", args=...)
    at utils/psprintf.c:124
#5 0x004bc13c in appendStringInfoVA (str=str@entry=0x7e9748dc,
    fmt=fmt@entry=0x6769e0 "%s: DB node id: %d backend pid: %d statement: \"%s\" message: \"%s\"", args=..., args@entry=...)
    at stringinfo.c:131
#6 0x0049a12e in errmsg (fmt=0x76f5a968 <__stack_chk_guard> "") at ../../src/utils/error/elog.c:658
#7 0x0047ff60 in per_node_error_log (backend=backend@entry=0x690998, node_id=node_id@entry=0,
    query=query@entry=0x6a4190 "CREATE TABLE autocommit (a int);",
    prefix=prefix@entry=0x4e3220 "pool_send_and_wait: Error or notice message from backend: ", unread=unread@entry=1 '\001')
    at protocol/pool_proto_modules.c:3141

I found that this is due to the "message" argument pointing to 0x25000000 and the access to that from the libs on the print function fails.

(gdb) p prefix
$24 = 0x4e3220 "pool_send_and_wait: Error or notice message from backend: "
(gdb) p node_id
$25 = 0
(gdb) p slot->pid
$26 = 185270272
(gdb) p query
$27 = 0x6a4190 "CREATE TABLE autocommit (a int);"
(gdb) p message
$28 = 0x25000000 <error: Cannot access memory at address 0x25000000>

Now the questions was why this is happening, 0x25000000 seems unwanted and breaks. This is from pool_extract_error_message. In my case by the test it runs there it gets to the point where it won't find an associated error message:

4366 if (kind != 'E' && kind != 'N')
4367 {
4368 pool_unread(backend, str_buf->data, readlen);
4369 »···»···»···»···pfree(str_message_buf->data);
4370 »···»···»···»···pfree(str_message_buf);
4371 »···»···»···»···pfree(str_buf->data);
4372 »···»···»···»···pfree(str_buf);
4373 ereport(ERROR,
4374 (errmsg("unable to extract error message"),
4375 errdetail("invalid message kind \"%c\"",kind)));
4376 }

Then on returning from that function it will do the unexpected.

(gdb) fin
Run till exit from #0 pool_extract_error_message (read_kind=read_kind@entry=1 '\001', backend=backend@entry=0x130e9a0, major=3,
    unread=unread@entry=1 '\001', message=message@entry=0x7ee61928) at protocol/pool_process_query.c:4439
0x004e3f24 in per_node_error_log (backend=backend@entry=0x130d998, node_id=node_id@entry=0,
    query=query@entry=0x1321190 "CREATE TABLE autocommit (a int);",
    prefix=prefix@entry=0x547220 "pool_send_and_wait: Error or notice message from backend: ", unread=unread@entry=1 '\001')
    at protocol/pool_proto_modules.c:3139
3139 if (pool_extract_error_message(true, CONNECTION(backend, node_id), MAJOR(backend), true, &message) > 0)
Value returned is $26 = 255
(gdb) p *message
Cannot access memory at address 0x25000000

That does not match the definition of the function:

  • Return values are:
  • 0: not error or notice message
  • 1: succeeded to extract error message
  • -1: error)

This would also explain why it is arm specific, if this is some magic "int is different here" (odd still but that is what happens).

Yet with that known and seeing that most other calls check for == 1 like pool_auth.c pool_do_auth 298 if (pool_extract_error_message(false, MASTER(cp), protoMajor, true, &message) == 1)

The fix seems rather easy, I have built it with == 1 in the call to it and it works fine now. I'll prep something against this repo instead of the Ubuntu version and propose it here.

cpaelzer commented 6 years ago

The fix that seems to work for me is pushed now and available for your review.

tatsuo-ishii commented 6 years ago

Thank you for the problem analysis and the patch. I found that pool_extract_error_message() is broken because it tries to return bool variable "ret" despite the fact that the function's return type is int. The bug has been there since 3.4 was out, I guess.

Why you see the bug but we haven't seen it? It seems on armhf, char type (that is typedefed to bool in pgpool-II) is unsigned char type, so if it's promoted to int type, then the sign bit is dropped off. On the other hand, on x86 or amd64, char is signed, and the sign bit is preserved when char type is converted to int type. So, this code block: if (pool_extract_error_message(...) > 0) { ereport(LOG, (errmsg("%s: DB node id: %d backend pid: %d statement: \"%s\" message: \"%s\"", prefix, node_id, ntohl(slot->pid), query, message))); }

does not execute and we are fine.

In summary, I think your patch definitely fixes the problem and I will apply it to the upstream. However I also wants to fix the oversight of pool_extract_error_message() I mentioned above along with your patch.

cpaelzer commented 6 years ago

Yeah I don't mind the exact fix - I agree to the ret of bool being the real root cause. Didn't look at that and wanted to stick with how the other callers handled it in case there was a hidden reason for it.

Please give me a ping here (I guess GH will do) when (either) patch is merged as I then will backport into the current version in Ubuntu - that way I don't have to wait for the next full release bug can get it working.

Thanks for considering my changes, looking forward to hear again when it is integrated.

cpaelzer commented 6 years ago

Oh I've seen the patch merged in the proper upstream repo - thanks! Just not yet mirrored.

IMHO we can close this - thanks for your fast response!