cpaelzer closed this 6 years ago
The fix that seems to work for me is pushed now and available for your review.
Thank you for the problem analysis and the patch. I found that pool_extract_error_message() is broken because it returns the bool variable "ret" even though the function's return type is int. The bug has been there since 3.4 was released, I guess.
Why do you see the bug when we haven't? It seems that on armhf the char type (which bool is typedefed to in pgpool-II) is unsigned, so when it is promoted to int the sign is lost and a negative value turns into a large positive one. On x86 and amd64, on the other hand, char is signed, and the sign is preserved when char is converted to int. So there this code block:

    if (pool_extract_error_message(...) > 0)
    {
        ereport(LOG,
                (errmsg("%s: DB node id: %d backend pid: %d statement: \"%s\" message: \"%s\"",
                        prefix, node_id, ntohl(slot->pid), query, message)));
    }

does not execute and we are fine.
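The promotion difference described above can be reproduced in isolation. The following is only a minimal sketch, not the pgpool-II code: the function names are hypothetical, the two typedefs stand in for how plain char behaves on each platform, and it assumes that the "no message found" path stores -1 in "ret" and that char is 8 bits wide.

```c
/* Hypothetical stand-ins: pgpool-II typedefs bool to plain char, whose
 * signedness depends on the platform ABI. */
typedef signed char   bool_as_signed_char;   /* plain char on x86/amd64 */
typedef unsigned char bool_as_unsigned_char; /* plain char on armhf */

/* Mimics a function declared to return int that returns a char-sized
 * "bool" variable. Assumption: -1 marks "no error message found". */
int extract_with_signed_char(void)
{
    bool_as_signed_char ret = -1;
    return ret;   /* sign-extended on promotion: stays -1 */
}

int extract_with_unsigned_char(void)
{
    bool_as_unsigned_char ret = -1;   /* wraps to 255 when stored */
    return ret;                       /* zero-extended on promotion: 255 */
}
```

With the signed variant a caller's "> 0" test is false and the logging block is skipped. With the unsigned variant the same test is true, so the block runs even though no message was extracted.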
In summary, I think your patch definitely fixes the problem and I will apply it upstream. However, I also want to fix the oversight in pool_extract_error_message() that I mentioned above along with your patch.
Yeah, I don't mind the exact fix. I agree that "ret" being a bool is the real root cause. I didn't look at that and wanted to stick with how the other callers handled it in case there was a hidden reason for it.
Please give me a ping here (I guess GH will do) when (either) patch is merged, as I will then backport it into the current version in Ubuntu. That way I don't have to wait for the next full release but can get it working.
Thanks for considering my changes, looking forward to hearing from you again when it is integrated.
Oh I've seen the patch merged in the proper upstream repo - thanks! Just not yet mirrored.
IMHO we can close this - thanks for your fast response!
Hi, this was found looking at recent pgpool2 3.7.3 (not 4 but also not outdated I'd think). Automated Ubuntu tests hit an error, but only on armhf.
I opened Ubuntu bug 1777418 to track that and did some debugging.
I found that we hit a segfault (only keeping the interesting top of the backtrace):
I found that this is due to the "message" argument pointing to 0x25000000; the library's print function then fails when it accesses that address.
Now the question was why this is happening, since 0x25000000 seems unwanted and breaks. This comes from pool_extract_error_message. In my case, with the test that is run there, it reaches the point where it does not find an associated error message:
Then on returning from that function it will do the unexpected.
That does not match the definition of the function:
This would also explain why it is arm specific, if this is some magic "int is different here" (odd still but that is what happens).
Yet with that known and seeing that most other calls check for == 1 like
pool_auth.c pool_do_auth 298 if (pool_extract_error_message(false, MASTER(cp), protoMajor, true, &message) == 1)
The fix seems rather easy, I have built it with == 1 in the call to it and it works fine now. I'll prep something against this repo instead of the Ubuntu version and propose it here.
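To illustrate why comparing against 1 sidesteps the problem on the caller side, here is a small sketch. The guard functions are hypothetical helpers, not part of pgpool-II; "rc" stands for whatever pool_extract_error_message() returned, and 255 is the value an unsigned 8-bit char holding -1 would promote to.

```c
/* Hypothetical caller-side guards, one per comparison style. */
int guard_greater_than_zero(int rc)
{
    return rc > 0;    /* also accepts a mangled value such as 255 */
}

int guard_equals_one(int rc)
{
    return rc == 1;   /* accepts only the explicit success value */
}
```

For rc values of -1, 0 and 1 both guards agree; they differ only for the unexpected 255, which the "== 1" form rejects.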