tada / pljava

PL/Java is a free add-on module that brings Java™ Stored Procedures, Triggers, Functions, Aggregates, Operators, Types, etc., to the PostgreSQL™ backend.
http://tada.github.io/pljava/

SIGSEGV under high load #265

Open radist-nt opened 4 years ago

radist-nt commented 4 years ago

Hi. About a year ago we used pl/java in production (it was version 1.5.1-BETA2, or maybe even 1.5.3-SNAPSHOT), but encountered periodic database restarts due to fatal errors in pl/java-related code. We have since migrated our code to pl/python (it runs slower and requires superuser, but there are no more fatal errors). Still, this information may be useful for project development.

# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00000000007d2da0, pid=36458, tid=0x00007f7bf38488c0
#
# JRE version: OpenJDK Runtime Environment (8.0_191-b12) (build 1.8.0_191-b12)
# Java VM: OpenJDK 64-Bit Server VM (25.191-b12 mixed mode, sharing linux-amd64 compressed oops)
# Problematic frame:
# C  [postgres: user database 127.0.0.1(49189) SELECT+0x3d2da0]  pg_detoast_datum+0x0
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

---------------  T H R E A D  ---------------

Current thread (0x0000000001d35000):  JavaThread "main" [_thread_in_native, id=36458, stack(0x00007ffc74aea000,0x00007ffc74cea000)]

siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x0000000000000000

The most frequently used pl/java function is text sanitizing with com.googlecode.owasp-java-html-sanitizer:owasp-java-html-sanitizer. The main part of the pl/java function code is:

import org.owasp.html.HtmlPolicyBuilder;
import org.owasp.html.PolicyFactory;
...
    private static final PolicyFactory NO_HTML_POLICY = new HtmlPolicyBuilder().toFactory();
    @Function (name = "sanitize_plain_text", schema = "public", onNullInput = OnNullInput.RETURNS_NULL, security = Security.INVOKER, effects = Effects.IMMUTABLE, parallel = Parallel.SAFE, leakproof = true, trust = Trust.SANDBOXED, comment = "Removes html tags and replace dangerous characters")
    public static String plainTextSanitize(String text) {
        if (text == null)
            return null;
        return NO_HTML_POLICY.sanitize(text);
    }

jcflack commented 4 years ago

Hi,

I notice in your example code that the function is annotated Parallel.SAFE.

Did you also encounter the segfault at any time with the default setting of Parallel.UNSAFE ?

As mentioned in the release notes, the user guide section on parallel query, and the wiki page on parallel query,

Although RESTRICTED and SAFE Java functions work in simple tests, there has been no exhaustive audit of the code to ensure that PL/Java’s internal workings never violate the behavior constraints on such functions. The support should be considered experimental, and could be a fruitful area for beta testing.

and

there may still be cases where a forbidden operation results from the internal workings of PL/Java itself. This has not been seen in testing (simple parallel queries with RESTRICTED or SAFE PL/Java functions work fine), but to rule out the possibility would require a careful audit of PL/Java's code. Until then, it would be prudent for any application involving parallel query with RESTRICTED or SAFE PL/Java functions to be first tested in a non-production environment.

I should probably make the Javadoc comments link to the user guide section to make those notes more likely to be seen.

Out of curiosity, what considerations led to marking the function Parallel.SAFE? I think use cases seeing a performance benefit would be rare, given that every worker process participating in a parallel query with a Java function marked SAFE would have to start its own JVM.

If the segfault is not reproducible with the default Parallel.UNSAFE then we should probably just add the details of your situation to the notes section of the parallel-query wiki page.

radist-nt commented 4 years ago

Our current setup is "PostgreSQL 9.6.17 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39), 64-bit". Last year's setup was 9.6.11 or 9.6.12 (I don't remember exactly). Afaik, PostgreSQL 9.6 uses parallel execution quite rarely (I've never seen parallel nodes in query plans). Also, 100% of sanitize_plain_text calls are made from other functions declared as parallel unsafe, and 99% of them are calls from pl/pgsql code or SQL with a very small amount of data.

Unfortunately, I can't switch back to the pl/java code to test the function with Parallel.UNSAFE, so I could not establish whether the issue also occurs with Parallel.UNSAFE.

jcflack commented 4 years ago

Hmm.

Would you be able to attach the entire hserr file that you pasted a portion of in the first comment?

If PostgreSQL came from Red Hat packages, I can probably obtain the corresponding versions and debuginfo. It sounds as if PL/Java was locally built. Do you still have the libpljava-so-*.so file that was in use at the time?

radist-nt commented 4 years ago

Here are the hs_err files that are left: hs_err.log.zip. PL/Java was built locally (the latest build commit was 78ef01b6e0b4b according to the local repo state). I'm not sure whether PostgreSQL came from packages... I'll ask the DBA about the PostgreSQL build and the libpljava-so-*.so file.

radist-nt commented 4 years ago

Sorry, I didn't find the libpljava-so-*.so file.

jcflack commented 4 years ago

I am doubtful that I can do much with this. There is only one frame in the stack trace, which is PostgreSQL's own pg_detoast_datum routine. Clearly, it was passed a null pointer. The absence of any calling stack frames leaves no practical way to determine where it was called from.
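
For readers who want to triage similar reports, the reading above rests on three fields of the hs_err header: the signal, the problematic frame, and the si_addr value. Below is a minimal sketch, assuming the excerpt posted in the first comment, that pulls those fields out with regular expressions. The class HsErrSummary and its method names are hypothetical helpers written for this illustration; they are not part of PL/Java, PostgreSQL, or the JDK, and the patterns are only matched against the lines shown in this issue.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration only; not part of PL/Java or the JDK. */
class HsErrSummary {
    // Lines copied from the hs_err excerpt in the first comment of this issue.
    static final String SAMPLE = String.join("\n",
        "#  SIGSEGV (0xb) at pc=0x00000000007d2da0, pid=36458, tid=0x00007f7bf38488c0",
        "# Problematic frame:",
        "# C  [postgres: user database 127.0.0.1(49189) SELECT+0x3d2da0]  pg_detoast_datum+0x0",
        "siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x0000000000000000");

    // Signal name from the "#  SIGSEGV (0xb) at pc=..." header line.
    private static final Pattern SIGNAL =
        Pattern.compile("^#\\s+(SIG\\w+) \\(0x[0-9a-f]+\\) at pc=", Pattern.MULTILINE);
    // Frame kind letter and function name from the problematic frame line.
    private static final Pattern FRAME =
        Pattern.compile("^# ([A-Za-z])\\s+\\[.*\\]\\s+(\\S+)\\+0x[0-9a-f]+\\s*$", Pattern.MULTILINE);
    // Fault code and faulting address from the siginfo line.
    private static final Pattern FAULT =
        Pattern.compile("si_code: \\d+ \\((\\w+)\\), si_addr: 0x([0-9a-fA-F]+)");

    /** Returns the requested group of the first match, or null if nothing matches. */
    static String first(Pattern p, String text, int group) {
        Matcher m = p.matcher(text);
        return m.find() ? m.group(group) : null;
    }

    /** Condenses the three fields into one line for quick triage. */
    static String describe(String report) {
        String signal = first(SIGNAL, report, 1);
        String frameKind = first(FRAME, report, 1);
        String function = first(FRAME, report, 2);
        String code = first(FAULT, report, 1);
        String addr = first(FAULT, report, 2);
        boolean nullAddress = addr != null && Long.parseUnsignedLong(addr, 16) == 0L;
        String kind = "C".equals(frameKind) ? "native" : "non-native";
        return signal + " (" + code + ") in " + kind + " frame " + function
            + (nullAddress ? ", faulting address is null" : ", faulting address 0x" + addr);
    }

    public static void main(String[] args) {
        // prints: SIGSEGV (SEGV_MAPERR) in native frame pg_detoast_datum, faulting address is null
        System.out.println(describe(SAMPLE));
    }
}
```

A summary like this only restates what the header already says. It cannot recover the missing caller frames, which is the real obstacle here.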

The call site might not even be within PL/Java. That doesn't mean I question that PL/Java is involved, but there might have been a null value returned at some point that is now being passed to pg_detoast_datum from elsewhere in PostgreSQL code. Based on the information available here, without a test case that can reproduce it, I may be at a dead end.

The lack of caller stack frames probably indicates that PostgreSQL was built without the -fno-omit-frame-pointer compiler option, so it wasn't possible to trace the caller frames back. In a PostgreSQL server built with that option, the hs_err file might give more useful information.