open-telemetry / opentelemetry-java-instrumentation

OpenTelemetry auto-instrumentation and instrumentation libraries for Java
https://opentelemetry.io
Apache License 2.0
1.93k stars 842 forks source link

Segmentation fault during thread execution with dynamic security #9330

Open pepperkick opened 1 year ago

pepperkick commented 1 year ago

Describe the bug

Hi Community,

I work in a team whose goal is to enable observability for all the company products which are mostly written in java.

We have an application which runs in Kuberentes as a deployment. The application is deployed using Tomcat server. We injected OpenTelemetry agent via annotation to the application deployment manifest. The agent is successfully injected with the application and we are able to see the basic traces and logs.

During a task it downloads certain jars during runtime and loads them. One of those downloaded jar (proprietary, lets call it dep.jar) has functionality to load security dynamically and then execute a thread pool. The security feature in dep.jar can be enabled or disabled via env variable.

Now, These are the observations

We want to understand why injecting appd agent along with otel agent fixes the Segmentation fault issue. We also suppressed certain libraries (listed below) which fixes the AccessControlException issue but only when appd agent is injected. If appd agent is not injected and only otel agent then it fails with Segmentation fault issue regardless of suppression.

Following libraries were suppressed

Steps to reproduce

Currently not available since proprietary application is used, will try to figure out a easier way to reproduce this

Expected behavior

Expected behavior is that the application will work when only otel agent is injected and libraries are suppressed without appd agent injected.

Actual behavior

Actual behavior is that application crashes with Segmentation fault when otel agent is injected and without appd agent injected.

Javaagent or library instrumentation version

1.23.0

Environment

JDK: Picked up JAVA_TOOL_OPTIONS: -javaagent:/otel-auto-instrumentation/javaagent.jar [otel.javaagent 2023-08-22 13:49:36:433 +0000] [main] INFO io.opentelemetry.javaagent.tooling.VersionLogger - opentelemetry-javaagent - version: 1.29.0 22-Aug-2023 13:49:44.273 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Server version name: Apache Tomcat/8.5.81 22-Aug-2023 13:49:44.282 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Server built: Jun 8 2022 21:30:15 UTC 22-Aug-2023 13:49:44.283 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Server version number: 8.5.81.0 22-Aug-2023 13:49:44.283 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log OS Name: Linux 22-Aug-2023 13:49:44.283 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log OS Version: 4.18.0-425.10.1.el8_7.x86_64 22-Aug-2023 13:49:44.283 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Architecture: amd64 22-Aug-2023 13:49:44.283 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Java Home: /usr/lib/jvm/java-1.8.0-amazon-corretto/jre 22-Aug-2023 13:49:44.283 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log JVM Version: 1.8.0_382-b05 22-Aug-2023 13:49:44.284 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log JVM Vendor: Amazon.com Inc. 22-Aug-2023 13:49:44.284 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log CATALINA_BASE: /opt/apache-tomcat-8.5.81 22-Aug-2023 13:49:44.284 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log CATALINA_HOME: /opt/apache-tomcat-8.5.81 22-Aug-2023 13:49:44.284 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Command line argument: -javaagent:/otel-auto-instrumentation/javaagent.jar 22-Aug-2023 13:49:44.284 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Command line argument: -Djava.util.logging.config.file=/opt/webserver/conf/logging.properties 22-Aug-2023 13:49:44.285 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Command line argument: -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager 22-Aug-2023 13:49:44.285 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Command line argument: -XX:+UseStringDeduplication 22-Aug-2023 13:49:44.285 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Command line argument: -XX:MaxRAMPercentage=70.0 22-Aug-2023 13:49:44.285 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Command line argument: -XX:MaxMetaspaceSize=4096m 22-Aug-2023 13:49:44.285 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Command line argument: -XX:+UseParallelGC 22-Aug-2023 13:49:44.285 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Command line argument: -XX:ParallelGCThreads=1 22-Aug-2023 13:49:44.285 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Command line argument: -XX:MaxGCPauseMillis=400 22-Aug-2023 13:49:44.285 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Command line argument: -Xms1g 22-Aug-2023 13:49:44.286 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Command line argument: -XX:ReservedCodeCacheSize=512m 22-Aug-2023 13:49:44.286 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Command line argument: -XX:+UseCodeCacheFlushing 22-Aug-2023 13:49:44.286 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Command line argument: -Djdk.tls.ephemeralDHKeySize=2048 22-Aug-2023 13:49:44.286 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Command line argument: -Djava.protocol.handler.pkgs=org.apache.catalina.webresources 22-Aug-2023 13:49:44.286 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Command line argument: -Dorg.apache.catalina.security.SecurityListener.UMASK=0027 22-Aug-2023 13:49:44.286 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Command line argument: -Djava.security.egd=file:/dev/./urandom 22-Aug-2023 13:49:44.286 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Command line argument: -Dignore.endorsed.dirs= 22-Aug-2023 13:49:44.287 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Command line argument: -Dcatalina.base=/opt/webserver 22-Aug-2023 13:49:44.287 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Command line argument: -Dcatalina.home=/opt/webserver 22-Aug-2023 13:49:44.287 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Command line argument: -Djava.io.tmpdir=/opt/webserver/temp

Java Version: openjdk version "1.8.0_382" OpenJDK Runtime Environment Corretto-8.382.05.1 (build 1.8.0_382-b05) OpenJDK 64-Bit Server VM Corretto-8.382.05.1 (build 25.382-b05, mixed mode)

OS: RHEL 8 amazoncorretto:8

Additional context

No response

laurit commented 1 year ago

Usually when openjdk segmentation faults it generates a hs_err_pid<pid>.log file with details of the crash in the current directory of the path set with -XX:ErrorFile. See if you can find this crash log. For running application with security manager see https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/docs/advanced-configuration-options.md#running-application-with-security-manager it may help agains the AccessControlExceptions.

pepperkick commented 1 year ago

Usually when openjdk segmentation faults it generates a hs_err_pid<pid>.log file with details of the crash in the current directory of the path set with -XX:ErrorFile. See if you can find this crash log. For running application with security manager see https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/docs/advanced-configuration-options.md#running-application-with-security-manager it may help agains the AccessControlExceptions.

Appreciate the quick reply. I tried setting the variable OTEL_JAVAAGENT_EXPERIMENTAL_SECURITY_MANAGER_SUPPORT_ENABLED but still faced the Segmentation Fault error. Maybe this is missing in v1.23 so will check later with newer version.

I am not sure but I do not see any hs_err log files. I tried setting -XX:ErrorFile=/tmp/error.log option and made sure it is propagated in pod logs but no error.log file is generated so not sure if the error is getting captured or not.

Will try out more things meanwhile and update if I find more info. Any other input is also appreciated.

laurit commented 1 year ago

You are correct, that flag is missing in 1.23.0. Based on the current information there isn't anything we can do. Usually when jvm crashes it is a bug in the jvm, you may wish to try a different jvm version (e.g. jdk 11) and see if it works better. In case of a jvm crash common advice is to disable the JIT compiler with -Xint and see if this changes the situation. You could also enable core dumps and see if that reveals clues about the nature of the crash. You could also add more debug logging to your application and try to determine whether it crashes in some specific place, perhaps it is possible to create a small program that reproduces the issue that way.

pepperkick commented 1 year ago

Thanks for the additional suggestions.

Trying different JDK version unfortunately is not an option currently since the application was created with java 8 and was never upgraded or tested with new java version.

I will try disabling JIT and see what happens. Also as per suggestion I will try to build a simple application which reproduces the issue. I have started going through the code base for this.

I did try dumping core but could not find anything meaningful pointing to the JVM crash, that said I am not really an expert on Java related debugging stuff.

breedx-splk commented 1 year ago

@pepperkick Do you know if the proprietary dep.jar is doing anything with native code, like JNI? Have you also opened an issue with the team that creates dep.jar to see if it is something they can look into?

pepperkick commented 1 year ago

@pepperkick Do you know if the proprietary dep.jar is doing anything with native code, like JNI? Have you also opened an issue with the team that creates dep.jar to see if it is something they can look into?

As far as I am aware it does not. Will check with the team when I can. All are busy with release cycle so unable to get much info.

Also I want to understand why adding the appd agent fixes the Segmentation Fault? Is appd agent known to fix some issues similar to this?

trask commented 1 year ago

Also I want to understand why adding the appd agent fixes the Segmentation Fault? Is appd agent known to fix some issues similar to this?

you'd need to ask the appd folks, but it wouldn't be surprising to me if some unexpected interaction with appd avoids the JVM crash (see heisenbug 😄)

pepperkick commented 1 year ago

Also I want to understand why adding the appd agent fixes the Segmentation Fault? Is appd agent known to fix some issues similar to this?

you'd need to ask the appd folks, but it wouldn't be surprising to me if some unexpected interaction with appd avoids the JVM crash (see heisenbug 😄)

Yeah, I agree :P Wanted to confirm if something is already known around this. Seems like it is an edge case. I have not managed to reproduce the issue separately yet. It will take some time I think.

laurit commented 1 year ago

You could try with a fastdebug build from https://downloads.corretto.aws/#/downloads?build=nightly&version=8 Fastdebug builds contain extra assertions that are removed from release builds, perhaps this will provide a clue about why the crash happens.