newrelic / newrelic-java-agent

The New Relic Java agent
Apache License 2.0
202 stars 143 forks source link

NR agent fails on IBM Semeru Runtimes #841

Closed mstoodle closed 2 years ago

mstoodle commented 2 years ago

Is your feature request related to a problem? Please describe.

Ref discussion here: https://discuss.newrelic.com/t/feature-idea-java-new-relic-agent-doesnt-work-with-openj9-17/170671

When IBM Semeru Runtimes Java 11 recently changed its java.vendor string to "IBM Corporation", the NR Java agent seems to have stopped working and apparently never supported Java 17 or any other recent IBM Java release. When I asked for more details about why the vendor string mattered, I was pointed at this repo to investigate for myself.

I discovered that the AttachLifecycleObserver is disabled for any JVM that sets java.vendor = "IBM Corporation" which includes all Semeru Runtimes (Open and Certified Edition) as well as IBM's older IBM SDK for Java 7, 7.1, and 8. There isn't any reasoning for it in the code, that I could find. I'm suspicious that the reason this change was made is no longer relevant.

Since the NR agent worked just fine for Java 11 before we changed the java.vendor string and the Eclipse OpenJ9 JVM built into IBM Semeru Runtimes uses the same source code across all Java releases (8,11,17,18), I think the NR agent should just work. That probably includes the IBM SDK for Java 8 (which also uses the Eclipse OpenJ9 JVM).

Feature Description

I believe all that's needed is to remove isAgentSafe() from the AttachLifecycleObserver class. I have a pull request ready to go if the project would like me to submit it.

Describe Alternatives

A clear and concise description of any alternative solutions or features you've considered. Are there examples you could link us to?

(migrate to Jira)

Additional context

IBM Semeru Runtimes are the same binaries as used to be built by the AdoptOpenJDK community as "OpenJDK with Eclipse OpenJ9". When AdoptOpenJDK moved to the Eclipse Foundation, they were prevented (by contract with Oracle) from continuing to produce the OpenJDK with OpenJ9 binaries. IBM stepped forward at the time and has produced the same binaries under the name "IBM Semeru Runtimes": built with open source components and the binaries are available at no cost for everyone to use. As chief architect for IBM Java and Semeru Runtimes, I would love to see New Relic agents working smoothly with Semeru Runtimes for all our Java releases.

Priority

Really Want. We have customers who have been happily using NR with Semeru Runtimes 11 who are now broken and, while we have a workaround that involves an agent that mashes back the older java.vendor setting, it's a super hack.

kford-newrelic commented 2 years ago

@mstoodle Thank you for your efforts to determine what was wrong and summarizing that information here - we really appreciate it!

cardonator commented 2 years ago

Would really love to see this change implemented as all of our pipelines broke that rely on OpenJ9 recently.

mstoodle commented 2 years ago

@kford-newrelic In case I can help to accelerate the "back end" of this change (thinking optimistically!) ... what testing should I run locally to help ensure the pull request is working well? I tried running ./gradlew clean build --parallel but that ran into some failures on my laptop even without my changes so wasn't sure if that was the right set of integration tests to be running...

kford-newrelic commented 2 years ago

@mstoodle yes, those are the tests that you can run locally the most easily. There are another set of integration tests that get run when PRs are submitted but it's best to let GHA run those when the PR is submitted.

We've been making some fixes and improvements to our tests, since some have been a bit unreliable. From a practical perspective, you might just want to run before/after your proposed changes and if the outcome is the same from the tests, that might be good enough.

kford-newrelic commented 2 years ago

Not selected for next quarter's agent roadmap. Will consider for a future agent release.

mstoodle commented 2 years ago

Thanks for the update, @kford-newrelic .

For the record, the change needed is very minimal from a coding perspective (See https://github.com/newrelic/newrelic-java-agent/compare/main...mstoodle:ibmjava-changes). But I can understand if you'll want to introduce some additional testing on the free Semeru Runtimes binaries (available here: https://ibm.com/semeru-runtimes/downloads) that it would enable you to support, and that may be effort you aren't willing to commit to in the next quarter.

It is a shame, though, that all those Fortune 500 and other companies that rely on Semeru Runtimes to run their businesses can't use your product in the meantime. I guess not many of them are complaining about it, which means it could represent an untapped market segment for you... I would love to tell all of them how well New Relic "now" works out of the box with IBM Semeru Runtimes :) .

kford-newrelic commented 2 years ago

@mstoodle don't give up on us yet!

If it's possible to squeeze this into our next quarter's work, we certainly will. Stay tuned!

tbradellis commented 2 years ago

Hey @mstoodle. Looks like we built a bypass in for this code path. Can you run a current agent version and use the system property? -Dnewrelic.try_ibm_attach=true It will need to be a system property (not env or newrelic.yml config). This check happens before we initialize our config service. It's an undocumented config. Would you be willing to run with that setting and provide us feedback?
This would help us determine if we should just drop that code path as you've shown in your PR or possibly we'll just document this configuration (or explore other relevant options for improving the IBM Semeru experience).

bogdanbrindusan commented 2 years ago

@tbradellis I tried your suggestion but I can't get it working. Not sure if I've set that flag correctly or not. I'm posting the stacktrace below to have it on the record, as I don't see any other stacktrace here.

2022-06-28T14:12:09,925+0000 [1 1] com.newrelic ERROR: Unable to start the New Relic Agent. Your application will continue to run but it will not be monitored.
java.lang.NoClassDefFoundError: com.newrelic.agent.bridge.datastore.RecordSql
    at com.newrelic.agent.config.TransactionTracerConfigImpl.<init>(TransactionTracerConfigImpl.java:97) ~[newrelic.jar:7.8.0]
    at com.newrelic.agent.config.TransactionTracerConfigImpl.createTransactionTracerConfigImpl(TransactionTracerConfigImpl.java:352) ~[newrelic.jar:7.8.0]
    at com.newrelic.agent.config.TransactionTracerConfigImpl.createTransactionTracerConfig(TransactionTracerConfigImpl.java:345) ~[newrelic.jar:7.8.0]
    at com.newrelic.agent.config.AgentConfigImpl.initTransactionTracerConfig(AgentConfigImpl.java:700) ~[newrelic.jar:7.8.0]
    at com.newrelic.agent.config.AgentConfigImpl.<init>(AgentConfigImpl.java:336) ~[newrelic.jar:7.8.0]
    at com.newrelic.agent.config.AgentConfigImpl.createAgentConfig(AgentConfigImpl.java:283) ~[newrelic.jar:7.8.0]
    at com.newrelic.agent.config.ConfigServiceFactory.createConfigService(ConfigServiceFactory.java:38) ~[newrelic.jar:7.8.0]
    at com.newrelic.agent.Agent.tryToInitializeServiceManager(Agent.java:199) [newrelic.jar:7.8.0]
    at com.newrelic.agent.Agent.continuePremain(Agent.java:140) [newrelic.jar:7.8.0]
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) ~[?:?]
    at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) ~[?:?]
    at java.lang.reflect.Method.invoke(Unknown Source) ~[?:?]
    at com.newrelic.bootstrap.BootstrapAgent.startAgent(BootstrapAgent.java:193) [newrelic.jar:7.8.0]
    at com.newrelic.bootstrap.BootstrapAgent.premain(BootstrapAgent.java:130) [newrelic.jar:7.8.0]
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) ~[?:?]
    at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) ~[?:?]
    at java.lang.reflect.Method.invoke(Unknown Source) ~[?:?]
    at sun.instrument.InstrumentationImpl.loadClassAndStartAgent(Unknown Source) [?:?]
    at sun.instrument.InstrumentationImpl.loadClassAndCallPremain(Unknown Source) [?:?]
Caused by: java.lang.ClassNotFoundException: com.newrelic.agent.bridge.datastore.RecordSql
    at jdk.internal.loader.BuiltinClassLoader.loadClass(Unknown Source) ~[?:?]
    at jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(Unknown Source) ~[?:?]
    at java.lang.ClassLoader.loadClass(Unknown Source) ~[?:?]
    ... 21 more
zalacmar commented 2 years ago

This would really help as we're stuck on old Openj9

mstoodle commented 2 years ago

Just to confirm, @tbradellis : by "It will need to be a system property" you mean it needs to be specified directly on the command-line, yes?

tbradellis commented 2 years ago

@mstoodle It just needs to be passed as an argument to the java process (vs an application argument).
From the command line it would look like java -javaagent:/path/newrelic.jar -Dnewrelic.try_ibm_attach=true myapp.jar app.jar applicationArgs ... However, just add the property wherever other java args are being added for the deployment config: I often tell folks just wherever they are setting heap or the javaagent argument itself (i.e. -Xms or -Xmx or -javaagent:/path/newrelic.jar) , you'd set this property in the same way.

@bogdanbrindusan I'll see if I can replicate this as soon as I can and comment here. Can probably do that in the coming week.

tbradellis commented 2 years ago

@mstoodle and @bogdanbrindusan and @zalacmar I was able to replicate the java.lang.NoClassDefFoundError: com.newrelic.agent.bridge.datastore.RecordSql and spend some time troubleshooting this. In doing so I discovered that I gave the wrong argument to workaround this. Apologies!

tldr The system property to use is: -Dibm_iv25688_workaround=false It should be set false i.e. do not use the workaround. I confirmed and was able to use the Semeru 17 runtime without any issue.

@bogdanbrindusan You are right, we are pointlessly failing on some IBM runtimes. However, there are a few more places in the code that would need to be cleaned up that are related to what was put in your branch and we may need to take a slightly different approach. Also, I'll need to do some testing and add some semeru et. al. runtimes to our integration tests.

A couple notes: As you were suggesting, this code stems from an old bug that has been resolved. https://www.ibm.com/support/pages/apar/IV25688. This bug only exists in Java 6 and 7 jdks anyway, so some of this code is completely obsolete (as our current Java Agent has dropped support for java 6 and 7).

I started working up an explainer for what the various code paths are doing and why we hit the error when we go down the workaround path. Since this is an older path that deals with changes in classloaders across jdks, I'll need to give a thorough presentation of the control flow to the team for review before ripping it out. I'll post again here for full visibility, because it's pretty interesting. For now, setting -Dibm_iv25688_workaround=false should work.

The crux of the coding issue really was that when we go down the ibmworkaround path we never perform addBridgeJarToClassPath(inst, AGENT_BRIDGE_DATASTORE_JAR_NAME); hence NoClassDefFound for RecordSql which is archived in the agent-bridge-datastore.jar).

Thanks everyone for bringing this up!

For visibilty: The system property to use is: -Dibm_iv25688_workaround=false It should be set false i.e. do not use the workaround. I confirmed and was able to use the Semeru 17 runtime without any issue.

tbradellis commented 2 years ago

There's some bad logic here: https://github.com/newrelic/newrelic-java-agent/blob/da8f5509331144d64a7ba33bf41316066e5c7f42/newrelic-agent/src/main/java/com/newrelic/agent/config/IBMUtils.java#L32

That logic error ensures any IBM 1.8 and above will take the workaround path (true)...fixing this line could resolve the issue as well without removing everything -I'd prefer removing all of it though; it is no longer relevant.
Here is where the workaround path is decided: https://github.com/newrelic/newrelic-java-agent/blob/da8f5509331144d64a7ba33bf41316066e5c7f42/newrelic-agent/src/main/java/com/newrelic/bootstrap/BootstrapAgent.java#L170

kford-newrelic commented 2 years ago

Migrated to Jira - https://issues.newrelic.com/browse/NR-44473

Considering for Q2 engineering work.

bogdanbrindusan commented 2 years ago

@tbradellis thanks for looking into this and put this valuable information here.

@kford-newrelic is that Q2 of 2023? I'm not familiar with how Qs are split in NR. Unless it's meant to be Q3 of 2022

tbradellis commented 2 years ago

Our current gradle version is blocking us from easily integrating Semeru builds into our test suite. Long story, but we had upgrading our gradle version on the road map anyway. This would get us to 7.5.1. Gradle will work well with Semeru builds starting in 7.4. Still some work to be done here because of compileScala jobs https://github.com/newrelic/newrelic-java-agent/pull/997

I was able to force some tests to run on OpenJ9. There are some tests failures. Some look to just be implementation specific and will require some separate tests to be created specifically for OpenJ9 and excluded from other runtimes. summary for agent integration test failures here: https://github.com/newrelic/newrelic-java-agent/pull/993#issuecomment-1238571291

@bogdanbrindusan We are currently in our Q2 2023 fiscal year. The hope was to get this through right away, but these additional hurdles will push this work into Q3 -which I believe starts in Oct.

workato-integration[bot] commented 2 years ago

https://issues.newrelic.com/browse/NEWRELIC-3836

workato-integration[bot] commented 2 years ago

Jira CommentId: 97306 Commented by bellis:

https://github.com/newrelic/newrelic-java-agent/pull/993

workato-integration[bot] commented 2 years ago

Work has been completed on this issue.

tbradellis commented 2 years ago

@bogdanbrindusan and @mstoodle I've removed the IBM workaround logic. We were unable to test as fully as we'd liked. Gradle seems to have trouble with downloading IBM Semeru builds. We hoped to get our full suites running. I could tell in some of my tests that there are some special cases I'll need to address in our integration tests. Not being able to download them locally with gradle made it really cumbersome to step through.

At any rate, this is going into our 7.11 release. There's a snapshot build linked below. If you are able to run this agent and provide any feedback it would be greatly appreciated, even if a simple smoke test to confirm you don't hit any issues. https://github.com/newrelic/newrelic-java-agent/suites/8940160106/artifacts/410261834

Thanks again for bringing this to our attention.

mstoodle commented 2 years ago

Thank you @tbradellis !! That's fantastic news! On the gradle point, would you please take a few moments to summarize the problem you experienced in an issue at the Semeru Runtimes site so we have a record of it and can start investigating?

https://github.com/ibmruntimes/Semeru-Runtimes/issues/new

Thanks again!