ibmruntimes / Semeru-Runtimes

Issue repo for all things IBM Semeru Runtimes
14 stars 3 forks source link

Object is cleaned too soon during unit testing #93

Open pedrolamarao opened 2 weeks ago

pedrolamarao commented 2 weeks ago

We have a modular system with a service provider scheme.

One of our providers is gRPC-based. We use a Cleaner to shutdown the managed channel when the provided service is collected. This is how the provider builds the service:

final var builder = Grpc.newChannelBuilder(target,credentials);
builder.defaultServiceConfig(serviceConfig);
final var channel = builder.build();
final var service = new GrpcFoo( getLogger(GrpcFoo.class), newBlockingStub(channel) );
cleaner.register(service,channel::shutdownNow);
return service;

In our unit testing, there is a test like this:

@Test
void smoke ()
{
  final var foo = new GrpcProvider().open(locator);
  assertNotNull( foo );
  assertTrue( foo.exists("AES128") );
}

This test passes in Adoptium and Liberica 11 and 17 JDKs, but fails in Semeru 11 and 17 JDKs with a Channel shutdownNow invoked exception. Debugging confirms that this is happening because the Cleaner is calling the Runnable while the test is still running.

If we refactor the test to something like this:

volatile GrpcFoo foo;

@Test
void smoke ()
{
  foo= new GrpcProvider().open(locator);
  assertNotNull( foo );
  assertTrue( foo.exists("AES128") );
}

the test passes.

It seems that the object is being collected too soon.

pshipton commented 2 weeks ago

@dmitripivkine @tajila fyi, pls take a look.

dmitripivkine commented 2 weeks ago

Is this test failing with -Xint too?

dmitripivkine commented 2 weeks ago

This is unlikely GC issue. Sounds as VM (or even JIT potentially). GC does not control set of the roots.

pedrolamarao commented 2 weeks ago

-Xint did not change the result with semeru-17.0.9 for Windows 64 bits.

pedrolamarao commented 2 weeks ago

Let me give you code a little less "redacted" to consider.

The Cleaner is defined here.

public interface KeysProvider
{
    Cleaner cleaner = Cleaner.create();

    // etc.
}

The channel is registered with the Cleaner in this method:

public final class GrpcProvider implements KeysProvider
{
    // other stuff

    @Override
    public Keys open (URI locator)
    {
       try 
       {
            // lots of peculiar configuration parsing

            final var builder = Grpc.newChannelBuilder(target,credentials);
            builder.defaultLoadBalancingPolicy("round_robin");
            builder.defaultServiceConfig(serviceConfig);

            final var channel = builder.build();
            final var service = new GrpcKeys( getLogger(GrpcKeys.class), newBlockingStub(channel) );
            cleaner.register(service,channel::shutdownNow);
            return service;
        }
        catch (Exception e)
        {
            throw new KeysException("not expected",e);
        }
    }
}

and the JUnit tests as above.

tajila commented 2 weeks ago

@babsingh Please take a look at this

babsingh commented 2 weeks ago

@pedrolamarao Can you provide a micro-test that I can run locally to reproduce the failure? It's needed to generate artifacts for introspecting JVM behavior.

babsingh commented 2 weeks ago

Below is the testcase that I derived from the above information. I am unable to reproduce the issue.

// Filename: Test2.java

import java.lang.ref.Cleaner;
import java.lang.ref.Reference;

public class Test2 {
        public Object getObject() {
                Cleaner cleaner = Cleaner.create();
                var obj = new String("a new string");
                Cleaner.Cleanable c = cleaner.register(obj, () -> { throw new RuntimeException("object cleaned"); } );
                return obj;
        }

        public static void main(String[] args) throws Throwable {
                Test2 test = new Test2();

                final var obj = test.getObject();

                System.out.println(obj);

                /* Invoke GC. */
                for (int i = 0; i < 100; i++) {
                        System.gc();
                        Thread.sleep(100);
                }

                System.out.println(obj);
        }
}
babsingh commented 2 weeks ago

Assuming that KeysException is thrown when the test fails. JVM option: -Xdump:java+system+heap:events=throw,filter=*/KeysException will generate the required artifacts. Here is the doc for -Xdump for troubleshooting.

@pedrolamarao

pedrolamarao commented 1 week ago

The test is new; the Semeru JDK was not recently updated.

openjdk version "17.0.9" 2023-10-17
IBM Semeru Runtime Open Edition 17.0.9.0 (build 17.0.9+9)
Eclipse OpenJ9 VM 17.0.9.0 (build openj9-0.41.0, JRE 17 Windows 11 amd64-64-Bit Compressed References 20231017_568 (JIT enabled, AOT enabled)
OpenJ9   - 461bf3c70
OMR      - 5eee6ad9d
JCL      - 3699725139c based on jdk-17.0.9+9)

I have attached the -Xdump for gRPC's exception StatusRuntimeException which seems to me the interesting time point. I have attached only the small text files. If you require the large binary files, please advise how to share them.

javacore.20241001.151523.11480.0003.txt javacore.20241001.151527.11480.0006.txt javacore.20241001.151528.11480.0009.txt

babsingh commented 1 week ago

@pedrolamarao

pedrolamarao commented 1 week ago

Yes, it reproduces with version 17.0.12+7. I have uploaded -Xdump files to the upload link provided.

Is -Xdump capable of breaking on arbitrary program points? Could I set it to dump on a specific file/line location?

I may find time to try a minimal reproducer, but not soon.

babsingh commented 1 week ago

Thanks for the core files. I will see if they pin-point the root cause.

-Xdump will need to be used in conjunction with jdb to produce core files at specific file/line locations. The below steps highlight this process.

Could I set it to dump on a specific file/line location?

You can set breakpoints through jdb.

Is -Xdump capable of breaking on arbitrary program points?

babsingh commented 1 week ago

@pedrolamarao Does the failure go away if we change the GC policy to -Xgcpolicy:nogc and -Xgcpolicy:optthruput? The default GC policy is gencon. More GC policies are here: https://eclipse.dev/openj9/docs/xgcpolicy.

pedrolamarao commented 1 week ago

Test results with various gcpolicy:

babsingh commented 1 week ago

@pedrolamarao Would it be possible for you to invoke System.gc() both before and after ClientCalls.getUnchecked (ClientCalls.java:249) and run the tests using Adoptium and Liberica JDKs? OpenJ9's garbage collection is known to be more aggressive compared to OpenJDK/Hotspot, particularly when it comes to object collection.

Since there might not be any active references to the object while the instance method is still executing, the object could potentially be eligible for collection by the GC. If manually invoking the garbage collector (via System.gc()) with Adoptium and Liberica JDKs results in the same issue, it could indicate that the object is indeed unreachable.

However, if this issue does not occur with these JDKs, we may need to delve deeper into OpenJ9. The core files only show the final state and don't provide insight into how the JVM reached that state. Further investigation through more active debugging (e.g., logging) may be required, though this approach is typically more tedious and time-consuming. That said, having a micro test that consistently reproduces the issue would significantly ease the process and help narrow down the root cause more quickly.

4XESTACKTRACE                at io/grpc/stub/ClientCalls.getUnchecked(ClientCalls.java:249) <--- Objects are UNREACHABLE
4XESTACKTRACE                at io/grpc/stub/ClientCalls.blockingUnaryCall(ClientCalls.java:167)
4XESTACKTRACE                at dev/prodist/keys/services/KeysGrpc$KeysBlockingStub.exists(KeysGrpc.java:879)
4XESTACKTRACE                at dev/prodist/keys/grpc/GrpcKeys.exists(GrpcKeys.java:73)
4XESTACKTRACE                at dev/prodist/keys/grpc/GrpcConfigurationTest.smoke(GrpcConfigurationTest.java:32)
pedrolamarao commented 1 week ago

We have a reproducer here: https://github.com/pedrolamarao/semeru-issue-93

With JAVA_HOME set to this:

PS P:\dev\semeru-issue-93> P:\jdk-17.0.12+7\bin\java -version
openjdk version "17.0.12" 2024-07-16
IBM Semeru Runtime Open Edition 17.0.12.1 (build 17.0.12+7)
Eclipse OpenJ9 VM 17.0.12.1 (build openj9-0.46.1, JRE 17 Windows 11 amd64-64-Bit Compressed References 20240716_793 (JIT enabled, AOT enabled)
OpenJ9   - 4760d5d320
OMR      - 840a9adba
JCL      - e70fd1fd64a based on jdk-17.0.12+7)

running ./gradlew test on this project fails.

babsingh commented 6 days ago

@pedrolamarao I am getting the below errors. Am I missing steps or dependencies on my end?

Starting a Gradle Daemon, 1 incompatible and 1 stopped Daemons could not be reused, use --status for details
> Task :grpc:extractIncludeProto FAILED

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':grpc:extractIncludeProto'.
> Could not resolve all files for configuration ':grpc:compileProtoPath'.
   > Cannot resolve external dependency javax.annotation:javax.annotation-api:1.3.2 because no repositories are defined.
     Required by:
         project :grpc
   > Cannot resolve external dependency com.google.protobuf:protobuf-java:4.26.1 because no repositories are defined.
     Required by:
         project :grpc
   > Cannot resolve external dependency io.grpc:grpc-core:1.63.0 because no repositories are defined.
     Required by:
         project :grpc
   > Cannot resolve external dependency io.grpc:grpc-protobuf:1.63.0 because no repositories are defined.
     Required by:
         project :grpc
   > Cannot resolve external dependency io.grpc:grpc-services:1.63.0 because no repositories are defined.
     Required by:
         project :grpc
   > Cannot resolve external dependency io.grpc:grpc-stub:1.63.0 because no repositories are defined.
     Required by:
         project :grpc

* Try:
> Run with --stacktrace option to get the stack trace.
> Run with --info or --debug option to get more log output.
> Run with --scan to get full insights.
> Get more help at https://help.gradle.org.

BUILD FAILED in 12s
3 actionable tasks: 3 executed
pedrolamarao commented 6 days ago

Sorry, repositories must be coming from environment configuration. I have added the necessary repository configuration to Gradle scripts, and validated in a separate environment.