oracle / graal

GraalVM compiles Java applications into native executables that start instantly, scale fast, and use fewer compute resources 🚀
https://www.graalvm.org
Other
20.02k stars 1.6k forks source link

Class initialization results are unstable #3127

Closed ziyilin closed 6 months ago

ziyilin commented 3 years ago

Describe the issue Class initialization results printed during native image generation are unstable. Not sure if it's initialization issue or reporting issue.

Steps to reproduce the issue Please include both build steps as well as run steps

  1. Get a Spring Boot demo project from https://start.spring.io/#!type=maven-project&language=java&platformVersion=2.4.2.RELEASE&packaging=jar&jvmVersion=11&groupId=com.example&artifactId=demo&name=demo&description=Demo%20project%20for%20Spring%20Boot&packageName=com.example.demo
  2. Click the "Generate" button on the page to download the zipped project.
  3. Unzip the downloaded file and run the following script in the directory. Suppose the GRAALVM_HOME is set to a graalvm compiled with JDK11 and the JAVA_HOME is JDK11.
    #!/bin/bash
    mvn install
    echo "BUILD_TIME  RUN_TIME"
    for i in {0..9}
    do
    $GRAALVM_HOME/bin/native-image --report-unsupported-elements-at-runtime -H:+AllowIncompleteClasspath -H:FallbackThreshold=0 -H:+PrintClassInitialization -H:+ExitAfterAnalysis -jar target/demo-0.0.1-SNAPSHOT.jar | egrep "\s[0-9]*\s" -o | tr "\n" " " | awk  '{print "  "$2"       "$4}'
    done

    The script statically analyzes the demo with reporting the class initialization 10 times. The result is printed out like:

    BUILD_TIME  RUN_TIME
    3411       17
    3411       17
    4912       190
    3421       17
    4585       190
    4585       190
    4922       190
    3421       17
    3759       17
    3759       17

    This result shows the BUILD_TIME and RUN_TIME initialization classes are not stable.

Describe GraalVM and your environment:

ziyilin commented 3 years ago

Update: I found many differences are related with java security. I tracked one of the unstable class org.xml.sax.helpers.DefaultHandler and found the cause is the typeflow state is changing from builds. Adding the following debug code:

diff --git a/substratevm/src/com.oracle.graal.pointsto/src/com/oracle/graal/pointsto/flow/TypeFlow.java b/substratevm/src/com.oracle.graal.pointsto/src/com/oracle/graal/pointsto/flow/TypeFlow.java
index 1e6061b47d1..06216fa7d19 100644
--- a/substratevm/src/com.oracle.graal.pointsto/src/com/oracle/graal/pointsto/flow/TypeFlow.java
+++ b/substratevm/src/com.oracle.graal.pointsto/src/com/oracle/graal/pointsto/flow/TypeFlow.java
@@ -466,6 +466,9 @@ public abstract class TypeFlow<T> {

     /** Let the observers know that the state has changed. */
     protected void notifyObservers(BigBang bb) {
+        if(this instanceof  NullCheckTypeFlow && source != null && source.toString().equals("at java.security.SecureRandom.getDefaultPRNG(SecureRandom.java:290) [bci: 125]")){
+            System.out.println("states is " + state);
+        }
         for (TypeFlow<?> observer : getObservers()) {
             observer.onObservedUpdate(bb);
         }

I can see the following printed out messages for different builds: Build1:

states is 1TypeMObject<[0x000002DA000002DB:!S:Provider$Service]>
states is MTypeMObject<2:[0x000002DA000002DB:!S:Provider$Service, 0x00000CA000000CA1:!S:SunEC$ProviderService]>
states is MTypeMObject<3:[0x000002DA000002DB:!S:Provider$Service, 0x00000CA000000CA1:!S:SunEC$ProviderService, 0x00000CD200000CD3:!S:SunPCSC$ProviderService]>

Build Time: 3421 classes Run Time: 17 classes

Build2:

states is MTypeMObject<6:[0x000002DA000002DB:!S:Provider$Service, 0x00000C9300000C94:!S:SunEC$ProviderService, 0x00000CF200000CF3:!S:JdkSASL$ProviderService, 0x00000CF300000CF4:!S:XMLDSigRI$ProviderService, 0x00000CF500000CF6:!S:SunProvider$ProviderService, 0x00000D6200000D63:!S:SunPKCS11$P11Service]>

Build Time: 3411 classes Run Time: 17 classes

Build3:

states is MTypeMObject<2:[0x000002DA000002DB:!S:Provider$Service, 0x00000C9300000C94:!S:SunEC$ProviderService]>
states is MTypeMObject<5:[0x000002DA000002DB:!S:Provider$Service, 0x00000C9300000C94:!S:SunEC$ProviderService, 0x00000CF200000CF3:!S:JdkSASL$ProviderService, 0x00000CF300000CF4:!S:XMLDSigRI$ProviderService, 0x00000CF500000CF6:!S:SunProvider$ProviderService]>
states is MTypeMObject<6:[0x000002DA000002DB:!S:Provider$Service, 0x00000C9300000C94:!S:SunEC$ProviderService, 0x00000CF200000CF3:!S:JdkSASL$ProviderService, 0x00000CF300000CF4:!S:XMLDSigRI$ProviderService, 0x00000CF500000CF6:!S:SunProvider$ProviderService, 0x00000D6200000D63:!S:SunPKCS11$P11Service]>

Build Time: 4912 classes Run Time: 190 classes

But still don't know why TypeFlow.state is variant.

ziyilin commented 3 years ago

I have found the reason. It's caused by PlatformClassLoader introduced since JDK9. When the demo is built with GraalVM compiled with JDK8, the initialized class numbers are stable.

 BUILD_TIME  RUN_TIME
  2644       16
  2644       16
  2644       16
  2644       16
  2644       16
  2644       16
  2644       16
  2644       16
  2644       16
  2644       16

The following discussion is based on GraalVM compiled with JDK11. The version information is

openjdk version "11.0.9" 2020-10-20
OpenJDK Runtime Environment GraalVM 21.1.0-dev (build 11.0.9+10-jvmci-21.0-b03)
OpenJDK 64-Bit Server VM GraalVM 21.1.0-dev (build 11.0.9+10-jvmci-21.0-b03, mixed mode, sharing)

Native-image reflectively calls sun.security.ssl.TrustStoreManager.getTrustedCerts in com.oracle.svm.core.jdk.TrustStoreManagerFeature.afterRegistration. It comes to load the "SunEC" provider with the followings stack trace:

load:338, ProviderConfig$ProviderLoader (sun.security.jca)
run:248, ProviderConfig$3 (sun.security.jca)
run:242, ProviderConfig$3 (sun.security.jca)
doPrivileged:-1, AccessController (java.security)
doLoadProvider:242, ProviderConfig (sun.security.jca)
getProvider:222, ProviderConfig (sun.security.jca)
getProvider:266, ProviderList (sun.security.jca)
getService:379, ProviderList (sun.security.jca)
getInstance:157, GetInstance (sun.security.jca)
getImpl:700, Security (java.security)
getInstance:158, AlgorithmParameters (java.security)
decodeParams:126, AlgorithmId (sun.security.x509)
<init>:119, AlgorithmId (sun.security.x509)
parse:393, AlgorithmId (sun.security.x509)
parse:168, X509Key (sun.security.x509)
<init>:75, CertificateX509Key (sun.security.x509)
parse:674, X509CertInfo (sun.security.x509)
<init>:169, X509CertInfo (sun.security.x509)
parse:1842, X509CertImpl (sun.security.x509)
<init>:194, X509CertImpl (sun.security.x509)
engineGenerateCertificate:105, X509Factory (sun.security.provider)
generateCertificate:355, CertificateFactory (java.security.cert)
engineLoad:764, JavaKeyStore (sun.security.provider)
engineLoad:243, KeyStoreDelegator (sun.security.util)
load:1479, KeyStore (java.security)
loadKeyStore:365, TrustStoreManager$TrustAnchorManager (sun.security.ssl)
getTrustedCerts:313, TrustStoreManager$TrustAnchorManager (sun.security.ssl)
getTrustedCerts:55, TrustStoreManager (sun.security.ssl)
invoke0:-1, NativeMethodAccessorImpl (jdk.internal.reflect)
invoke:62, NativeMethodAccessorImpl (jdk.internal.reflect)
invoke:43, DelegatingMethodAccessorImpl (jdk.internal.reflect)
invoke:566, Method (java.lang.reflect)
afterRegistration:65, TrustStoreManagerFeature (com.oracle.svm.core.jdk)

The Provider.load method looks like:

public Provider load(String pn) {
  if (debug != null) {
        debug.println("Attempt to load " + pn + " using SL");
    }
    Iterator<Provider> iter = services.iterator();
    while (iter.hasNext()) {
        try {
            Provider p = iter.next();
            String pName = p.getName();
            if (debug != null) {
                debug.println("Found SL Provider named " + pName);
            }
            if (pName.equals(pn)) {
                return p;
            }
        } 
...

pn here is "SunEC", so the iterator continues to get the next provider until "SunEC". In other word, the iteration stops whenever "SunEC" is found. iter.hasNext() not only checks the next item in iterator, but also initializes and loads the corresponding Provider class. See the source code of actual hasNext method:

public boolean hasNext() {
    while (nextProvider == null && nextError == null) {
        // get next provider to load
        while (!iterator.hasNext()) {
            if (currentLoader == null) {
                return false;
            } else {
                currentLoader = currentLoader.getParent();
                iterator = iteratorFor(currentLoader);
            }
        }

        // attempt to load provider
        ServiceProvider provider = iterator.next();
        try {
            @SuppressWarnings("unchecked")
            Provider<T> next = (Provider<T>) loadProvider(provider);
            nextProvider = next;
        } catch (ServiceConfigurationError e) {
            nextError = e;
        }
    }
    return true;
}

It is called with the following stack trace since the top frame of the previous stack trace:

hasNext:1076, ServiceLoader$ModuleServicesLookupIterator (java.util)
hasNext:1299, ServiceLoader$2 (java.util)
hasNext:1384, ServiceLoader$3 (java.util)
load:338, ProviderConfig$ProviderLoader (sun.security.jca)

The iterator is an instance of java.util.concurrent.CopyOnWriteArrayList.COWIterator with contents of a list of 6 java security providers:

sun.security.jgss.SunProvider
org.jcp.xml.dsig.internal.dom.XMLDSigRI
com.sun.security.sasl.gsskerb.JdkSASL
sun.security.pkcs11.SunPKCS11
sun.security.smartcardio.SunPCSC
sun.security.ec.SunEC

This list is set at PlatformClassloader initialization time and stored in ClassLoader.classLoaderValueMap field. But the list sequence is variant. I didn't find the initialization code and don't know the exact reason. But the variant of the sequence leading to the variant of reached security classes. When sun.security.ec.SunEC is at the head of this list, no other providers will be reached. But when it is at the tail of the list, all the other 5 providers are reached. The actually affected classes are not only these 5 Providers. Many other classes are reached by initializing and loading each provider.

Get back to Substrate VM's method com.oracle.svm.core.jdk.TrustStoreManagerFeature.afterRegistration. https://github.com/oracle/graal/blob/cf4cf49c7858b367cadfe08b534eccfd08d6b197/substratevm/src/com.oracle.svm.core/src/com/oracle/svm/core/jdk/Target_sun_security_ssl_TrustStoreManager.java#L80 Although XMLDSigRI is explicitly initialize at build time, the class may not actually be reached and got initialized.

oubidar-Abderrahim commented 3 years ago

Issue tracked internally on GR-28835

marjanasolajic commented 3 years ago

I'm currently investigating this. I noticed the same problem when I run benchmarks that are using security providers. For now, if we want to use JDK11 and have stable results, we can add native-image flag -H:+EnableAllSecurityServices. This is certainly not a permanent solution, but it works.

@ziyilin I ran your demo with this and got stable results for every run, 5388 for BUILD_TIME and 196 for RUN_TIME.

ziyilin commented 3 years ago

Thanks, good to know that. I think -H:+EnableAllSecurityServices option initializes all security providers so it becomes stable, and the initialized classes number also grows.

marjanasolajic commented 3 years ago

Yes, that's the downside of using this flag, but at least we have stable numbers if we are testing something and need to compare between runs.

oubidar-Abderrahim commented 6 months ago

This should be fixed