oracle / graal

GraalVM compiles Java applications into native executables that start instantly, scale fast, and use fewer compute resources 🚀
https://www.graalvm.org
Other
20.38k stars 1.64k forks source link

LinkedHashMap memory leak #3260

Closed eduveks closed 3 years ago

eduveks commented 3 years ago

Hello,

I'm facing memory leak of many LinkedHashMap with GraalVM.

This behaviour is not new because in the last years I am using GraalVM always seeing memory issues, but now is more critical because we are using in situations with much more stress overall.

Anyway GraalVM is awesome, thank you a lot by all your work. 👏

Hope this report can help to improve.

Screenshot 2021-03-06 at 16 27 49

Screenshot 2021-03-06 at 16 28 09

Describe GraalVM and your environment :

eduveks commented 3 years ago

After more 20 minutes looks like this:

Screenshot 2021-03-06 at 16 48 04

Screenshot 2021-03-06 at 16 48 19

munishchouhan commented 3 years ago

@eduveks thanks for reporting the issue we will check it and get back to you

chumer commented 3 years ago

@eduveks thanks for the report!

Are you reloading classes e.g. using a custom host class loader or some application server? It looks to me as if our host member/method lookup cache seems to be missed a lot.

If this is true then the current behavior might be not ideal but is expected. We currently do reference bound host methods strongly. A workaround is to recreate the Engine (or Context if you don't use an explicit Engine) to refresh that cache. Note that this also refreshes all the compiled code. If this turns out to be the case, then I could have another look on improving this, if you provide a small example of your usage pattern.

If you are not reloading classes in your host application then it would be immensely helpful if you could provide a sample application that reproduces this leak reliably or, less preferrably, a heap dump (hprof) file of the leak.

Thanks!

eduveks commented 3 years ago

Hello @chumer,

I did another round of stress test, with screenshots below and here the hprof file made a few minutes ago:

Yes, it is an application server, the Netuno platform, with a Jetty server inside and runs over GraalVM.

You can check out Netuno install is straightforward, only need to execute one command in the terminal.

The installation will download and install GraalVM too inside the netuno folder.

Then here an example of how to create services.

NodeJS inside GraalVM execute the javascript server-side (services), then the script code like _out.json, _db.query and much more, they will call Java methods.

Each request executes multiples scripts (JavaScript) over the same Class that initializes the HostAccess, ContextBuilder, and then the Context, but in the same request may create more than one Context. But always all Contexts initialized will be closed in the end of the request with context.leave() and context.close(true).

Multi-threads do not use the same HostAccess and ContextBuilder, each thread/request has your own org.graalvm.polyglot.* instances.

If you need more details just say.

Thank you!

Screenshot 2021-03-30 at 17 04 50

chumer commented 3 years ago

I downloaded the heap dump and computed the retained size and the retained size looks like this:

Screenshot 2021-04-06 at 15 19 41

So it looks like if all of the references can actually be freed. So I don't think there is anything leaking in that heap dump. Why you are seeing the GC pressure that you are seeing, I cannot really explain. Did you try forcing a full GC with System.gc form time to time?

Each request executes multiples scripts (JavaScript) over the same Class that initializes the HostAccess, ContextBuilder, and then the Context, but in the same request may create more than one Context. But always all Contexts initialized will be closed in the end of the request with context.leave() and context.close(true).

Two remarks here:

[1] https://www.graalvm.org/reference-manual/embed-languages/#code-caching-across-multiple-contexts

eduveks commented 3 years ago

I think that I already tested the implementation with code caching across multiple contexts in the past.

But yesterday, I did some tests changing the implementation to supports the same Engine across all threads and with one Engine per thread.

The first impression in both cases was memory ate fast and very slow performance, scripts executed before in 300 milliseconds jumped to 20 seconds, CPU with high usage too.

I will try to do more tests within the next few days to try to give more details.

Yes, periodically, we force the garbage collector with:

System.gc();
System.runFinalization();

Sample of the first results with the Engine:

Screenshot 2021-04-06 at 22 47 17

chumer commented 3 years ago

Any chance you can give me something runnable that reproduces this problem?

eduveks commented 3 years ago

Any chance you can give me something runnable that reproduces this problem?

Yes, I think. I don't have time today, but I will give you more news in the next few days.

thurka commented 3 years ago

I think I know what is going on here. I was investigating, why there are so many instances of java.lang.reflect.Method in the heap dump. I tracked almost all the references to various instances of java.lang.ClassValue$ClassValueMap and those instances does not have any references so it looks like they can be collected by GC, but instances of java.lang.ClassValue$ClassValueMap are references from java.lang.Class.classValueMap field so they are held by the particular class, but this information is not dumped to hprof file. :-( The question is why java.lang.ClassValue$ClassValueMap holds so many java.lang.reflect.Method. I don't know. If you look at the source code of ClassValueMap, there is some caching, so it might be one reason. I also saw this: https://stackoverflow.com/questions/62593886/java-memory-leak-why-are-unreachable-objects-not-collected This looks very similar to our case and in addition it uses scripting API to call javascript. I have netuno server running with some demo application and just randomly clicking pages, reloading pages etc., I am able to see steady rise of instances of Method.

thurka commented 3 years ago

This is example of references from Method to ClassValueMap.

refs
thurka commented 3 years ago

Small heap dump for investigation: GH-3260_heapdump.zip .

thurka commented 3 years ago

Steps to reproduce:

  1. get off Cisco VPN
  2. set JAVA_HOME to JDK 11 (installation below needs nashorn, so it will not work on JDK 15+)
  3. install netuno in /tmp
    thurka$ cd /tmp
    thurka$ mkdir -p netuno && cd netuno && curl https://www.netuno.org/netuno.jar > netuno.jar && java -jar netuno.jar install
  4. add node to PATH
    thurka$ export PATH=/tmp/netuno/graalvm/bin/:$PATH
  5. start netuno server with the default application via ./netuno server
  6. fill in email address
  7. it will open web browser in Portuguese locale(???), you have to manually register the netuno server
  8. it will take some time, since NPM Install is running for the first time
  9. play with application
  10. start VisualVM and open notuno process
  11. switch to Sampler tab and select Memory sampling
  12. sort data by Live Object
  13. look for number of java.lang.reflect.Method instances
  14. the number is steadily climbing while I play with netuno app in browser
  15. you can use Show delta values button to see deltas in live object count
  16. you can invoke GC directly from Memory sampler view to free unreferenced instances
chumer commented 3 years ago

I did not yet run your code yet, but I think I found a problematic pattern in our code:

public class Test {

    /*
     * Change to false to avoid the bug.
     */
    private static boolean CYCLIC_REFERENCE = true;

    private static class HeavyValue {

        final Context context;
        final byte[] bytes = new byte[1024 * 16];

        HeavyValue(Context context) {
            // the context assignment
            if (CYCLIC_REFERENCE) {
                this.context = context;
            } else {
                this.context = null;
            }
        }

    }

    private static class Context {

        final ClassValue<HeavyValue> values = new ClassValue<Test.HeavyValue>() {
            @Override
            protected HeavyValue computeValue(Class<?> type) {
                return new HeavyValue(Context.this);
            }
        };

    }

    static class TestClass1 {

        }
    static class TestClass2 {

        }
    static class TestClass3 {

    }
    static class TestClass4 {

    }
    static class TestClass5 {

    }
    static class TestClass6 {

    }

    private static final Class<?>[] TEST_CLASSES = new Class[] {
            TestClass1.class,
            TestClass2.class,
            TestClass3.class,
            TestClass4.class,
            TestClass5.class,
            TestClass6.class
    };

    private static int test() {
        Context context =  new Context();

        int hash = 31;
        for (Class<?> testClass : TEST_CLASSES) {
            hash += context.values.get(testClass).bytes.length; 
        }
        return hash;
    }

    public static void main(String[] args) {
        int hash = 31;
        for (int i = 0; i < 100000; i++) {
            hash += test();
        }
        System.out.println(hash);

    }

}

Switching that cyclic reference flag to false fixes the problem (tested on latest JDK 8 on Mac). But this should really not happen, as such cyclic references should be resolvable by the GC. I will double check and maybe file a JDK bug. We probably need to deploy a workaround for this.

I need to do some more digging to find out what is going on.

chumer commented 3 years ago

Turns out this is a known issue: https://bugs.openjdk.java.net/browse/JDK-8169425

thurka commented 3 years ago

It looks like every page reload in the web app adds several instances of com.oracle.truffle.polyglot.HostClassCache.

chumer commented 3 years ago

I could reproduce it with a simple example now locally. It looks like there is a simple workaround. You need to reuse the same HostAccess instance for all contexts. I am working on a fix that avoids a the Cyclic dependency.

chumer commented 3 years ago

Fix is in the pipeline. Will backport this to 21.1 and probably also 20.3 (need to double check)

eduveks commented 3 years ago

Thank you, @thurka, for your inputs, and I'm happy you could found the issue with Netuno.

@chumer the workaround reusing the HostAccess across all contexts does not affect overall performance, but in this case, the memory leak still happens. See the print screen below:

Screenshot 2021-04-09 at 12 26 00 Screenshot 2021-04-09 at 12 26 10

To complement the @thurka tip to run locally, if install and start Netuno just with ./netuno server based on the default configs will run the demo app, then it is possible to execute Apache BenchMark (web stress) over one of these services:

ab -n 100 -c 5 http://127.0.0.1:9000/services/samples/javascript/print-template ab -n 100 -c 5 http://127.0.0.1:9000/services/samples/javascript/date-format ab -n 100 -c 5 http://127.0.0.1:9000/services/samples/javascript/query-result

The first one, print-template, seems to provoke a higher peak.

This test with Apache BenchMark and the demo application are very different from all tests I had reported before, but I think it is over the same internal issue.

My results with Apache BenchMark executed over 5 times:

Screenshot 2021-04-09 at 13 50 20

Screenshot 2021-04-09 at 13 50 33

chumer commented 3 years ago

@eduveks This is really weird, in my local tests using the same HostAccess instance for all contexts resolved the problem. As the cache is bound and reused for HostAccess instances. So it should be impossible to have the same method loaded twice. Are you sure you are not just building an equal HostAccess instance? To make sure, can you read the HostAccess instance from a static final? Maybe you can drop a snippet how you are doing that atm, so I can plug it into my test?

eduveks commented 3 years ago

@chumer I did another test to ensure and is the same behavior with HostAccess as static.

This is the current full code of the integration with GraalVM:

package org.netuno.psamata.script;

import org.graalvm.polyglot.*;
import org.graalvm.polyglot.Context.Builder;

import java.util.*;

public class GraalRunner {
    private static boolean graal = false;

    //private static Engine engine = Engine.create();

    private static HostAccess hostAccess = null;

    private Builder contextBuilder = null;

    private Context context = null;

    private List<Context> oldContexts = new ArrayList<>();

    static {
        try {
            Context.create().close();

            hostAccess = HostAccess.newBuilder()
                .allowPublicAccess(true)
                .targetTypeMapping(
                        Value.class, Object.class,
                        (v) -> v.hasArrayElements(),
                        (v) -> transformArray(v)
                ).targetTypeMapping(
                        Value.class, List.class,
                        (v) -> v.hasArrayElements(),
                        (v) -> transformArray(v)
                ).targetTypeMapping(
                        Value.class, Collection.class,
                        (v) -> v.hasArrayElements(),
                        (v) -> transformArray(v)
                ).targetTypeMapping(
                        Value.class, Iterable.class,
                        (v) -> v.hasArrayElements(),
                        (v) -> transformArray(v)
                )
                .build();

            graal = true;
        } catch (Throwable t) {
            graal = false;
        }
    }

    public GraalRunner(String... permittedLanguages) {
        init(null, permittedLanguages);
    }

    public GraalRunner(Map<String, String> options, String... permittedLanguages) {
        init(options, permittedLanguages);
    }

    private void init(Map<String, String> options, String... permittedLanguages) {
        //engine = Engine.create();

        contextBuilder = Context.newBuilder(permittedLanguages)
                .allowExperimentalOptions(true)
                .allowIO(true)
                .allowAllAccess(true)
                .allowCreateThread(true)
                .allowHostAccess(hostAccess);

        //https://www.graalvm.org/truffle/javadoc/org/graalvm/polyglot/ResourceLimits.html
        //contextBuilder.resourceLimits(new ResourceLimits())

        if (options != null) {
            contextBuilder.options(options);
        }

        //contextBuilder.engine(engine);

        context = contextBuilder.build();
    }

    public void resetContext() {
        if (context != null) {
            oldContexts.add(context);
        }
        context = contextBuilder.build();
    }

    public void close() {
        for (Context oldContext : oldContexts) {
            try {
                oldContext.leave();
            } catch (Exception e) { }
            oldContext.close();
        }
        oldContexts.clear();
        oldContexts = null;
        try {
            context.leave();
        } catch (Exception e) { }
        context.close();
        context = null;
        //engine.close();
    }

    public static boolean isGraal() {
        return graal;
    }

    public GraalRunner set(String language, String var, Object value) {
        context.getBindings(language).putMember(var, value);
        return this;
    }

    public Set<String> keys(String language) {
        return context.getBindings(language).getMemberKeys();
    }

    public Object get(String language, String var) {
        return toObject(context.getBindings(language).getMember(var));
    }

    public boolean getBoolean(String language, String var) {
        return context.getBindings(language).getMember(var).asBoolean();
    }

    public byte getByte(String language, String var) {
        return context.getBindings(language).getMember(var).asByte();
    }

    public double getDouble(String language, String var) {
        return context.getBindings(language).getMember(var).asDouble();
    }

    public float getFloat(String language, String var) {
        return context.getBindings(language).getMember(var).asFloat();
    }

    public int getInt(String language, String var) {
        return context.getBindings(language).getMember(var).asInt();
    }

    public long getLong(String language, String var) {
        return context.getBindings(language).getMember(var).asLong();
    }

    public int getShort(String language, String var) {
        return context.getBindings(language).getMember(var).asShort();
    }

    public String getString(String language, String var) {
        return context.getBindings(language).getMember(var).asString();
    }

    public Value eval(String language, String code) {
        try {
            context.enter();
        } catch (Exception e) { }
        try {
            return context.eval(language, code);
        } finally {
            try {
                context.leave();
            } catch (Exception e) { }
        }
    }

    public static Map transformMembers(Value v) {
        Map map = new HashMap();
        for (String key : v.getMemberKeys()) {
            Value member = v.getMember(key);
            if (member.hasArrayElements() && !member.isHostObject()) {
                map.put(key, transformArray(member));
            } else if (member.hasMembers() && !member.isHostObject()) {
                map.put(key, transformMembers(member));
            } else {
                map.put(key, toObject(member));
            }
        }
        return map;
    }

    public static List transformArray(Value v) {
        List list = new ArrayList();
        for (int i = 0; i < v.getArraySize(); ++i) {
            Value element = v.getArrayElement(i);
            if (element.hasArrayElements() && !element.isHostObject()) {
                list.add(transformArray(element));
            } else if (element.hasMembers() && !element.isHostObject()) {
                list.add(transformMembers(element));
            } else {
                list.add(toObject(element));
            }
        }
        return list;
    }

    public static Object toObject(Value v) {
        if (v.isNull()) {
            return null;
        } else if (v.isHostObject()) {
            return v.asHostObject();
        } else if (v.isProxyObject()) {
            return v.asProxyObject();
        } else if (v.isBoolean()) {
            return v.asBoolean();
        } else if (v.isNumber()) {
            if (v.fitsInByte()) {
                return v.asByte();
            }
            if (v.fitsInShort()) {
                return v.asShort();
            }
            if (v.fitsInInt()) {
                return v.asInt();
            }
            if (v.fitsInLong()) {
                return v.asLong();
            }
            if (v.fitsInFloat()) {
                return v.asFloat();
            }
            if (v.fitsInDouble()) {
                return v.asDouble();
            }
        } else if (v.isString()) {
            return v.asString();
        } else {
            Value value = v.getMetaObject();
            if (value.fitsInByte()) {
                return value.asByte();
            }
            if (value.fitsInShort()) {
                return value.asShort();
            }
            if (value.fitsInInt()) {
                return value.asInt();
            }
            if (value.fitsInLong()) {
                return value.asLong();
            }
            if (value.fitsInFloat()) {
                return value.asFloat();
            }
            if (value.fitsInDouble()) {
                return value.asDouble();
            }
            try {
                return value.asBoolean();
            } catch (Exception e) {
                try {
                    return value.asString();
                } catch (Exception ex) {
                    return null;
                }
            }
        }
        return null;
    }
}
chumer commented 3 years ago

Any reason why you are doing:

try {
            context.enter();
        } catch (Exception e) { }
        try {
            return context.eval(language, code);
        } finally {
            try {
                context.leave();
            } catch (Exception e) { }
        }

With enter/leave? Any context operation should do an implicit enter/leave. In this case doing the enter and leave explicitely is actually slower.

Missing a leave on a particular thread can also potentially cause another leak. Is there any multi-threading happening for a GraalRunner instance?

eduveks commented 3 years ago

@chumer, my apologies, I detect a maven malfunction that did not change my code version in all of my tests. 🤦‍♂️

Finally, it is correct. Your workaround sharing the HostAccess works very well!

Sorry for my mistake.

This is the result of many stress, my hardest test, rolling smooth: 🥳

Screenshot 2021-04-12 at 17 50 20

Thank you very much! 🍻

eduveks commented 3 years ago

To clarify, I commented all context.enter() and context.leave(). I do not remember why but I think it was an attempt to avoid the leak.

Yes, just one instance of the GraalRunner per thread always.

And again, thank you for your support.

eduveks commented 3 years ago

With this, I remembered making another test sharing the same Engine with all threads, as @chumer said.

And the performance jumped to another level, much less memory and CPU consumption overall, even with a lot of stress is like nothing is going on.

Then the key is to share Engine and HostAccess over all threads. 🚀

Finally, the killer result:

Screenshot 2021-04-12 at 19 08 40

chumer commented 3 years ago

Happy to hear it all works out now as intended! :-)