jakartaee / expression-language

Jakarta Expression Language
https://eclipse.org/ee4j/el
Other
70 stars 50 forks source link

bugfix: added clearProperties(ClassLoader) method #215

Closed lprimak closed 2 months ago

lprimak commented 5 months ago

to BeanELResolver to help remove ClassLoader leaks

fixes #214

Relates to https://github.com/payara/Payara/pull/6677

  /**
     * Clears properties cache by class loader.
     * To be called from the Container to clear the cache when ClassLoader is no longer in use
     * <p>
     * Even though {@link #properties} map is using {@link SoftReference} to store
     * its cache content, GC may not clean the cache on its own. This is because
     * the cache stores Class types, which hold a reference to its ClassLoader,
     * which in turn is much heavier than the garbage collector believes.
     * There may not be enough memory pressure to remove the Class element,
     * and GC may not be able to clear the ClassLoader without help from this method.
     *
     * @param classLoader
     */
markt-asf commented 5 months ago

You are going to need to provide a test case that demonstrates a memory leak. And once we have such a test case the fix will likely be to address the memory leak rather than adding a method that clears the cache.

lprimak commented 5 months ago

Screen Shot 2024-05-02 at 3 49 44 PM

lprimak commented 5 months ago

Hi, Mark,

I modified the description to show exactly what's happening. Also uploaded screenshot from VisualVM. It's clear that GC isn't collecting ClassLoaders while they are in the cache. The only option is to clear the cache.

markt-asf commented 5 months ago

I'm still not seeing the evidence of a memory leak. Soft references are GC eligible. For a memory leak I'd expect to see a chain of hard references from the leaked object to a GC root.

lprimak commented 5 months ago

SoftConcurrentHashMap.map contains Class<?> ask key, which in turn has a hard reference to the ClassLoader BeanProperties also has a hard reference to Class<?>

I have done a number of tests and the method in the PR is the minimum necessary to avoid the memory leak.

lprimak commented 4 months ago

@markt-asf Any thoughts please? Thank you

markt-asf commented 4 months ago

There is no memory leak here (confirmed with a profiler). There is nothing to fix.

Absent a test case demonstrating an actual memory leak, this PR and the associated issue will be closed.

lprimak commented 4 months ago

Hi, Mark,

The memory leak is clearly demonstrated by the VisualVM screenshot and my comments above. I spent 10s of volunteer hours tracking this down, fixing it and verifying the fix works. Is there some downside to this code that I am not seeing? The "worst-case outcome" of this fix is that it does nothing. Isn't that a good enough reason to merge this in?

This way, if I am right about this, the fix is already there :)

Any help convincing Mark would be appreciated @smillidge @arjantijms @edburns @ivargrimstad Thank you

lprimak commented 4 months ago

Hi, Mark,

@edburns would like me to produce a video with the memory leak. Would that be helpful? Do you have any questions or suggestions? Is there anything above that's not clearly demonstrating the memory leak?

Thank you

markt-asf commented 4 months ago

I would very much prefer a test case that demonstrates a leak - a leak being a hard reference to an object that isn't eligible for GC that should be eligible for GC.

The information provided so far only shows a reference chain that contains a soft reference to the alleged leaking class. That is not a memory leak.

Based on the information provided so far, I have been unable to recreate a memory leak via use of the BeanELResolver.

My experience of videos (mainly in security vulnerability reports) is not good. The issues are:

You are of course free to provide a video if you wish and it will get looked at. However, I continue to believe the most effective way for you to demonstrate that there is an issue is to provide a test case that demonstrates the leak. I appreciate that creating unit tests for memory leaks is not always easy. The simplest possible web application (with source code) that demonstrates a leak that can be observed in a profiler would be sufficient. Given that I don't see how you can provide a video without first writing such a test case, my recommendation continues to be "provide the test case".

Further:

The proposed fixed is highly unlikely to be applied in any circumstances. Absent evidence of a memory leak, cruft is not going to be added to the API "just in case". If evidence of a memory leak is provided, then the fix is almost certainly going to be to fix the actual leak - not to provide a hack to work-around the leak.

scottmarlow commented 4 months ago

@markt-asf Would a Java process memory dump (e.g. hprof file) help demonstrate the leak? I personally prefer a memory dump over a picture or a video.

lprimak commented 4 months ago

@markt-asf Looks like you have tried to reproduce the memory leak yourself. Can you put that code up so I can take a look? Maybe I can modify it to demo the leak rather than start from scratch? Thank you

lprimak commented 4 months ago

Also, you claim that all references are soft. static private class SoftConcurrentHashMap extends ConcurrentHashMap<Class<?>, BeanProperties> { the above line has a hard key to Class<?> Why do you think this will be garbage collected?

markt-asf commented 4 months ago

Why do you think this will be garbage collected?

Because I have read the code - particularly the cleanup() method and when it is called - and tested that it behaves as expected and clears both the key and the value from the cache.

The test code I have been using:

package org.apache.markt;

import jakarta.el.BeanELResolver;
import jakarta.el.ELContext;
import jakarta.el.ELResolver;
import jakarta.el.FunctionMapper;
import jakarta.el.VariableMapper;

public class TestBeanELResolver {

    /*
     * Need to run this with the following:
     * -Xmx16m
     * -XX:SoftRefLRUPolicyMSPerMB=100
     */
    public static void main(String[] args) throws Exception {
        @SuppressWarnings("unused")
        Object o = new Object();

        // Create the BeanELResolver instance to be tested
        BeanELResolver r = new BeanELResolver();

        // Use the resolver with TesterBeanA. This adds TesterBeanA to BeanELResolver.properties
        resolveA(r);

        /*
         * Now sleep for longer than heap size in MB (16) * SoftRefLRUPolicyMSPerMB (100). This ensures that, when there
         * is memory pressure, any soft references that exist at the moment - and specifically the ones in
         * BeanELResolver.properties - will be cleared.
         */
        Thread.sleep(2000);

        /*
         * Create memory pressure.
         * 
         * A call to System.gc() is *not* sufficient to clear the soft references.
         */
        o = null;
        try {
            o = new int[10][10][10][10][10][10][10][10][10][10][10][10];
        } catch(OutOfMemoryError err) {
            // Ignore
        }

        /*
         * At this point the soft reference has been removed from the map and placed on the queue. Trigger the cleanup()
         * method. Do this by using the resolver with a different bean.
         */
        resolveB(r);

        /*
         * Placing a break point here and examining the heap with a profiler will show that references to TesterBeanA
         * have been removed from BeanELResolver.properties
         * 
         * Note that the profiler will still show a hard reference from the class loader to TesterBeanA since the
         * class loader retains a reference to all the classes it has loaded. 
         */
        System.out.println("debug point");
    }

    private static void resolveA(ELResolver r) {
        TesterBeanA b = new TesterBeanA();
        b.setData("test");

        System.out.println(r.getValue(new TestContext(), b, "data"));
    }

    private static void resolveB(ELResolver r) {
        TesterBeanB b = new TesterBeanB();
        b.setData("test");

        System.out.println(r.getValue(new TestContext(), b, "data"));
    }

    private static class TestContext extends ELContext {

        @Override
        public ELResolver getELResolver() {
            return null;
        }

        @Override
        public FunctionMapper getFunctionMapper() {
            return null;
        }

        @Override
        public VariableMapper getVariableMapper() {
            return null;
        }
    }

    public static class TesterBeanA {

        private String data;

        public String getData() {
            return data;
        }

        public void setData(String data) {
            this.data = data;
        }
    }

    public static class TesterBeanB {

        private String data;

        public String getData() {
            return data;
        }

        public void setData(String data) {
            this.data = data;
        }
    }
}
lprimak commented 4 months ago

Looking at your test, I can see why you would think it isn't a leak. However, it absolutely is a leak for all intents and purposes:

The test above introduces memory pressure before forcing cleanup() to be called, thus it does clean out the queue (and the attached ClassLoader)

However, this is not a realistic situation, as memory pressure usually happens after the cleanup() method gets called, not before. Memory pressure stays active, without an additional beans being resolved, thus cleanup() doesn't get called often enough to clean up the cache and thus is ineffective at cleaning up the class loaders. This will cause OOM if during application runtime, application will ask for more memory, without resolving additional beans and thus calling cleanup(), which is what I was observing in my real-world tests.

Saying all that, WeakReference solution will probably work in this case, and is probably better solution than the manual cleanup proposed here.

However, I'd like to keep this PR open until the WeakReference method gets incorporated and proven to work in real-world tests.

The risk of WeakReference implementation is that it would be purged from the cache too early and would actually make the cache ineffective. I do not believe that will be the case, but without testing, who knows?

markt-asf commented 4 months ago

This PR really needs to be closed and this discussion moved to the issue.

It does seem unlikely to me that there would be enough bean usage to cause notable memory usage, then some other activity that triggers GC and then no further bean usage until a OOME occurs. Unlikely, but I guess it is possible. I do wonder if some other memory leak was at play as well.

There are a couple of options here.

The simplest may be to switch using class name rather than class as the key. That would reduce the impact of what was left behind before cleanup() is called.

Switching to a WeakReference based cache of some form is probably the better solution although it is higher risk. The Tomcat implementation has been used successfully since Java EE 5 but I'd need to check on where it originated. I know I didn't write it so I can't just copy it.

lprimak commented 4 months ago

I’ll make another PR with WeakReference as the change should be trivial

lprimak commented 3 months ago

It does seem unlikely to me that there would be enough bean usage to cause notable memory usage, then some other activity that triggers GC and then no further bean usage until a OOME occurs. Unlikely, but I guess it is possible. I do wonder if some other memory leak was at play as well.

The bean usage isn't the issue. The issue is that stale beans from undeployed applications drag in their ClassLoaders thus creating a huge leak. Beans are usually created and cached during application deployment. They are supposed to be unloaded during application undeployment. However, because they remain in the cache, and there is no memory pressure as-of yet, they do not get cleaned up. After a number of deploy/undeploy cycles, memory usage gets close to maximum, but not enough to clean up SoftReferences. After that, during regular application runtime it allocates memory, and because at some point it cannot, OOM occurs.

Even before that point, GC and memory thrashing occurs because there is a lot of uncollectable classes and garbage remaining.

What I am describing above is not a hypothetical. It has been observed and recorded via JFR, VisualVM, JMC, etc.

Solution is this PR is also not a hypothetical. With his applied, the OOM/GC thrashing issues are gone. Also observed by JFR/JMC/VisualVM

The simplest may be to switch using class name rather than class as the key. That would reduce the impact of what was left behind before cleanup() is called.

This will not work, because the cached bean (map value) contains a reference to application's object, which in turn will refer to it's ClassLoader, preventing it from being cleaned up.

lprimak commented 2 months ago

Superseded by #248