projectnessie / nessie

Nessie: Transactional Catalog for Data Lakes with Git-like semantics
https://projectnessie.org
Apache License 2.0
984 stars 125 forks source link

Fix class/resource leak in `:nessie-quarkus:test` #9441

Closed snazy closed 1 week ago

snazy commented 3 weeks ago

Added a commit to #9433 to increase the max test-worker heap size to 3g. This makes the :nessie-quarkus:test task finish.

There is a resource leak, but that's already present with earlier Quarkus versions, so I suspect something that's present in all our tests is causing the leak, #9440 is one cause, but not the only one.

This issue is to investigate, find the root cause, fix it and bring the max test-worker heap size back to 2g.

snazy commented 3 weeks ago

So, good news first: there are

But there are ThreadLocals leaking, which are loaded by the Quarkus test class-loaders, which then cause "all the classes" and resources to leak. Some classes that add ThreadLocals are the following. There are more - but the ones that are initialized with a Supplier seem to be the problematic ones.

snazy commented 2 weeks ago

Discussion on Quarkus Zulip

snazy commented 2 weeks ago

The leak's caused by ThreadLocals attached by 3rd party libraries (Netty, Avro and some more) to the shared JUnit Test worker thread.

In JUnit, tests are run from the single Test worker thread (or a poll of threads, if JUnit is configured to do so). The QuarkusTestExtensions changes the thread's context-class-loader to use the correct Quarkus class loader (QuarkusClassLoader:Quarkus Base Runtime ClassLoader: TEST for ...).

If a test somehow "touches" a dependency that creates a thread local, that thread local is attached to the Test worker thread. Types in such a thread local reference the Quarkus test class loader.

Even though everything else is properly cleaned up before the next test w/ a different Quarkus test class loader starts, the thread locals are still referenced by the Test worker thread, which means that the objects/suppliers in those thread locals are still referenced, in turn the class loader and with that the whole set of classes. The result is eventually an OOM.

Here's an output that (should) illustrate the issue. It's generated by some hacky code that inspects the Test worker thread's threadlocals map.

THREAD LOCALS OF Thread[#1,Test worker,5,main]
jdk.internal.loader.ClassLoaders$PlatformClassLoader@42f5e9a6
   jdk.nio.zipfs.ZipFileSystem$IndexNode /org/assertj/core/error/UnsatisfiedRequirement.class , index: -1
QuarkusClassLoader:Quarkus Base Runtime ClassLoader: TEST for TestMetricsDisabled [V1] (QuarkusTest)@69b07727
   org.apache.avro.NameValidator$2 org.apache.avro.NameValidator$2@c8f14da
   org.apache.avro.Schema$$Lambda/0x00007feccca70000 org.apache.avro.Schema$$Lambda/0x00007feccca70000@5082ae8
   org.projectnessie.cel.parser.Unescape$$Lambda/0x00007feccce6af98 org.projectnessie.cel.parser.Unescape$$Lambda/0x00007feccce6af98@2f992272
QuarkusClassLoader:Quarkus Base Runtime ClassLoader: TEST for TestMetricsTags [V1] (QuarkusTest)@5cf93ce8
   io.netty.util.internal.InternalThreadLocalMap io.netty.util.internal.InternalThreadLocalMap@45d1828
   org.apache.avro.NameValidator$2 org.apache.avro.NameValidator$2@eecf044
   org.apache.avro.Schema$$Lambda/0x00007feccd87fd10 org.apache.avro.Schema$$Lambda/0x00007feccd87fd10@69e3d304
   org.projectnessie.cel.parser.Unescape$$Lambda/0x00007feccdc9a5a0 org.projectnessie.cel.parser.Unescape$$Lambda/0x00007feccdc9a5a0@3dd96fb3
jdk.internal.loader.ClassLoaders$AppClassLoader@76ed5528
   org.jboss.logmanager.ThreadLocalNDC$Stack 
QuarkusClassLoader:Quarkus Base Runtime ClassLoader: TEST for TestQuarkusRestInMemoryPersist [V1] (QuarkusTest)@681e913c
   org.apache.avro.NameValidator$2 org.apache.avro.NameValidator$2@69d2fc03
   org.apache.avro.Schema$$Lambda/0x00007feccb96ec18 org.apache.avro.Schema$$Lambda/0x00007feccb96ec18@423a5c9c
   org.projectnessie.cel.parser.Unescape$$Lambda/0x00007feccbe30fa8 org.projectnessie.cel.parser.Unescape$$Lambda/0x00007feccbe30fa8@2ef77fcc
snazy commented 2 weeks ago

Just documenting the extremely hacky code used to identify the issue...


public class TempBase {

  @BeforeAll
  public static void foo() throws Exception {
    System.out.println();
    System.out.println();
    System.out.println("*** START UP");
    System.out.println();
    System.out.println();

    dump();
  }

  @AfterAll
  public static void cleanup() throws Exception {
    System.out.println();
    System.out.println();
    System.out.println("*** CLEAN UP");
    System.out.println();
    System.out.println();

    dump();
  }

  private static void dump() throws Exception {
    Map<ClassLoader, List<Thread>> threadsByClassLoader = new IdentityHashMap<>();
    Thread t1 = null;
    for (Thread thread : Thread.getAllStackTraces().keySet()) {
      if (thread.threadId() == 1) {
        t1 = thread;
      }
      threadsByClassLoader
          .computeIfAbsent(thread.getContextClassLoader(), cl -> new ArrayList<>())
          .add(thread);
    }
    threadsByClassLoader.forEach(
        (cl, t) -> {
          if (cl == null) {
            return;
          }
          System.out.println(cl);
          t.sort(Comparator.comparing(Thread::getName).thenComparing(Thread::getId));
          for (Thread thread : t) {
            System.out.println("   " + thread.threadId() + " " + thread.getName());
          }
        });

    if (t1 != null) {
      dumpThreadLocals(t1);
    }

    Thread t = Thread.currentThread();
    if (t != t1) {
      dumpThreadLocals(t);
    }
  }

  private static void dumpThreadLocals(Thread t)
      throws NoSuchFieldException, IllegalAccessException {
    Map<ClassLoader, List<Object>> threadLocalsByClassLoader = new IdentityHashMap<>();
    Field threadLocalsField = Thread.class.getDeclaredField("threadLocals");
    threadLocalsField.setAccessible(true);
    Object threadLocals = threadLocalsField.get(t);
    Field tableField = threadLocals.getClass().getDeclaredField("table");
    tableField.setAccessible(true);
    Object[] entries = (Object[]) tableField.get(threadLocals);
    for (Object entry : entries) {
      if (entry != null) {
        Field valueField = entry.getClass().getDeclaredField("value");
        valueField.setAccessible(true);
        Object value = valueField.get(entry);
        if (value != null) {
          threadLocalsByClassLoader
              .computeIfAbsent(value.getClass().getClassLoader(), cl -> new ArrayList<>())
              .add(value);

          Field referentField = Reference.class.getDeclaredField("referent");
          referentField.setAccessible(true);
          Object referent = referentField.get(entry);
          if (referent != null
              && referent
                  .getClass()
                  .getName()
                  .equals("java.lang.ThreadLocal$SuppliedThreadLocal")) {
            Field supplierField = referent.getClass().getDeclaredField("supplier");
            supplierField.setAccessible(true);
            Object supplier = supplierField.get(referent);
            if (supplier != null) {
              threadLocalsByClassLoader
                  .computeIfAbsent(supplier.getClass().getClassLoader(), cl -> new ArrayList<>())
                  .add(supplier);
            }
          }
        }
      }
    }
    System.out.println("THREAD LOCALS OF " + t);
    threadLocalsByClassLoader.forEach(
        (cl, objs) -> {
          if (cl == null) {
            return;
          }
          System.out.println(cl);
          objs.sort(
              Comparator.comparing(o -> o.getClass().getName())
                  .thenComparing(System::identityHashCode));
          for (Object obj : objs) {
            System.out.println("   " + obj.getClass().getName() + " " + obj);
          }
        });
  }
}
snazy commented 2 weeks ago

Related JUnit issue