grpc / grpc-java

The Java gRPC implementation. HTTP/2 based RPC
https://grpc.io/docs/languages/java/
Apache License 2.0
11.4k stars 3.83k forks source link

Graceful shutdown within servlet container results in NoClassDefFoundError being thrown asynchronously #9510

Open jnehlmeier opened 2 years ago

jnehlmeier commented 2 years ago

What version of gRPC-Java are you using?

io.grpc:grpc-netty-shaded:1.46.0

What is your environment?

macOS / Linux / JDK 15 / Jetty 10

What did you expect to see?

No exception being thrown after a graceful shutdown. Graceful shutdown happens during undeployment of web application.

What did you see instead?

The log indicates that graceful shutdown has succeeded and consequently the web application has been undeployed. However, roughly a second later Netty throws NoClassDefFoundException seen below. Seems like channel.awaitTermination() returns too early.

06.09.2022 14:07:47.465 | grpc-shared-destroyer-0 | ERROR io.grpc.internal.LogExceptionRunnable 
Exception while executing runnable io.grpc.internal.SharedResourceHolder$2@58725740
java.lang.ClassNotFoundException: io.grpc.netty.shaded.io.netty.util.concurrent.DefaultPromise$1
    at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:606)
    at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:168)
    at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
    at org.eclipse.jetty.webapp.WebAppClassLoader.loadClass(WebAppClassLoader.java:511)
    at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
    ... 17 common frames omitted
Wrapped by: java.lang.NoClassDefFoundError: io/grpc/netty/shaded/io/netty/util/concurrent/DefaultPromise$1
    at io.grpc.netty.shaded.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:499)
    at io.grpc.netty.shaded.io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:616)
    at io.grpc.netty.shaded.io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:609)
    at io.grpc.netty.shaded.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117)
    at io.grpc.netty.shaded.io.netty.util.concurrent.SingleThreadEventExecutor.ensureThreadStarted(SingleThreadEventExecutor.java:961)
    at io.grpc.netty.shaded.io.netty.util.concurrent.SingleThreadEventExecutor.shutdownGracefully(SingleThreadEventExecutor.java:660)
    at io.grpc.netty.shaded.io.netty.util.concurrent.MultithreadEventExecutorGroup.shutdownGracefully(MultithreadEventExecutorGroup.java:163)
    at io.grpc.netty.shaded.io.grpc.netty.Utils$DefaultEventLoopGroupResource.close(Utils.java:439)
    at io.grpc.netty.shaded.io.grpc.netty.Utils$DefaultEventLoopGroupResource.close(Utils.java:402)
    at io.grpc.internal.SharedResourceHolder$2.run(SharedResourceHolder.java:139)
    at io.grpc.internal.LogExceptionRunnable.run(LogExceptionRunnable.java:43)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
    at java.base/java.lang.Thread.run(Thread.java:832)

Steps to reproduce the bug

Use a servlet container, e.g. Jetty, and register a ServletContextListener that starts and stops a GRPC client to call an external service. When undeploying the web application, the exception is sometimes thrown.

public final class MyGrpcServiceClient {

  private ManagedChannel channel;

  public void start() {
    checkState(channel == null, "already started");
    chanel = ManagedChannelBuilder.forTarget("dns:///.....").usePlaintext().build();
  }

  public void stop() {
    checkState(channel != null, "already stopped")
    channel.shutdown();
    try {
      if (!channel.awaitTermination(1, TimeUnit.MINUTES)) {
        // log warning that graceful shutdown has taken too long
      }
    } catch (InterruptedException e) {
      // log that graceful shutdown was interrupted
    } finally {
      channel = null;
    }
  }

  // ... factory methods for returning new GRPC service stubs using above channel ...
}

public final class MyGrpcServiceClientSCL implements ServletContextListener {

  private MyGrpcServiceClient client = new MyGrpcServiceClient();

  @Override
  public void contextInitialized(ServletContextEvent e) {
    try {
      client.start();
      // log successful start of service client
    } catch (Exception e) {
      throw new Error(e);
    }
  }

  @Override
  public void contextDestroyed(ServletContextEvent e) {
    try {
      client.stop();
      // log successful shutdown of service client
    } catch (Exception e) {
      // log unexpected exceptions
    }
  }

}
sanjaypujare commented 2 years ago

There might be an existing issue but I could not find it. @ejona86 probably has more details.

ejona86 commented 2 years ago

There may be another one, but #4893 is the same.

ejona86 commented 2 years ago

8309 is the one I remembered.

sanjaypujare commented 2 years ago

Thanks @ejona86 . I see similarities with #8309. The comment https://github.com/grpc/grpc-java/issues/8309#issuecomment-881654181 might apply here i.e. "I suspect that if you add a Thread.sleep(1100) after the await termination the error will go away"

jnehlmeier commented 2 years ago

Thanks for your responses. As additional information I can tell you that I have upgraded grpc-java from version 1.15.0 to 1.46.0 and only after that update this issue appeared. Also while using 1.15.0 the code wasn't fully correct because it did never call awaitTermination. So during upgrade to 1.46.0 I also added the awaitTermination call. I know that 1.15.0 is 4 years old and a lot has probably changed since then. Just wanted to add that information for completeness and to let you know that you did not had that problem in the past (even without calling awaitTermination)

The comment https://github.com/grpc/grpc-java/issues/8309#issuecomment-881654181 might apply here i.e. "I suspect that if you add a Thread.sleep(1100) after the await termination the error will go away"

Honestly I would expect grcp-java to implement Thread.sleep() inside awaitTermination. I know it would feel hack'ish to you but it would feel even more hack'ish to me because then I have some magic number in my code that originates from some implementation detail deep inside grpc-java. So I really want to avoid that.

As a library user I expect awaitTermination to do its job and not cause any asynchronous errors. Currently our automated error reporting ships this exception regularly from multiple app server instances. So it is quite annoying.

sanjaypujare commented 2 years ago

@ejona86 could you comment on the above?

jnehlmeier commented 1 year ago

Same situation but different exception(s):

06.01.2023 11:28:11.536 | grpc-default-worker-ELG-1-2 | ERROR i.g.n.s.i.n.u.c.D.rejectedExecution 
Failed to submit a listener notification task. Event loop shut down?
java.lang.ClassNotFoundException: io.grpc.netty.shaded.io.netty.util.concurrent.GlobalEventExecutor$2
    at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:606)
    at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:168)
    at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
    at org.eclipse.jetty.webapp.WebAppClassLoader.loadClass(WebAppClassLoader.java:511)
    at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
    ... 11 common frames omitted
Wrapped by: java.lang.NoClassDefFoundError: io/grpc/netty/shaded/io/netty/util/concurrent/GlobalEventExecutor$2
    at io.grpc.netty.shaded.io.netty.util.concurrent.GlobalEventExecutor.startThread(GlobalEventExecutor.java:217)
    at io.grpc.netty.shaded.io.netty.util.concurrent.GlobalEventExecutor.execute(GlobalEventExecutor.java:205)
    at io.grpc.netty.shaded.io.netty.util.concurrent.DefaultPromise.safeExecute(DefaultPromise.java:841)
    at io.grpc.netty.shaded.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:499)
    at io.grpc.netty.shaded.io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:616)
    at io.grpc.netty.shaded.io.netty.util.concurrent.DefaultPromise.setSuccess0(DefaultPromise.java:605)
    at io.grpc.netty.shaded.io.netty.util.concurrent.DefaultPromise.setSuccess(DefaultPromise.java:96)
    at io.grpc.netty.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:1048)
    at io.grpc.netty.shaded.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.grpc.netty.shaded.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base/java.lang.Thread.run(Thread.java:832)

Just before the above exception the exception below is logged several times as WARN

06.01.2023 11:28:11.535 | grpc-default-worker-ELG-1-8 | WARN  i.g.n.s.i.n.u.c.SingleThreadEventExecutor 
Unexpected exception from an event executor: 
java.lang.NoClassDefFoundError: io/grpc/netty/shaded/io/netty/util/collection/IntObjectHashMap$2
    at io.grpc.netty.shaded.io.netty.util.collection.IntObjectHashMap.values(IntObjectHashMap.java:221)
    at io.grpc.netty.shaded.io.netty.channel.epoll.EpollEventLoop.closeAll(EpollEventLoop.java:433)
    at io.grpc.netty.shaded.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:401)
    at io.grpc.netty.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
    at io.grpc.netty.shaded.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.grpc.netty.shaded.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base/java.lang.Thread.run(Thread.java:832)

Has any work been done in newer GRPC versions to prevent that behavior and to make awaitTermination work as expected?

jnehlmeier commented 1 year ago

Thanks @ejona86 . I see similarities with #8309. The comment #8309 (comment) might apply here i.e. "I suspect that if you add a Thread.sleep(1100) after the await termination the error will go away"

Would you accept a PR that adds a Thread.sleep(1100) (or a bit higher) to awaitTermination as a workaround for this issue until a proper solution has been developed?

ejona86 commented 1 year ago

Would you accept a PR that adds a Thread.sleep(1100) (or a bit higher) to awaitTermination

No, because it isn't a per-Channel problem. If you create 100 channels, the sleep would only do something for the last Channel being shut down. Unless you were creating another Channel right after that.

I would accept someone filing a bug with Jetty that their ClassLoader prevents shutdown. Breaking the ClassLoader is such a weak protection that it is basically useless, yet it is causing breakages.

jnehlmeier commented 1 year ago

I would accept someone filing a bug with Jetty that their ClassLoader prevents shutdown. Breaking the ClassLoader is such a weak protection that it is basically useless, yet it is causing breakages.

Well I don't think there is any Jetty issue. Jetty sees that the web application has successfully shutdown and thus destroys the ClassLoader for that web application.

I don't know how grpc-java works internally but if you say that it is not a per-Channel problem then I assume that you start Netty once for all channels created and even if all channels have been terminated using awaitTermination Netty is still running for some time? Maybe grpc-java then needs some additional shutdown methods (or maybe they already exist?) that only returns if the whole Netty server with all GRPC channels is completely shutdown?

ejona86 commented 1 year ago

I've looked enough at Jetty's WebAppClassLoader that I don't think there's much to change there. I thought there was something done on purpose here, but instead I now think they are probably just deleting the wars/jars and that causes the problem.

On the grpc side, this is because we share the Netty threads across channels/servers. That by itself might be okay. But to avoid rapidly re-creating the resources if the reference count flaps rapidly between 1 and 0, we delay the shutdown by a second.