hypfvieh / dbus-java

Improved version of java DBus library provided by freedesktop.org (https://dbus.freedesktop.org/doc/dbus-java/)
https://hypfvieh.github.io/dbus-java/
MIT License
185 stars 73 forks source link

NullPointerException in DBus Signal Receiver thread #172

Closed ghost closed 2 years ago

ghost commented 2 years ago

We're seeing a number of NullPointerExceptions in our app, but the source of the fault is unknown:

2022-07-08T09:21:55.986065+00:00 APP-277270 bash[369]: Exception in thread "DBus-Signal-Receiver-2" java.lang.NullPointerException
2022-07-08T09:21:55.987137+00:00 APP-277270 bash[369]: #011at java.base/java.util.concurrent.LinkedBlockingQueue.dequeue(LinkedBlockingQueue.java:214)
2022-07-08T09:21:55.987702+00:00 APP-277270 bash[369]: #011at java.base/java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:435)
2022-07-08T09:21:55.988312+00:00 APP-277270 bash[369]: #011at java.base/java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1054)
2022-07-08T09:21:55.988883+00:00 APP-277270 bash[369]: #011at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1114)
2022-07-08T09:21:55.989399+00:00 APP-277270 bash[369]: #011at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
2022-07-08T09:21:55.989955+00:00 APP-277270 bash[369]: #011at java.base/java.lang.Thread.run(Thread.java:829)

We're still looking into the problem. I'm wondering what would happen if the given _executor in the list of executors is missing from the following line:

https://github.com/hypfvieh/dbus-java/blob/048d47e7010b96f615b43c0cc8da858b7344e38d/dbus-java-core/src/main/java/org/freedesktop/dbus/connections/ReceivingService.java#L85

Could there be a race condition?

Any recommendations for adding diagnostics?

Also, it appears as though there are a lot of threads being created repeatedly:

2022-07-08T09:57:02.602337+00:00 APP-277270 bash[369]: Exception in thread "DBus-Signal-Receiver-1779245" java.lang.NullPointerException

I suppose this makes sense because we have a service that publishes 64 signals to the D-Bus every second. I'd have thought by using a thread pool that the number of actual threads created would be kept low.

hypfvieh commented 2 years ago

I don't think that this line is causing any problems. Anyways I will add some null checks just in case.

Debugging such problems is always problematic. I would start adding more logging to your code and also to the library if necessary. Also using a recent Java like 17 will help you with smart nullpointer and may point to the actual failure.

The thread spawning can be a side effect of the nullpointer. When the exception is thrown the executing thread will die and the executor will instantly start a new one.

ghost commented 2 years ago

Also using a recent Java like 17 will help you with smart nullpointer and may point to the actual failure.

We're running an overnight test using JDK 17 to see if the NPE pops up again. FWIW, we're publishing ~64 signals per second indefinitely. The source of the NPE is within a Java class; hopefully upgrading to JDK 17 doesn't indirectly fix the problem because that would make the underlying issue harder to isolate.

2022-07-08T09:21:55.987137+00:00 APP-277270 bash[369]: #011at java.base/java.util.concurrent.LinkedBlockingQueue.dequeue(LinkedBlockingQueue.java:214)

I'll add more details when we know more about the problem.

Meanwhile, a new release with additional null checks would be most welcome.

hypfvieh commented 2 years ago

I took a second look at this.

The "missing" Null-Check you mentioned should never cause any harm because the ThreadPoolExecutors which will be fetched from the EnumMap queried at that point are already initialized in the constructor of ReceivingService.

If the returned ExecutorService would have been null the NPE would have been thrown at that very line - but it is thrown in the code of LinkedBlockingQueue after the ExecutorService tried to get the next job in the ExecutorService internal Runnable queue.

Supplying Runnables to the ExecutorService from multiple threads should be no problem.

I changed the Map-Implementation from EnumMap to ConcurrentHashMap, but I don't think that this will fix the issue. As said before the NPE is thrown in ThreadPoolExecutor / LinkedBlockingQueue, not in dbus-java code. The queue in the ThreadPoolExecutor is not changed by dbus-java except for adding more Runnables when more signals arrive by calling ThreadPoolExecutor.execute(Runnable).

ghost commented 2 years ago

Running overnight using BellSoft's JDK 17 on armhf failed to reproduce the issue. I could only find one mention of a similar issue, with openHAB:

In the first post, one person suggested bumping the thread pool size:

org.eclipse.smarthome.threadpool:ruleEngine=100

By default, ReceivingService uses a modest thread pool size of 1:

https://github.com/hypfvieh/dbus-java/blob/048d47e7010b96f615b43c0cc8da858b7344e38d/dbus-java-core/src/main/java/org/freedesktop/dbus/connections/ReceivingService.java#L22

We could try bumping the thread count:

return DBusConnectionBuilder
  .forSessionBus()
  .withSignalThreadCount(5)
  .build();

In the second post, the bug appeared against openHAB when using:

@ssalonen stated he had the same issue (using Linux lerbacka-raspi 5.10.17-v7l+ #1421 SMP Thu May 27 14:00:13 BST 2021 armv7l GNU/Linux), which was resolved by switching Zulu OpenJDK 11 to Adopt OpenJDK. I've asked him to confirm that it was a straight change without also upgrading to JDK 17.

Everything seems to be stable with AdoptOpenJDK, installation has been up 5 weeks now without any issues.

We're using JDK 11 on 32-bit ARM (armhf), as opposed to arm7l.

BellSoft's JDK 17 ran cleanly, so we're going to try another run using Adoptium JDK 17.

ssalonen commented 2 years ago

I switched from Zulu 11 to adoptopenjdk-11-hotspot-jre (currently 11.0.11+9-). So both are JVM 11. After thus change, there have been no issues at all. I have been running with this setup for months now, although there are occasional reboots (power cuts) every month or two.

Indeed I have arm7l, I believe this is 32 bit version of raspbian:

pi@lerbacka-raspi:~ $ cat /etc/debian_version 10.11 pi@lerbacka-raspi:~ $ uname -a Linux lerbacka-raspi 5.10.63-v7l+ #1496 SMP Wed Dec 1 15:58:56 GMT 2021 armv7l GNU/Linux pi@lerbacka-raspi:~ $ dpkg-architecture -q DEB_BUILD_ARCH armhf

Let me know if you need any other information

EDIT: I noticed that indeed the whole Adoptium change has happened in the past months. I am using the version from the original AdoptOpenJDK apt repos, I guess they are not updated yet...

hypfvieh commented 2 years ago

I have to admit that I do not test dbus-java with other JDKs then the versions provided by Adoptium/AdoptOpenJDK. Also I only test it on x86_64 machines (I took over maintaince of dbus-java when playing with raspberry but that was long ago and I don't use this platform anymore).

One note to increasing signal thread count: If you do so, you might receive signals in an arbitrary order. There is no guarantee that the signals will be processed in the order they were received (due to the fact that each signal is handled by a different thread which may or may not be faster than another thread).

ghost commented 2 years ago

Thanks for adding the null check. I suspect the fault will still be present:

        ExecutorService exec = executors.get(_executor);
        if (exec == null) { // this should never happen, map is initialized in constructor
            throw new IllegalStateException("No executor found for " + _executor);
        } else if (closed || exec.isShutdown() || exec.isTerminated()) {
            throw new IllegalStateException("Receiving service already closed");
        }
        exec.execute(_r);

What appears to be happening is that, for whatever reason, executors.get(...) or exec.execute(_r) will throw a NullPointerException---possibly due to a JDK bug with respect to LinkedBlockingQueue. Checking for null won't resolve the issue because the exception would remain unhandled. Instead, I believe this needs to be something along the lines of:

  try {
    final ExecutorService exec = executors.get(_executor);

    if (closed || exec.isShutdown() || exec.isTerminated()) {
      throw new IllegalStateException("Receiving service already closed");
    }

    exec.execute(_r);
  }
  catch( final NullPointerException ex ) {
    // On armhf and arm7l, some JDK versions throw a NullPointerException when using the
    // LinkedBlockingQueue. Handle the exception and throw it up the chain.
    throw new IllegalStateException("Error retrieving or running executor for " + _executor, ex );
  }
ghost commented 2 years ago

Here's the code I'm running on armhf with Adopt OpenJDK 11 to try and recreate the bug:

import java.math.BigDecimal;
import static java.math.RoundingMode.HALF_UP;

import java.util.EnumMap;
import java.util.Map;
import java.util.Map.Entry;
import java.util.Optional;

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.ThreadFactory;

import java.util.concurrent.atomic.AtomicInteger;

public class ReceivingService {
  private final static int THREAD_POOL_SIZE = 1;

  private static class NameableThreadFactory implements ThreadFactory {
    private final ThreadGroup mGroup = Thread.currentThread().getThreadGroup();
    private final AtomicInteger mThreadCount = new AtomicInteger(1);
    private final String mPrefix;

    public NameableThreadFactory(final String prefix) {
        mPrefix = prefix;
    }

    @Override
    public Thread newThread(final Runnable r) {
      final var name = mPrefix + mThreadCount.getAndIncrement();
      final var t = new Thread(mGroup, r, name, 0L);

      t.setDaemon(true);
      t.setPriority(5);

      return t;
    }
  }

  private ExecutorService mExecutor = Executors.newFixedThreadPool(
    THREAD_POOL_SIZE,
    new NameableThreadFactory("DBus-Signal-Receiver-")
  );

  public ReceivingService() {}

  private void execSignalHandler(final Runnable r) {
    if( !(mExecutor.isShutdown() || mExecutor.isTerminated()) ) {
      mExecutor.execute( r );
      System.out.println( mExecutor );
    }
  }

  private static Runnable pi() {
    return () -> {
      final var TWO = new BigDecimal(2);
      final var NUMERATOR = new BigDecimal(4);

      var denominator = new BigDecimal(1);
      var pi = new BigDecimal(0);
      var out = 0;

      boolean plus = true;

      while( true ) {
        final var term = NUMERATOR.divide( denominator, 78, HALF_UP);

        pi = plus ? pi.add( term ) : pi.subtract( term );
        plus = !plus;
        denominator = denominator.add( TWO );

        if( out % 100000 == 0 ) {
          System.out.println( pi );
          out = 1;
        }
        else {
          out++;
        }

        Thread.yield();
      }
    };
  }

  public static void main( final String[] args ) throws Exception {
    final var service = new ReceivingService();
    final var i = new AtomicInteger();

    for( var p = 0; p < 10; p++ ) {
      final var processor = new Thread( pi() );

      processor.setPriority( Thread.MAX_PRIORITY );
      processor.start();
    }

    while( true ) {
      for( int signal = 0; signal < 64; signal++ ) {
        service.execSignalHandler( () -> { 
          final var handler = new Thread( () -> {
            i.incrementAndGet();
            System.out.println( i );
          });

          handler.start();
        });

        Thread.sleep(2);
      }

      Thread.sleep(1000);
    }
  }
}

I haven't managed to trigger the NPE after several hours of running.

hypfvieh commented 2 years ago

Catching the NPE and throwing something else is also no real fix for this problem. This is just replacing NPE for IllegalStateException, in both cases it is more or less unexpected and the runnable given will not be executed and the signal or whatever will never be handled.

The question is on how to deal with it. Dropping the runnable is not good in my opinion, because then someone will complain that signals are lost. I may add some loop which tries to add the runnable to the executor multiple times when previous calls fail (like trying to add up to 5 times or 10 times).

Anyway, for me it looks like a bug in some JDK versions of some vendors. It's unclear if this issue will also be triggered when running everything on x86 instead of arm.

ghost commented 2 years ago

Anyway, for me it looks like a bug in some JDK versions of some vendors. It's unclear if this issue will also be triggered when running everything on x86 instead of arm.

A little more context about the problem.

The only way we can reproduce the issue is via ReceivingService. Upgrading the JDK doesn't give us 100% confidence that the issue has been resolved (e.g., it could still be present in an updated JDK version, but happen less frequently). My guess it's strictly a 32-bit ARM problem, although I have no way to confirm.

Adding a retry loop (configurable?) along with a logging message indicating that the NullPointerException has occurred would be a helpful addition. In our situation, allowing the loop to exit without running the signal handler would probably be okay because the signal handler is re-run every second for each alarm in our system.

What about allowing users to register their own retry handler, possibly a Function or Predicate? The default retry handler could simply increment up to 5, but if we wanted to, we could have a handler that simply logs the problem and doesn't retry at all. Perhaps something like:

return DBusConnectionBuilder
  .forSessionBus()
  .withSignalThreadPriority(MIN_PRIORITY)
  .withRetryHandler( exception -> { myLogger.log( exception ); return false; } );
  .build();

Where return false means to stop retrying. The default could resemble:

  .withRetryHandler( exception -> { return count.getAndIncrement() < 5; } );
hypfvieh commented 2 years ago

I added the retry handler and did some small restructuring to separate receiving service config from connection config. So you should change the usage of the builder to something like

DBusConnectionBuilder.forSessionBus()
    .receivingThreadConfig()
        .withSignalThreadCount(4)
        .withRetryHandler((t, ex) -> {
            // do something useful and return true or false
            return false;
        })
    .connectionConfig()
    .build();
ghost commented 2 years ago

Thanks for the update. There are a few questions I have about the implementation:

if (retryHandler == null) {
  logger.error("Could not handle runnable for executor {}, runnable will be dropped", _executor, _ex);
} 
else if (_failCount < MAX_RETRIES) {
  if (retryHandler.handle(_executor, _ex)) {
    execOrFail(_executor, _r, _failCount++);
  }
  else {
    logger.debug("Ignoring unhandled runnable for executor {} due to {}, dropped by retry handler", _executor, _ex.getClass().getName());
  } 
}
else {
  logger.error("Could not handle runnable for executor {} after {} retries, runnable will be dropped", _executor, _failCount);
} 

First, if a default retryHandler was always set and guaranteed not null (e.g., asserted), could we eliminate the if statement?

if (retryHandler == null) {

Second, could the default retryHandler be responsible for logging or simply returning true? If so, would that allow us to eliminate the following if statement?

else if (_failCount < MAX_RETRIES) {

Third, is there a way to remove the recursion to avoid any possibility of stack busting?

    execOrFail(_executor, _r, _failCount++);

Consider:

void execOrFail(ExecutorNames _executor, Runnable _r) {
  if (_r == null || _executor == null) { // ignore invalid runnables or executors
    return;
  }

  int failCount = 0;
  boolean retry = false;

  do {
    try {
      ExecutorService exec = executors.get(_executor);

      if (exec == null) {
        // Possible JDK bug; should never happen, map is initialized in constructor
        throw new IllegalThreadPoolStateException("No executor found for " + _executor);
      }
      else if (closed || exec.isShutdown() || exec.isTerminated()) {
        throw new IllegalThreadPoolStateException("Receiving service already closed");
      }

      exec.execute(_r);
    } catch (IllegalThreadPoolStateException _ex) {
      // Re-throw our exception
      throw _ex;
    } catch (Exception _ex) {
      failCount++;
      retry = retryHandler.handle(_executor, _ex);

      if( !retry ) {
        logger.debug(
          "Ignoring unhandled runnable for executor {} due to {}, dropped by retry handler",
          _executor, _ex.getClass().getName()
        );
      }
    }
  }
  while( retry && failCount < MAX_FAILURES );

  if( failCount >= MAX_FAILURES ) {
    logger.error(
      "Could not handle runnable for executor {} after {} retries, runnable will be dropped",
      _executor, _failCount
    );
  }
}

Would moving MAX_FAILURES into the receiving configuration make sense, so that users can tweak the maximum retry count?

A trivial point, there's super-minor duplication in ReceivingServiceConfigBuilder:

if (retries.incrementAndGet() < 10) { ... }
LoggerFactory.getLogger(ReceivingService.class).error("Dropping runnable for {}, retry failed for more than 10 iterations, cause:", _executor, _ex);

Perhaps a constant?

DEFAULT_RETRY_COUNT = 10;
if (retries.incrementAndGet() < DEFAULT_RETRY_COUNT) { ... }
LoggerFactory.getLogger(ReceivingService.class).error("Dropping runnable for {}, retry failed for more than {} iterations, cause:", _executor, _ex, DEFAULT_RETRY_COUNT);
hypfvieh commented 2 years ago

I updated the code to follow some of you suggestions.

However, null is allowed as handler to disable all retry-handling. I thought about denying null values but I don't want to force using any retry handler.

Changing MAX_RETRIES is not planed. It's purpose is to prevent retrying indefinitely and spinning up the CPU.

Also changing the behaviour of the retry handler default implementation is not intended. If the default is not working for you, create your own implementation.

The default is to log the dropped runnable if it could not be executed after 10 retries in ERROR level. When no handler is installed, dropped runnable is logged as ERROR without any retrying. The log when the handler returns false is done in TRACE, so most of the time you wont see that. That's why the default handler logs in ERROR level as well.

ghost commented 2 years ago

I may be misunderstanding something, my apologies. Does the new algorithm execute the signal handler 50 times, if a NullPointerException occurs 50 times, thereby spamming the logs with 50 NullPointerExceptions?

In other words, how does the calling code prevent multiple retries from happening? Consider:

  1. exec.execute(_r); -- throws NullPointerException due to JDK bug
  2. failCount++ -- increments the retries
  3. !retryHandler.handle(_executor, _ex) -- returns false to indicate no more retries should be attempted
  4. while (failCount < MAX_RETRIES) -- evaluates to true, so loop continues

I think the exit condition needs to capture the result from the retry handler:

  while( retry && failCount < MAX_FAILURES ) {
    //  ... etc. ...
    retry = retryHandler.handle(_executor, _ex);

    if( !retry ) {
      // log the aborted retries.
    }
  }

Without checking retry, the loop looks like it'll run exec.execute(_r) again, which will encounter the NullPointerException again, and repeat until MAX_FAILURES is exhausted.

From what we can see in our logs, once the LinkedBlockingQueue enters the state where an NPE occurs, the NPE always occurs in very rapid succession. We'd like to prevent that situation by ceasing any attempts to retry after a single failure. (Without this ability, within a few hours, the drive fills up, which snowballs into many other issues.)

hypfvieh commented 2 years ago

I updated the code and added a unit test to verify everything is handled like I expected it to be.

The exception is only logged once when no handler present because no retry will be tried.

From what we can see in our logs, once the LinkedBlockingQueue enters the state where an NPE occurs, the NPE always occurs in very rapid succession. We'd like to prevent that situation by ceasing any attempts to retry after a single failure. (Without this ability, within a few hours, the drive fills up, which snowballs into many other issues.)

So is the ExecutorService completely broken or does this just happens more often when it occurred at least once? If the executor is broken then currently nothing would fix it because it is only created once and you cannot replace it with a new pool because you cannot access the executor service from the retry handler.

ghost commented 2 years ago

Looks good, thank you! Nice addition with returning the retry count, too.

So is the ExecutorService completely broken or does this just happens more often when it occurred at least once?

As far as I can tell, the ExecutorService works fine for a long time, at least 24 hours. Then something causes the NPE to trigger. When that happens, the log fills up with NPEs because the state never recovers:

# ls -la *syslog*
-rw-r--r-- 1 root adm 211537920 Jul 15 16:01 syslog
-rw-r--r-- 1 root adm 605098997 Jul 15 15:17 syslog.1
-rw-r--r-- 1 root adm     82771 Jul 15 00:00 syslog.2.gz
-rw-r--r-- 1 root adm     53446 Jul 14 00:00 syslog.3.gz
-rw-r--r-- 1 root adm     55189 Jul 13 00:00 syslog.4.gz
-rw-r--r-- 1 root adm     55391 Jul 12 00:00 syslog.5.gz
-rw-r--r-- 1 root adm     56359 Jul 11 00:00 syslog.6.gz
-rw-r--r-- 1 root adm     74952 Jul 10 00:00 syslog.7.gz

With the changes to the D-Bus library, we can now perform a graceful shut down of the application, which restarts the application automatically. The two large log files are overflowing with the following lines:

2022-07-15T14:50:27.380545+00:00 APP-279428 bash[368]: Exception in thread "DBus-Signal-Receiver-1" java.lang.NullPointerException
2022-07-15T14:50:27.383736+00:00 APP-279428 app: ERROR [DBus-Signal-Receiver-1, appRepeater] Exception in thread "DBus-Signal-Receiver-1"
2022-07-15T14:50:27.384139+00:00 APP-279428 app: ERROR [DBus-Signal-Receiver-1, appRepeater] java.lang.NullPointerException
2022-07-15T14:50:27.386483+00:00 APP-279428 app: ERROR [DBus-Signal-Receiver-1, appRepeater] #011at java.base/java.util.concurrent.LinkedBlockingQueue.dequeue(LinkedBlockingQueue.java:217)
2022-07-15T14:50:27.386531+00:00 APP-279428 bash[368]: #011at java.base/java.util.concurrent.LinkedBlockingQueue.dequeue(LinkedBlockingQueue.java:217)
2022-07-15T14:50:27.387324+00:00 APP-279428 bash[368]: #011at java.base/java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:435)
2022-07-15T14:50:27.387693+00:00 APP-279428 app: ERROR [DBus-Signal-Receiver-1, appRepeater] #011at java.base/java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:435)
2022-07-15T14:50:27.388023+00:00 APP-279428 bash[368]: #011at java.base/java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1054)
2022-07-15T14:50:27.388087+00:00 APP-279428 app: ERROR [DBus-Signal-Receiver-1, appRepeater] #011at java.base/java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1054)
2022-07-15T14:50:27.388691+00:00 APP-279428 app: ERROR [DBus-Signal-Receiver-1, appRepeater] #011at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1114)
2022-07-15T14:50:27.388708+00:00 APP-279428 bash[368]: #011at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1114)
2022-07-15T14:50:27.389396+00:00 APP-279428 bash[368]: #011at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
2022-07-15T14:50:27.389555+00:00 APP-279428 app: ERROR [DBus-Signal-Receiver-1, appRepeater] #011at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
2022-07-15T14:50:27.389955+00:00 APP-279428 app: ERROR [DBus-Signal-Receiver-1, appRepeater] #011at java.base/java.lang.Thread.run(Thread.java:829)
2022-07-15T14:50:27.390009+00:00 APP-279428 bash[368]: #011at java.base/java.lang.Thread.run(Thread.java:829)
2022-07-15T14:50:27.401847+00:00 app-279428 app: ERROR [DBus-Signal-Receiver-2, appRepeater] Exception in thread "DBus-Signal-Receiver-2"  
...
2022-07-15T15:59:18.447632+00:00 APP-279428 app: ERROR [DBus-Signal-Receiver-3061145, appRepeater] Exception in thread "DBus-Signal-Receiver-3061145"
2022-07-15T15:59:18.447632+00:00 APP-279428 app: ERROR [DBus-Signal-Receiver-3061145, appRepeater] java.lang.NullPointerException
2022-07-15T15:59:18.447498+00:00 APP-279428 bash[368]: Exception in thread "DBus-Signal-Receiver-3061145" java.lang.NullPointerException
2022-07-15T15:59:18.447887+00:00 APP-279428 app: ERROR [DBus-Signal-Receiver-3061144, appRepeater] java.lang.NullPointerException
2022-07-15T15:59:18.447844+00:00 APP-279428 bash[368]: Exception in thread "DBus-Signal-Receiver-3061144" java.lang.NullPointerException
2022-07-15T15:59:18.449721+00:00 APP-279428 app: ERROR [DBus-Signal-Receiver-3061146, appRepeater] Exception in thread "DBus-Signal-Receiver-3061146"
2022-07-15T15:59:18.449721+00:00 APP-279428 app: ERROR [DBus-Signal-Receiver-3061146, appRepeater] java.lang.NullPointerException
2022-07-15T15:59:18.449652+00:00 APP-279428 bash[368]: Exception in thread "DBus-Signal-Receiver-3061146" java.lang.NullPointerException
2022-07-15T15:59:18.450846+00:00 APP-279428 app: ERROR [DBus-Signal-Receiver-3061147, appRepeater] Exception in thread "DBus-Signal-Receiver-3061147"

From the onset of the issue, it takes abount 1 hour and 10 minutes to fill up the drive completely.

ghost commented 2 years ago

The core JDK bug is captured in https://github.com/adoptium/adoptium-support/issues/544.

ghost commented 1 year ago

FYI, we've encountered this on JDK 17.0.4.

2022-11-21T23:57:46.804595+00:00 APP-val app: ERROR [DBus-Signal-Receiver-1, AppRepeater] Exception in thread "DBus-Signal-Receiver-1"
2022-11-21T23:57:46.810339+00:00 APP-val app: ERROR [DBus-Signal-Receiver-2, AppRepeater] Exception in thread "DBus-Signal-Receiver-2"
2022-11-21T23:57:46.811329+00:00 APP-val app: ERROR [DBus-Signal-Receiver-1, AppRepeater] java.lang.NullPointerException: Cannot read field "item" because "<local2>" is null
2022-11-21T23:57:46.823741+00:00 APP-val app: ERROR [DBus-Signal-Receiver-3, AppRepeater] Exception in thread "DBus-Signal-Receiver-3"
2022-11-21T23:57:46.825536+00:00 APP-val app: ERROR [DBus-Signal-Receiver-1, AppRepeater] #011at java.base/java.util.concurrent.LinkedBlockingQueue.dequeue(Unknown Source)
2022-11-21T23:57:46.827412+00:00 APP-val app: ERROR [DBus-Signal-Receiver-1, AppRepeater] #011at java.base/java.util.concurrent.LinkedBlockingQueue.take(Unknown Source)
2022-11-21T23:57:46.827865+00:00 APP-val app: ERROR [DBus-Signal-Receiver-4, AppRepeater] Exception in thread "DBus-Signal-Receiver-4"
2022-11-21T23:57:46.829711+00:00 APP-val app: ERROR [DBus-Signal-Receiver-1, AppRepeater] #011at java.base/java.util.concurrent.ThreadPoolExecutor.getTask(Unknown Source)
2022-11-21T23:57:46.834601+00:00 APP-val app: ERROR [DBus-Signal-Receiver-5, AppRepeater] Exception in thread "DBus-Signal-Receiver-5"
2022-11-21T23:57:46.835439+00:00 APP-val app: ERROR [DBus-Signal-Receiver-6, AppRepeater] Exception in thread "DBus-Signal-Receiver-6"
2022-11-21T23:57:46.835971+00:00 APP-val app: ERROR [DBus-Signal-Receiver-7, AppRepeater] Exception in thread "DBus-Signal-Receiver-7"
2022-11-21T23:57:46.840623+00:00 APP-val app: ERROR [DBus-Signal-Receiver-8, AppRepeater] Exception in thread "DBus-Signal-Receiver-8"
2022-11-21T23:57:46.859099+00:00 APP-val app: ERROR [DBus-Signal-Receiver-9, AppRepeater] Exception in thread "DBus-Signal-Receiver-9"
2022-11-21T23:57:46.860048+00:00 APP-val app: ERROR [DBus-Signal-Receiver-1, AppRepeater] #011at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
2022-11-21T23:57:46.860373+00:00 APP-val app: ERROR [DBus-Signal-Receiver-1, AppRepeater] #011at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
2022-11-21T23:57:46.860884+00:00 APP-val app: ERROR [DBus-Signal-Receiver-1, AppRepeater] #011at java.base/java.lang.Thread.run(Unknown Source)
2022-11-21T23:57:46.861349+00:00 APP-val app: ERROR [DBus-Signal-Receiver-9, AppRepeater] java.lang.NullPointerException: Cannot read field "next" because "<local1>" is null
2022-11-21T23:57:46.861793+00:00 APP-val app: ERROR [DBus-Signal-Receiver-9, AppRepeater] #011at java.base/java.util.concurrent.LinkedBlockingQueue.dequeue(Unknown Source)
2022-11-21T23:57:46.862291+00:00 APP-val app: ERROR [DBus-Signal-Receiver-9, AppRepeater] #011at java.base/java.util.concurrent.LinkedBlockingQueue.take(Unknown Source)
2022-11-21T23:57:46.862715+00:00 APP-val app: ERROR [DBus-Signal-Receiver-9, AppRepeater] #011at java.base/java.util.concurrent.ThreadPoolExecutor.getTask(Unknown Source)
2022-11-21T23:57:46.863181+00:00 APP-val app: ERROR [DBus-Signal-Receiver-9, AppRepeater] #011at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
2022-11-21T23:57:46.863657+00:00 APP-val app: ERROR [DBus-Signal-Receiver-9, AppRepeater] #011at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
2022-11-21T23:57:46.864193+00:00 APP-val app: ERROR [DBus-Signal-Receiver-9, AppRepeater] #011at java.base/java.lang.Thread.run(Unknown Source)

See https://github.com/adoptium/adoptium-support/issues/609

The logs get continuously spammed, despite the mitigation strategy we put in place with the DBus Java library. It's like retry handler is ignored:

        protected DBusConnection createConnection() {
            try {
                return DBusConnectionBuilder
                        .forSessionBus()
                        .receivingThreadConfig()
                        .withSignalThreadPriority(MIN_PRIORITY)
                        .withRetryHandler((t, ex) -> {
                            // A NullPointerException occurred; the JVM has entered an irrecoverable state.
                            new ShutdownEvent(ex).publish();

                            // Terminate the D-Bus thread's retry handler.
                            return false;
                        })
                        .connectionConfig()
                        .build();
            }
            catch (final DBusException e) {
                throw new RuntimeException(e);
            }
        }

Our application doesn't shut down and the signals received over the D-Bus continue to be processed. It's super weird.

hypfvieh commented 1 year ago

Sad to hear that ... Anyway I don't know what I can do about it. When the JVM goes into some 'broken' state, you pretty much out of luck to do anything "from inside". I hope you and the openJDK people can track it down... Maybe virtual threads will help in when hopefully arriving with JDK 21 ;)