NPE when listing repairs

adejanovski commented 5 months ago

Without an understanding of what's going on so far, I'm seeing the following error popping up on a reaper instance running in k8ssandra-operator:

ERROR  [2024-01-26 10:42:02,249] [dw-3200 - GET /repair_run?cluster_name=all&limit=10] i.d.j.e.LoggingExceptionMapper - Error handling a request: bf338ca94d1b3864 
java.lang.NullPointerException: null
    at io.cassandrareaper.storage.repairrun.CassandraRepairRunDao.lambda$getRepairRunsForClusterPrioritiseRunning$3(CassandraRepairRunDao.java:404)
    at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
    at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1655)
    at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
    at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
    at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913)
    at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578)
    at io.cassandrareaper.storage.repairrun.CassandraRepairRunDao.getRepairRunsForClusterPrioritiseRunning(CassandraRepairRunDao.java:406)
    at io.cassandrareaper.storage.repairrun.CassandraRepairRunDao.getRepairRunsForClusterPrioritiseRunning(CassandraRepairRunDao.java:58)
    at io.cassandrareaper.resources.RepairRunResource.lambda$listRepairRuns$1(RepairRunResource.java:726)
    at java.base/java.util.ArrayList.forEach(ArrayList.java:1541)
    at java.base/java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1085)
    at io.cassandrareaper.resources.RepairRunResource.listRepairRuns(RepairRunResource.java:725)
    at jdk.internal.reflect.GeneratedMethodAccessor70.invoke(Unknown Source)

While we need to investigate the underlying issue, we should protect ourselves from the NPE and return a proper error message while letting the lambda pursue its work.

The fact that we're invoking row.getUninterruptibly().one() without any prior check that a row exists is a bet that we may often lose, like we currently are:

    return repairRunFutures
        .stream()
        .map(
            row -> {
              Row extractedRow = row.getUninterruptibly().one();
              return buildRepairRunFromRow(extractedRow, extractedRow.getUUID("id"));
            }
        ).collect(Collectors.toList());

Miles-Garnsey commented 4 months ago

Which lambda are we wanting to continue here? I'm not sure what you're referring to. Am I right in thinking that we want to check that the ResultSet returned by row.getUninterruptibly() is not empty?

If the ResultSet is empty, do we want to throw a ReaperException for handling up the stack, or just log and return an empty List<RepairRun>?

(NB: I'll tackle this since it is surely an error in code I've written...)

adejanovski commented 4 months ago

I'll de-prioritize this for now, the root cause was a consistency issue in the db. This is a nice to have but not high priority.

Miles-Garnsey commented 4 months ago

I would have been able to work on this today (as I think my other tickets are done, and I don't have anything else queued up) but I need answers to the above questions @adejanovski...

SesquipedalianDefenestrator commented 3 months ago

This hit us as well, and I ended up making a PR before noticing the Issue: https://github.com/thelastpickle/cassandra-reaper/pull/1478

DB inconsistency is the root cause, but in our case it was cased by Reaper filling the disk with logs (which it tends to do if anything goes wrong) on the VM. Enough that I think we've hit it a few times before, but just solved it previously by wiping the DB. This round I had time to go digging.

thelastpickle / cassandra-reaper

NPE when listing repairs #1463