powsybl / powsybl-hpc

High Performance Computing modules for powsybl
https://www.powsybl.org
Mozilla Public License 2.0

[Slurm] NPE on non zero exit code in job array #56

Closed sylvlecl closed 3 years ago

sylvlecl commented 3 years ago

Bug

An exception is thrown after a non-zero exit code in one job of a job array:

java.lang.NullPointerException
    at java.util.Objects.requireNonNull(Objects.java:203)
    at com.powsybl.computation.ExecutionError.<init>(ExecutionError.java:24)
    at com.powsybl.computation.slurm.JobArraySlurmTask.convertScontrolResult2Error(JobArraySlurmTask.java:95)
    at com.powsybl.computation.slurm.AbstractTask.generateReport(AbstractTask.java:140)
    at com.powsybl.computation.slurm.AbstractTask.await(AbstractTask.java:130)
    at com.powsybl.computation.slurm.SlurmComputationManager.doExecute(SlurmComputationManager.java:277)
    at com.powsybl.computation.slurm.SlurmComputationManager.lambda$execute$0(SlurmComputationManager.java:220)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at com.powsybl.computation.CompletableFutureTask.run(CompletableFutureTask.java:43)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)

Another one, encountered when multiple commands have non-zero exit codes:

Caused by: java.lang.NumberFormatException: For input string: "12-39"
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.lang.Integer.parseInt(Integer.java:580)
    at java.lang.Integer.parseInt(Integer.java:615)
    at com.powsybl.computation.slurm.ScontrolCmd$ScontrolResultBean.parseShowJob(ScontrolCmd.java:146)
    at com.powsybl.computation.slurm.ScontrolCmd$ScontrolResultBean.parse(ScontrolCmd.java:123)
    at com.powsybl.computation.slurm.ScontrolCmd$ScontrolResultBean.<init>(ScontrolCmd.java:81)
    at com.powsybl.computation.slurm.ScontrolCmd$ScontrolResult.parse(ScontrolCmd.java:52)
    at com.powsybl.computation.slurm.ScontrolCmd$ScontrolResult.<init>(ScontrolCmd.java:44)
    at com.powsybl.computation.slurm.ScontrolCmd.send(ScontrolCmd.java:35)
    at com.powsybl.computation.slurm.AbstractTask.generateReport(AbstractTask.java:137)
    at com.powsybl.computation.slurm.AbstractTask.await(AbstractTask.java:130)
    at com.powsybl.computation.slurm.SlurmComputationManager.doExecute(SlurmComputationManager.java:277)
    at com.powsybl.computation.slurm.SlurmComputationManager.lambda$execute$0(SlurmComputationManager.java:220)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at com.powsybl.computation.CompletableFutureTask.run(CompletableFutureTask.java:43)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
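
The second trace shows `Integer.parseInt` being given `"12-39"`, which looks like a range of array task IDs rather than a single ID. A minimal sketch of tolerant parsing follows. It is not the fix applied in the project: the class `ArrayTaskIdParser` and its `parse` method are hypothetical, and it assumes the value is either a single integer, a `start-end` range, or a comma-separated mix of both. Other forms the field may take are not handled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of powsybl-hpc.
public class ArrayTaskIdParser {

    // Expands "7" to [7], "12-39" to [12, ..., 39] and "1,3,5-7" to [1, 3, 5, 6, 7].
    // Assumption: only plain integers and inclusive start-end ranges appear.
    static List<Integer> parse(String value) {
        List<Integer> ids = new ArrayList<>();
        for (String part : value.split(",")) {
            String token = part.trim();
            int dash = token.indexOf('-');
            if (dash < 0) {
                // Single task ID
                ids.add(Integer.parseInt(token));
            } else {
                // Inclusive range of task IDs
                int start = Integer.parseInt(token.substring(0, dash));
                int end = Integer.parseInt(token.substring(dash + 1));
                for (int i = start; i <= end; i++) {
                    ids.add(i);
                }
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(parse("12-39").size()); // prints 28
    }
}
```

Returning a list keeps the caller free to decide whether a range maps to one report entry or to one entry per task.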

TODO

In general, to allow proper error handling by the client, a non-zero exit code must not cause an exception; it should be reported as an ExecutionError. The problem seems to be that no command is associated with the job ID retrieved from the scontrol result, here:

    @Override
    ExecutionError convertScontrolResult2Error(ScontrolCmd.ScontrolResultBean scontrolResultBean) {
        return new ExecutionError(commandByJobId.get(scontrolResultBean.getJobId()), scontrolResultBean.getArrayTaskId(), scontrolResultBean.getExitCode());
    }

Maybe in that case it's better to simply not report an ExecutionError?
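
One way to read the suggestion above is to return an empty result when no command is known for the job ID, so that nothing is thrown. This is only a sketch under stated assumptions, not the change made in the project: `MissingCommandSketch`, `SimpleError` and `toError` are hypothetical stand-ins, and commands are represented as plain strings instead of the real command type.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.Optional;

// Hypothetical sketch, not part of powsybl-hpc.
public class MissingCommandSketch {

    // Stand-in for an error type whose constructor rejects a null command,
    // as the first stack trace suggests ExecutionError does.
    static final class SimpleError {
        final String command;
        final int index;
        final int exitCode;

        SimpleError(String command, int index, int exitCode) {
            this.command = Objects.requireNonNull(command);
            this.index = index;
            this.exitCode = exitCode;
        }
    }

    // Returns an error only when a command is registered for the job ID;
    // otherwise returns Optional.empty() instead of throwing.
    static Optional<SimpleError> toError(Map<Long, String> commandByJobId,
                                         long jobId, int arrayTaskId, int exitCode) {
        String command = commandByJobId.get(jobId);
        if (command == null) {
            return Optional.empty();
        }
        return Optional.of(new SimpleError(command, arrayTaskId, exitCode));
    }

    public static void main(String[] args) {
        Map<Long, String> commandByJobId = new HashMap<>();
        commandByJobId.put(100L, "simulate");
        System.out.println(toError(commandByJobId, 100L, 3, 1).isPresent()); // prints true
        System.out.println(toError(commandByJobId, 999L, 3, 1).isPresent()); // prints false
    }
}
```

The caller would then collect only the present values into the report; whether silently skipping unknown job IDs is acceptable, or whether they should at least be logged, remains an open design choice.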

Proper error handling by the computation manager user.

sylvlecl commented 3 years ago

Fixed in #58