quaquel / EMAworkbench

workbench for performing exploratory modeling and analysis
BSD 3-Clause "New" or "Revised" License

Platypus and MPIEvaluator Issue #370

Open pollockDeVis opened 1 month ago

pollockDeVis commented 1 month ago

@quaquel: I encountered the following issue while running an optimization on 60 cores on the HPC. The run crashed when the optimization was 87% complete. A ValueError in Platypus core was raised first, and while it was being handled, cleanup failed with AttributeError: 'MPIEvaluator' object has no attribute 'logwatcher_thread'. EMA Workbench version: 2.4.1

 87%|████████████████████████▏   | 129901/150000 [83:50:59<11:59:20,  2.15s/it]
 87%|████████████████████████▎   | 130393/150000 [84:08:48<11:44:06,  2.15s/it]Traceback (most recent call last):
  File "/scratch/palokbiswas/Repo/JUSTICE/analysis/analyzer.py", line 213, in run_optimization_adaptive
    results = evaluator.optimize(
  File "/scratch/palokbiswas/Repo/JUSTICE/src/ema-workbench/ema_workbench/em_framework/evaluators.py", line 228, in optimize
    return optimize(
  File "/scratch/palokbiswas/Repo/JUSTICE/src/ema-workbench/ema_workbench/em_framework/evaluators.py", line 576, in optimize
    return _optimize(
  File "/scratch/palokbiswas/Repo/JUSTICE/src/ema-workbench/ema_workbench/em_framework/optimization.py", line 1101, in _optimize
    optimizer.run(nfe)
  File "/home/palokbiswas/.local/lib/python3.9/site-packages/platypus/core.py", line 410, in run
    self.step()
  File "/home/palokbiswas/.local/lib/python3.9/site-packages/platypus/algorithms.py", line 1521, in step
    self.algorithm.step()
  File "/home/palokbiswas/.local/lib/python3.9/site-packages/platypus/algorithms.py", line 182, in step
    self.iterate()
  File "/home/palokbiswas/.local/lib/python3.9/site-packages/platypus/algorithms.py", line 212, in iterate
    self.archive.extend(self.population)
  File "/home/palokbiswas/.local/lib/python3.9/site-packages/platypus/core.py", line 805, in extend
    self.append(solution)
  File "/home/palokbiswas/.local/lib/python3.9/site-packages/platypus/core.py", line 801, in append
    self.add(solution)
  File "/home/palokbiswas/.local/lib/python3.9/site-packages/platypus/core.py", line 979, in add
    flags = [self._dominance.compare(solution, s) for s in self._contents]
  File "/home/palokbiswas/.local/lib/python3.9/site-packages/platypus/core.py", line 979, in <listcomp>
    flags = [self._dominance.compare(solution, s) for s in self._contents]
  File "/home/palokbiswas/.local/lib/python3.9/site-packages/platypus/core.py", line 713, in compare
    i1 = math.floor(o1 / epsilon)
ValueError: cannot convert float NaN to integer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/scratch/palokbiswas/Repo/JUSTICE/hpc_run.py", line 15, in <module>
    run_optimization_adaptive(n_rbfs=4, n_inputs=2, nfe=nfe, swf=swf, seed=seed)
  File "/scratch/palokbiswas/Repo/JUSTICE/analysis/analyzer.py", line 213, in run_optimization_adaptive
    results = evaluator.optimize(
  File "/scratch/palokbiswas/Repo/JUSTICE/src/ema-workbench/ema_workbench/em_framework/evaluators.py", line 109, in __exit__
    self.finalize()
  File "/scratch/palokbiswas/Repo/JUSTICE/src/ema-workbench/ema_workbench/util/ema_logging.py", line 153, in wrapper
    res = func(*args, **kwargs)
  File "/scratch/palokbiswas/Repo/JUSTICE/src/ema-workbench/ema_workbench/em_framework/futures_mpi.py", line 213, in finalize
    self.logwatcher_thread.join(timeout=60)
AttributeError: 'MPIEvaluator' object has no attribute 'logwatcher_thread'

 87%|████████████████████████▎   | 130393/150000 [84:26:28<12:41:50,  2.33s/it]
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[29705,1],0]
  Exit code:    1
--------------------------------------------------------------------------
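
The second traceback looks like a secondary failure: the `ValueError` leaves the `with` block, `__exit__` calls `finalize()`, and `finalize()` reads an attribute that was apparently never set. Below is a minimal sketch of that pattern with one defensive fix. `SketchEvaluator` is a hypothetical stand-in, not the workbench's actual `MPIEvaluator` code, and the assumption that the watcher thread is only created on some code paths is mine.

```python
import threading


class SketchEvaluator:
    """Hypothetical stand-in for an evaluator used as a context manager."""

    def __init__(self, start_watcher=False):
        # Assumption: the watcher thread is only created on some code paths,
        # so the attribute may be missing when finalize() runs.
        if start_watcher:
            self.logwatcher_thread = threading.Thread(target=lambda: None)
            self.logwatcher_thread.start()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.finalize()
        return False  # never swallow the original exception

    def finalize(self):
        # getattr with a default avoids an AttributeError during cleanup
        thread = getattr(self, "logwatcher_thread", None)
        if thread is not None:
            thread.join(timeout=60)


def run_failing_job(start_watcher=False):
    """Raise inside the with-block and return the exception that escapes."""
    try:
        with SketchEvaluator(start_watcher=start_watcher):
            raise ValueError("cannot convert float NaN to integer")
    except Exception as exc:
        return exc
```

With the guard in place, the exception that escapes is the original `ValueError`, so the log points at the real cause instead of at the cleanup code.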

EwoutH commented 1 month ago

Thanks for reporting this potential issue. Could you:

  1. Update to the latest release of the EMAworkbench, 2.5.2
  2. If the issue persists, create a Minimal, Reproducible Example

Edit: A run crashing after 84 hours on an HPC, I feel you. Could you salvage any results?

@quaquel maybe we should add checkpoint functionality that saves results every nth iteration or every n% of runs.
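
A rough sketch of what such checkpointing could look like, independent of the workbench. All names here (`save_checkpoint`, `load_checkpoint`, `run_with_checkpoints`, the `step` callable and the layout of the state dict) are hypothetical and only illustrate the idea of saving every nth iteration and resuming from the last saved state.

```python
import os
import pickle
import tempfile


def save_checkpoint(state, path):
    """Write state to path atomically so a crash never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as fh:
            pickle.dump(state, fh)
        os.replace(tmp_path, path)  # atomic rename on the same filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as fh:
        return pickle.load(fh)


def run_with_checkpoints(step, total_iterations, path, every=100):
    """Run step(state) -> state, saving every `every` iterations.

    `step` is a hypothetical callable that advances the optimization by one
    iteration; the state dict must be picklable.
    """
    state = load_checkpoint(path) or {"iteration": 0, "results": []}
    while state["iteration"] < total_iterations:
        state = step(state)
        state["iteration"] += 1
        if state["iteration"] % every == 0:
            save_checkpoint(state, path)
    save_checkpoint(state, path)  # always save the final state
    return state
```

Calling `run_with_checkpoints` again with the same `path` picks up from the last saved iteration, so a crash costs at most `every` iterations of work.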

quaquel commented 1 month ago

Checkpointing and restarts are indeed urgently needed.

pollockDeVis commented 1 month ago

The error is random, and it becomes more likely when the job requests more than 50 cores via --ntasks. I have run a few more jobs on the same experiment since then, and they worked. At 60 cores, the error starts to appear more often.

quaquel commented 1 month ago

It's strange. The error seems to occur within platypus. So, it should not be related to the number of cores or the nature of parallelization. At most, it might relate to the total number of function evaluations.
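
For reference, the first traceback ends in `math.floor(o1 / epsilon)`, which raises exactly this `ValueError` when an objective value is NaN. So it may be worth checking whether the model occasionally returns a NaN outcome. A minimal sketch of reproducing the error and screening objectives before they reach the archive follows. `epsilon_box_index`, `sanitize_objectives` and `find_non_finite` are hypothetical helpers, not Platypus or workbench functions, and the penalty value assumes a minimization problem.

```python
import math


def epsilon_box_index(value, epsilon):
    """Mirror the failing line in the traceback: floor(value / epsilon)."""
    return math.floor(value / epsilon)


def sanitize_objectives(objectives, penalty=1e12):
    """Replace NaN/inf objectives with a large penalty (assumes minimization)."""
    return [o if math.isfinite(o) else penalty for o in objectives]


def find_non_finite(objectives):
    """Return indices of objectives that are NaN or infinite, for logging."""
    return [i for i, o in enumerate(objectives) if not math.isfinite(o)]
```

Logging the output of `find_non_finite` together with the corresponding decision variables would show which inputs produce the NaN, which is more informative than silently applying the penalty.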