Open pollockDeVis opened 1 month ago
Thanks for reporting this potential issue. Could you:
Edit: A run crashing after 84 hours on an HPC, I feel you. Could you salvage any results?
@quaquel maybe we should add checkpoint functionality that saves results every nth iteration or every n% of runs.
checkpointing and restarts are indeed urgently needed
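To make the checkpoint idea concrete, here is a minimal, generic sketch of "save every nth iteration, resume on restart". It is not EMA Workbench or Platypus code: `save_checkpoint`, `load_checkpoint`, `run_with_checkpoints`, the `step` callable, and the pickle file layout are all hypothetical names chosen for illustration, and it assumes the accumulated results can be pickled.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file, then rename, so a crash mid-write
    cannot leave a half-written checkpoint behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "wb") as fh:
        pickle.dump(state, fh)
    os.replace(tmp_path, path)  # atomic rename on the same filesystem


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as fh:
        return pickle.load(fh)


def run_with_checkpoints(step, n_iterations, path, every=10):
    """Call step(i) for each iteration, saving progress every `every` iterations.

    `step` is a hypothetical stand-in for one unit of work
    (for example one generation of an optimizer).
    """
    state = load_checkpoint(path)
    if state is None:
        state = {"iteration": 0, "results": []}

    # Resume from the last saved iteration instead of starting over.
    for i in range(state["iteration"], n_iterations):
        state["results"].append(step(i))
        state["iteration"] = i + 1
        if state["iteration"] % every == 0:
            save_checkpoint(path, state)

    save_checkpoint(path, state)  # final state, even if not on a boundary
    return state["results"]
```

With something like this, a crash late in a run would lose at most `every` iterations of work, and rerunning the same call with the same `path` would pick up from the last checkpoint. A real implementation for the optimizer would also need to store the algorithm state (population, archive, random seed), which this sketch does not attempt.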
The error is random, and it becomes more likely when --ntasks on the HPC requests more than 50 cores. I have run a few more jobs on the same experiment since then, and they worked. At 60 cores, the error starts to show up more often.
It's strange. The error seems to occur within Platypus, so it should not be related to the number of cores or the nature of the parallelization. At most, it might relate to the total number of function evaluations.
@quaquel: I encountered the following issue while running an optimization on 60 cores on the HPC. It crashed when the optimization was 87% complete. A ValueError in Platypus core triggered AttributeError: 'MPIEvaluator' object has no attribute 'logwatcher_thread'. EMA Workbench version: 2.4.1
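One possible reading of the two stacked errors, sketched below, is that the ValueError is the real failure and the AttributeError comes from cleanup code touching an attribute that was never set. This is only an assumption about the general pattern: `SketchEvaluator` and `optimize` are hypothetical stand-ins, not the actual MPIEvaluator or Platypus code, which I have not checked. The sketch shows a cleanup path that tolerates the missing attribute so the original exception stays visible.

```python
import threading


class SketchEvaluator:
    """Hypothetical stand-in for an evaluator used as a context manager."""

    def __init__(self, start_logwatcher):
        self.start_logwatcher = start_logwatcher

    def __enter__(self):
        # The attribute only exists if the thread was actually started.
        if self.start_logwatcher:
            self.logwatcher_thread = threading.Thread(target=lambda: None)
            self.logwatcher_thread.start()
        return self

    def __exit__(self, exc_type, exc, tb):
        # getattr with a default avoids raising AttributeError during
        # cleanup, which would otherwise be reported on top of the
        # original error from the with-block.
        thread = getattr(self, "logwatcher_thread", None)
        if thread is not None:
            thread.join()
        return False  # never swallow the original exception


def optimize(evaluator):
    """Hypothetical optimizer call that fails partway through."""
    raise ValueError("simulated failure inside the optimizer")
```

If the real cause follows this pattern, the ValueError from Platypus is the error worth chasing, and the AttributeError is a secondary symptom of the shutdown path.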