During a parameter study with ActiveLearning, the ZeroMQ interface lost its connection and aborted for points with a longer runtime (approx. 45 min), after the run itself was complete. The interface worked correctly for runs with a shorter runtime of a few minutes.
The study was executed on the Marconi cluster with the following profit.yaml:
# Date: 2022-01-17
# GORILLA: internal version with mono_energetic_transp_main.x
# proFit: @mkendler - 2022_01_14_marconi
# target system: Marconi
#
# Task:
# AL preview
ntrain: 60
variables:
    # normalized collisionality
    nu_star: ActiveLearning(1e-4, 1e-1)
    # mach number
    v_E: ActiveLearning(0, 2e-4)
    # Energy in eV
    E: 3000
    # particle species (1 = electrons, 2 = deuterium ions)
    species: 1
    # number of particles (for the monte carlo simulation)
    n_particles: 200
    # mono energetic radial diffusion coefficient
    D11: Output
    D11_std: Output
run:
    runner:
        class: slurm
        OpenMP: True
        cpus: all
        options:
            job-name: profit-gorilla
            account: FUA35_TSVVSTOP
            # skl_usr_dbg: no QOS, min 1 node, max 4 nodes, max 30:00, dedicated nodes
            # skl_usr_prod: no QOS, min 1 node, max 32 nodes, max 24:00:00
            # : qos_lowprio, max 64 nodes, max 24:00:00
            partition: skl_fua_prod
            #qos: qos_lowprio
            time: 01:20:00
    interface:
        class: zeromq
        port: 9100
    pre:
        class: template
        path: ./template
        param_files: [mono_energetic_transp_coef.inp, gorilla.inp]
    post:
        class: numpytxt
        path: nustar_diffcoef_std.dat
        names: "IGNORE D11 D11_std"
    command: ./mono_energetic_transp_main.x
    clean: False
fit:
    surrogate: CoregionalizedGPy
    load: model.pkl
active_learning:
    algorithm:
        class: simple
        acquisition_function:
            class: simple_exploration
    nwarmup: 20
    batch_size: 5
    resume_from: 40
Thanks. I have added a preliminary backup method for the worker if it cannot reconnect so that data isn't lost. This will probably have to be redone with a larger refactor of the run system.
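For illustration, the idea of a worker-side backup can be sketched roughly as follows. This is not the code that was added to proFit; it is a minimal sketch in which `transmit_with_backup`, the `send` callable, the retry count and the JSON backup file are all hypothetical names and choices made for the example. It only assumes that a failed transmission surfaces as a `ConnectionError` or `TimeoutError`.

```python
import json
import time
from pathlib import Path


def transmit_with_backup(send, result, backup_path, retries=3, delay=0.0):
    """Try to send `result`; write it to `backup_path` if every attempt fails.

    `send` is a hypothetical callable standing in for the interface's
    transmit step. It is assumed to raise ConnectionError or TimeoutError
    when the connection is gone.
    Returns True if the result was sent, False if it was backed up instead.
    """
    for attempt in range(retries):
        try:
            send(result)
            return True
        except (ConnectionError, TimeoutError):
            # Wait before the next attempt, but not after the last one.
            if attempt < retries - 1:
                time.sleep(delay)
    # All attempts failed: keep the data on disk so the run is not wasted.
    Path(backup_path).write_text(json.dumps(result))
    return False
```

With a scheme like this, a 45 min run whose result cannot be delivered would leave a file in the run directory that can be collected afterwards, instead of discarding the computed `D11` and `D11_std` values.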