redmod-team / profit

Probabilistic Response mOdel Fitting with Interactive Tools
https://profit.readthedocs.io
MIT License

ZeroMQ interface loses connection #158

Closed mkendler closed 2 years ago

mkendler commented 2 years ago

During a parameter study with ActiveLearning, the ZeroMQ interface lost its connection for points with a longer runtime (approx. 45 min). The connection was lost after the run had completed, and the run was then aborted. The interface worked correctly for runs with a shorter runtime of a few minutes.


The study was executed on the Marconi cluster with the following profit.yaml:

# Date: 2022-01-17
# GORILLA: internal version with mono_energetic_transp_main.x
# proFit: @mkendler - 2022_01_14_marconi 
# target system: Marconi
#
# Task:
# AL preview

ntrain: 60
variables:
    # normalized collisionality
    nu_star: ActiveLearning(1e-4, 1e-1)
    # mach number
    v_E: ActiveLearning(0, 2e-4)
    # Energy in eV
    E: 3000
    # particle species (1 = electrons, 2 = deuterium ions)
    species: 1
    # number of particles (for the monte carlo simulation)
    n_particles: 200
    # mono energetic radial diffusion coefficient
    D11: Output
    D11_std: Output

run:
    runner:
        class: slurm
        OpenMP: True
        cpus: all
        options:
            job-name: profit-gorilla
            account: FUA35_TSVVSTOP
            # skl_usr_dbg: no QOS, min 1 node, max 4 nodes, max 30:00, dedicated nodes
            # skl_usr_prod: no QOS, min 1 node, max 32 nodes, max 24:00:00
            #             : qos_lowprio, max 64 nodes, max 24:00:00
            partition: skl_fua_prod
            #qos: qos_lowprio
            time: 01:20:00
    interface:
        class: zeromq
        port: 9100
    pre:
        class: template
        path: ./template
        param_files: [mono_energetic_transp_coef.inp, gorilla.inp]
    post:
        class: numpytxt
        path: nustar_diffcoef_std.dat
        names: "IGNORE D11 D11_std"
    command: ./mono_energetic_transp_main.x
    clean: False

fit:
    surrogate: CoregionalizedGPy
    load: model.pkl

active_learning:
    algorithm:
        class: simple
        acquisition_function:
            class: simple_exploration
    nwarmup: 20
    batch_size: 5
    resume_from: 40

Rykath commented 2 years ago

Thanks. I have added a preliminary backup method that the worker uses if it cannot reconnect, so that data isn't lost. This will probably have to be redone as part of a larger refactor of the run system.
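
To illustrate the general idea of such a fallback, here is a minimal sketch. It is not the actual proFit implementation: the function name `transmit_or_backup`, the `send` callable, and the JSON backup file are all assumptions made for illustration. The worker tries to deliver its result a few times and, if the interface stays unreachable, writes the result to a local file instead of discarding it.

```python
import json
import time
from pathlib import Path


def transmit_or_backup(send, payload, backup_path, retries=3, delay=0.0):
    """Try to deliver `payload` via `send`; write it to disk if all attempts fail.

    `send` is a hypothetical callable standing in for the interface's
    transmit step. It is assumed to raise ConnectionError or TimeoutError
    when the connection is unavailable.
    Returns "sent" on success and "backup" if the fallback file was written.
    """
    for attempt in range(retries):
        try:
            send(payload)
            return "sent"
        except (ConnectionError, TimeoutError):
            # Wait before the next attempt, but not after the last one.
            if attempt < retries - 1:
                time.sleep(delay)

    # Every attempt failed: keep the result locally so it is not lost.
    Path(backup_path).write_text(json.dumps(payload))
    return "backup"
```

With this approach, a long run that finishes after the connection has dropped still leaves its outputs (for example `D11` and `D11_std`) on disk, where they can be collected manually afterwards.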