project8 / hercules

Python package for scripting our simulation workflows

Parallel job submission for desktop mode does not work. #11

Closed charstnut closed 3 years ago

charstnut commented 3 years ago

It seems that if the number of desktop jobs is greater than 1, Locust inside the container fails before completing. Here is the log output from Locust.

This is KASPER v3.7.5 (build: 2021-02-24T04:53:24Z) [git:n/a]
** KASPER source directory set to /tmp_source/kassiopeia
** KASPER install directory set to /usr/local/p8/locust/v2.2.0
** KASPER config directory set to /home/LocustPhase3Template.json
** Setting KEMField cache to /usr/local/p8/locust/v2.2.0/cache/KEMField

RooFit v3.60 -- Developed by Wouter Verkerke and David Kirkby
                Copyright (C) 2000-2013 NIKHEF, University of California & Stanford University
                All rights reserved, please read http://roofit.sourceforge.net/license.txt

Error in cling::AutoloadingVisitor::InsertIntoAutoloadingState:
   Missing FileEntry for LMCTrack.hh
   requested to autoload type locust::Track
2021-03-24 23:52:52 [ PROG] ons/LocustSim.cc(32): Welcome to Locust_MC

                 (                                     *
                 )\ )                         )      (  `      (
                (()/(              (       ( /(      )\))(     )\
                 /(_))  (    (    ))\  (   )\())    ((_)()\  (((_)
                (_))    )\   )\  /((_) )\ (_))/     (_()((_) )\___
                | |    ((_) ((_)(_))( ((_)| |_      |  \/  |((/ __|
                | |__ / _ \/ _| | || |(_-<|  _|     | |\/| | | (__
                |____|\___/\__|  \_,_|/__/ \__|_____|_|  |_|  \___|
                                              |_____|

2021-03-24 23:52:52 [ PROG] /configurator.cc(158): Final configuration:

{
    array-signal :
    {
        array-radius : 0.100000
        element-spacing : 0.007753
        event-spacing-samples : 15000
        lo-frequency : 25878100000.000000
        n-subarrays : 1
        nelements-per-strip : 5
        power-combining-feed : slotted-waveguide
        tf-receiver-bin-width : 10000000.000000
        tf-receiver-filename : /tmp/Phase3/TransferFunctions/FiveSlotTF.txt
        transmitter : kassiopeia
        voltage-check : false
        xml-filename : /home/LocustKassElectrons.xml
        zshift-array : 0.000000
    }

    decimate-signal :
    {
    }

    digitizer :
    {
        v-offset : -0.000000
        v-range : 0.000000
    }

    gaussian-noise :
    {
        domain : time
        random-seed : 42
    }

    generators :
    [
        array-signal
        lpf-fft
        decimate-signal
        digitizer
    ]

    lpf-fft :
    {
    }

    simulation :
    {
        egg-filename : /home/simulation.egg
        n-channels : 60
        n-records : 1
        record-size : 300000
    }

}

2021-03-24 23:52:52 [ PROG] ons/LocustSim.cc(36): Setting up generator toolbox
2021-03-24 23:52:52 [ PROG] ons/LocustSim.cc(44): Setting up simulation controller
2021-03-24 23:52:52 [ PROG] ons/LocustSim.cc(53): Preparing for run
2021-03-24 23:52:52 [ PROG] ons/LocustSim.cc(60): Beginning simulation run
[KSMAIN NORMAL MESSAGE] starting ...
[KEMFIELD NORMAL MESSAGE] Cannot access directory /tmp_source/build/cache/KEMField
[KSMAIN WARNING MESSAGE] legacy binding for magnetic field <field_electromagnet> is DEPRECATED - please move objects to <kemfield> tag
[KSMAIN WARNING MESSAGE] legacy binding for magnetic field <field_magnetic_main> is DEPRECATED - please move objects to <kemfield> tag
[KEMFIELD WARNING MESSAGE] Added 5 elements to existing coaxial group (coax. tolerance 1e-10)
[KEMFIELD NORMAL MESSAGE] Computing central source points for MagnetostaticBasis along the local z-axis from -0.0005 to 0.185028.
2021-03-24 23:52:52 [ PROG] gnalGenerator.cc(360): LMC about to wait
[KEMFIELD NORMAL MESSAGE] 339 central source points have been computed
[KEMFIELD NORMAL MESSAGE] [100%]
[KEMFIELD NORMAL MESSAGE] Computing 3 remote source points for MagnetostaticBasis along the local z-axis from -0.1 to 0.1.
[KEMFIELD NORMAL MESSAGE] [100%]
[KSMAIN NORMAL MESSAGE] ☻   welcome to Kassiopeia 3.7.4  ☻
****[KSRUN NORMAL MESSAGE] processing run 0 ...
2021-03-24 23:52:53 [ PROG] /LMCEventHold.cc(40): Kass is waiting for event trigger
2021-03-24 23:52:53 [ PROG] gnalGenerator.cc(625): LMC ReceivedKassReady
2021-03-24 23:52:53 [ PROG] gnalGenerator.cc(638): LMC about to WakeBeforeEvent()
2021-03-24 23:52:53 [ PROG] /LMCEventHold.cc(50): Kass got the event trigger
********[KSEVENT NORMAL MESSAGE] processing event 0 <gen_uniform> ...
************[KSTRACK NORMAL MESSAGE] processing track 0 <gen_uniform> ...
****************[KSSTEP NORMAL MESSAGE] processing step 1000 ... (z = 4.84377e-05, r = 3.38574e-07, k = 18600, e = 18600
****************[KSSTEP NORMAL MESSAGE] processing step 2000 ... (z = -9.68446e-05, r = 6.77154e-07, k = 18600, e = 1860
****************[KSSTEP NORMAL MESSAGE] processing step 4000 ... (z = -0.000193442, r = 1.35433e-06, k = 18600, e = 1860
****************[KSSTEP NORMAL MESSAGE] processing step 6000 ... (z = -0.000289545, r = 2.03154e-06, k = 18600, e = 1860
****************[KSSTEP NORMAL MESSAGE] processing step 7000 ... (z = 0.000337334, r = 2.37015e-06, k = 18600, e = 18600
****************[KSSTEP NORMAL MESSAGE] processing step 8000 ... (z = -0.000384908, r = 2.70877e-06, k = 18600, e = 1860
****************[KSSTEP NORMAL MESSAGE] processing step 9000 ... (z = 0.000432236, r = 3.04739e-06, k = 18600, e = 18600
****************[KSSTEP NORMAL MESSAGE] processing step 10000 ... (z = -0.000479288, r = 3.38602e-06, k = 18600, e = 186
****************[KSSTEP NORMAL MESSAGE] processing step 11000 ... (z = 0.000526033, r = 3.72466e-06, k = 18600, e = 1860
************[KSTRACK NORMAL MESSAGE] ...completed track 0 <term_max_time> after 11513 steps at <-0.000346072 0.000141719 -0.00185618>
2021-03-24 23:53:47 [ PROG] /LMCEventHold.cc(67): Kass is waking after event
********[KSEVENT NORMAL MESSAGE] ...completed event 0 <gen_uniform>  in 55 s (ca. 0 s left)
****[KSRUN NORMAL MESSAGE] ...run 0 complete
[KSMAIN NORMAL MESSAGE] finished!
[KSWRITER NORMAL MESSAGE] ROOT output was written to file </home/Phase3Seed43Output.root>
[INITIALIZATION WARNING MESSAGE] It took 55.5 s to process the file </home/LocustKassElectrons.xml>
[KSMAIN NORMAL MESSAGE] ... finished
2021-03-24 23:53:47 [ PROG] gnalGenerator.cc(360): LMC about to wait
2021-03-24 23:53:47 [ PROG] gnalGenerator.cc(625): LMC ReceivedKassReady
2021-03-24 23:53:47 [ PROG] gnalGenerator.cc(638): LMC about to WakeBeforeEvent()
2021-03-24 23:53:48 [ PROG] gnalGenerator.cc(691): Finished signal loop.
/workingdir/locustcommands.sh: line 4:    58 Killed                  LocustSim config=$1

charstnut commented 3 years ago

I have hotfixed this on the new branch with the config improvements. Noting it here for reference.

MCFlowMace commented 3 years ago

I cannot reproduce that error on my desktop. Which script are you running? How many parallel jobs did you want to use, and how much RAM do you have?

charstnut commented 3 years ago

@MCFlowMace The script for generation is pasted below. Can you pull the hotfix I posted and test whether it works on your desktop? The fix basically moves the locustcommands.sh script into each subdirectory where the sim output goes. I think the problem is related to two Docker containers calling the same script loaded from a mounted volume. The parallel job count is 2 in this case.

### This file generates the local template set.
import os
import hercules as he
import numpy as np
from pathlib import Path

config_file = Path(os.path.dirname(__file__)).joinpath("hercules_config.ini")
working_dir = Path("/").joinpath("mnt", "d", "Data")
# working_dir = Path("~/TempData").expanduser()
sim = he.KassLocustP3(working_dir, config_file)

# Does the number of channels automatically form a ring? Seems yes

n_channels = 60
r_range = np.linspace(0.002, 0.008, 8)
theta_range = np.linspace(89.7, 90.0, 30)

# r_range = np.linspace(0, 0.01, 1)
# theta_range = np.linspace(89.7, 90.0, 2)
r_phi_range = np.linspace(0, 2 * np.pi / 60, 1)

config_list = []

for theta in theta_range:
    for r_phi in r_phi_range:
        for r in r_range:
            x = r * np.cos(r_phi)
            y = r * np.sin(r_phi)
            r_phi_deg = np.rad2deg(r_phi)
            name = "Sim_theta_{:.4f}_R_{:.4f}_phi_{:.4f}".format(
                theta, r, r_phi_deg)
            config = he.SimConfig(name,
                                  n_channels=n_channels,
                                  seed_locust=42,
                                  seed_kass=43,
                                  egg_filename="simulation.egg",
                                  x_min=x,
                                  x_max=x,
                                  y_min=y,
                                  y_max=y,
                                  z_min=0,
                                  z_max=0,
                                  theta_min=theta,
                                  theta_max=theta,
                                  t_max=5e-6,
                                  v_range=3.0e-7,
                                  presample_spacing=150000,
                                  geometry='FreeSpaceGeometry_V00_00_10.xml')
            config_list.append(config)

sim(config_list)
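
For reference, the hotfix idea described above (one copy of locustcommands.sh per output subdirectory, so that parallel containers never share the same script file) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in the hotfix branch: the helper name `stage_command_script` and the example directory names are hypothetical.

```python
import shutil
from pathlib import Path
from typing import Iterable, List


def stage_command_script(script: Path, job_dirs: Iterable[Path]) -> List[Path]:
    """Copy `script` into every job directory and return the copied paths.

    Hypothetical helper: each job gets a private copy of the command
    script instead of all jobs reading one shared file.
    """
    staged = []
    for job_dir in job_dirs:
        # Create the per-job output directory if it does not exist yet.
        job_dir.mkdir(parents=True, exist_ok=True)
        target = job_dir / script.name
        # copy2 copies the file content together with its permission bits,
        # so an executable script stays executable.
        shutil.copy2(script, target)
        staged.append(target)
    return staged
```

A caller would then hand each container the copy inside its own job directory, for example `stage_command_script(Path("locustcommands.sh"), [working_dir / "job_0", working_dir / "job_1"])`, where the `job_0` and `job_1` names are made up for this example.
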

MCFlowMace commented 3 years ago

For me the script above runs fine even without your fix. So, again: how much RAM do you have? Other than that, it could indeed be due to the mounted drive, which I don't have. By the way, do you run it on Linux or with WSL?

Could you isolate your hotfix from your other work on a new branch that we can merge separately?

charstnut commented 3 years ago

I have 16 GB of RAM, and it is not fully utilized when I apply the hotfix and run 2 jobs in parallel. I've also tested the mounted-drive theory: I am using WSL, and even when the path is something like ~/TempData the script still did not work. I am going to create a hotfix branch for this issue. If the hotfix works on your end as well, we can merge it then. I think this might be a bug in Docker for Windows with the WSL environment, but I'm not sure, since you can't reproduce it.
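
A side note on the `Killed` line at the end of the log: the shell prints it when LocustSim is terminated by SIGKILL, and running out of memory is one common cause, which is presumably why RAM came up in this thread. When jobs are launched from Python, the exit status can be decoded along these lines. This is a minimal sketch; `describe_exit` is a hypothetical helper, not part of hercules, and it relies on the usual conventions that `subprocess` reports a signal as a negative return code and that shells report it as 128 plus the signal number.

```python
import signal


def describe_exit(returncode: int) -> str:
    """Turn a process return code into a short human-readable description."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # subprocess convention: -N means "terminated by signal N".
        sig = -returncode
    elif returncode > 128:
        # Shell convention: 128 + N usually means "terminated by signal N".
        sig = returncode - 128
    else:
        return f"exited with status {returncode}"
    try:
        name = signal.Signals(sig).name
    except ValueError:
        # Not a known signal number on this platform.
        name = f"signal {sig}"
    return f"terminated by {name}"
```

On Linux, both `describe_exit(-9)` and `describe_exit(137)` report SIGKILL. A SIGKILL alone does not prove an out-of-memory condition, so it is only a hint about where to look next.
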