antonst opened this issue 9 years ago
Hi Antons, can you give me a rough estimate of the total number of CUs that would be created, and of the number of concurrent CUs you would expect at any time? Thanks!
Hi Mark: concurrent CUs = 1000; total CUs per cycle = 6000 (max 1000 CUs per submission, each submission followed by a wait call); total cycles = 4; total cores = 1000.
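Spelled out, those numbers imply the following totals (simple arithmetic on the figures quoted above):

```python
# Quick sanity check of the run dimensions quoted above.
concurrent_cus = 1000   # max CUs in flight (one submission batch, then a wait)
cus_per_cycle  = 6000
cycles         = 4

total_cus = cus_per_cycle * cycles            # 24000 CUs over the whole run
batches   = cus_per_cycle // concurrent_cus   # 6 submit/wait batches per cycle
```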
Thanks Antons. Do you know in which batch this happens? Also, could you please provide details on how to reproduce it? The problem seems clear ("too many open files" is a plausible error) -- but the cause is not, i.e., I am not sure why so many files should be open, so I would need to debug this in some detail...
Andre, you can reproduce it by running examples/amber_pattern_b_3d_tuu (feature/2d-prof branch of repex), using ace_ala_nme_input.json as the input file:
{
    "input.PILOT": {
        "resource": "stampede.tacc.utexas.edu",
        "username": <user>,
        "project": <project>,
        "runtime": "520",
        "cleanup": "False",
        "cores": "1000"
    },
    "input.MD": {
        "number_of_cycles": "3",
        "input_folder": "amber_inp",
        "input_file_basename": "ace_ala_nme_remd",
        "amber_input": "ace_ala_nme.mdin",
        "amber_parameters": "ace_ala_nme.parm7",
        "amber_coordinates": "ace_ala_nme.inpcrd",
        "us_template": "ace_ala_nme_us.RST",
        "replica_mpi": "False",
        "replica_cores": "1",
        "steps_per_cycle": "6000",
        "exchange_off": "True"
    },
    "input.DIM": {
        "temperature_2": {
            "number_of_replicas": "10",
            "min_temperature": "300",
            "max_temperature": "600",
            "exchange_replica_cores": "1",
            "exchange_replica_mpi": "False"
        },
        "umbrella_sampling_1": {
            "number_of_replicas": "10",
            "us_start_param": "0",
            "us_end_param": "360",
            "exchange_replica_cores": "1",
            "exchange_replica_mpi": "False"
        },
        "umbrella_sampling_3": {
            "number_of_replicas": "10",
            "us_start_param": "0",
            "us_end_param": "360",
            "exchange_replica_cores": "1",
            "exchange_replica_mpi": "False"
        }
    }
}
Which are those open files actually?
During the pre-exec of the first MD cycle, the .mdin files are opened and their placeholders are replaced with actual parameters. So 1000 files are open concurrently, which possibly causes this error. I think this is the most likely reason.
Thanks, Antons
The operating system limits both (a) the number of files a single process can open (usually 256 to 4096) and (b) the total number of open files across all processes. I would assume that your CUs run on different nodes, so neither limit should apply (unless the shared filesystem has a limit, which is doubtful, at least not in this range).
But the agent also keeps a number of files open, for streaming of stdout/stderr of the CUs, and to keep state. Those files are distributed over processes, so the per-process limit should also not apply, but it might be that we hit the system limit...
I'll try to reproduce this on stampede. How urgent is this, ie. are you badly stuck because of this? Thanks!
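The two limits Andre mentions can be inspected directly; a quick check (the system-wide read is Linux-specific, hence the guard):

```python
# Inspect the per-process and system-wide open-file limits mentioned above.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print('per-process: soft=%d, hard=%d' % (soft, hard))

# System-wide limit; /proc is Linux-specific, so guard the read.
try:
    with open('/proc/sys/fs/file-max') as f:
        print('system-wide:', f.read().strip())
except IOError:
    pass
```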
Thanks for the timely response, Andre. Shantenu asked me to generate some plots with data from simulations involving 1000 replicas for next week's proposal, so it would be great if this could be resolved in the next couple of days; otherwise I simply will not have time to get those runs through. As a backup, I have currently submitted a run involving 512 replicas.
the timeliness ended when I fell asleep ;)
Anyway, can you please try setting the number of ExecWorker processes in the agent to, say, 5? This should distribute the load of open file handles over multiple processes, as multiple ExecWorker processes (as opposed to threads) then take care of CU spawning.
PS.: I think this needs the devel branch of RP...
I have tried your suggestion, Andre, and my run failed (on 16 cores); here is the terminal output: log_16. The config I have used is:
pilot_description._config = {'number_of_workers' : {'StageinWorker' : 1,
'ExecWorker' : 5,
'StageoutWorker' : 1,
'UpdateWorker' : 1},
'blowup_factor' : {'Agent' : 0,
'stagein_queue' : 0,
'StageinWorker' : 0,
'schedule_queue' : 0,
'Scheduler' : 0,
'execution_queue' : 0,
'ExecWorker' : 0,
'watch_queue' : 0,
'Watcher' : 0,
'stageout_queue' : 0,
'StageoutWorker' : 0,
'update_queue' : 0,
'UpdateWorker' : 0},
'drop_clones' : {'Agent' : 0,
'stagein_queue' : 0,
'StageinWorker' : 0,
'schedule_queue' : 0,
'Scheduler' : 0,
'execution_queue' : 0,
'ExecWorker' : 0,
'watch_queue' : 0,
'Watcher' : 0,
'stageout_queue' : 0,
'StageoutWorker' : 0,
'update_queue' : 0,
'UpdateWorker' : 0}}
I am not exactly sure what happens here, but please try again, leaving the blowup and drop parameters at 1
(no blowup, drop only cloned units).
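For reference, the dict from above with blowup_factor and drop_clones at 1 for every component can be written compactly (a sketch; the `dict()` call avoids dict comprehensions since the traceback later in this thread shows Python 2.6 on the target machine):

```python
# The same agent config as above, but with blowup_factor and drop_clones
# set to 1 for every component, as suggested.
components = ['Agent',
              'stagein_queue',   'StageinWorker',
              'schedule_queue',  'Scheduler',
              'execution_queue', 'ExecWorker',
              'watch_queue',     'Watcher',
              'stageout_queue',  'StageoutWorker',
              'update_queue',    'UpdateWorker']

config = {'number_of_workers': {'StageinWorker' : 1,
                                'ExecWorker'    : 5,
                                'StageoutWorker': 1,
                                'UpdateWorker'  : 1},
          'blowup_factor'    : dict((c, 1) for c in components),
          'drop_clones'      : dict((c, 1) for c in components)}

# then assign as before: pilot_description._config = config
```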
The log file contains:
/work/02457/antontre/radical.pilot.sandbox/rp.session.antons-pc.antons.016607.0002-pilot.0000/unit.000009/matrix_column_1_0.dat
Did that unit execute correctly? Is there anything useful in the STDERR/STDOUT of that unit? I don't see any log entry that the unit actually got executed at all...
For the first set of CUs, STDERR is "Killed by signal 15.". These units are reported as "done", but output staging for these units does not happen at all. I have already tried with 1s instead of 0s for 'blowup_factor' and 'drop_clones', but will do it again.
I also see:
2015:06:21 15:05:23 27731 PilotLauncherWorker-1 saga.SLURMJobService : [WARNING ] number_of_processes not specified in submitted SLURM job description -- defaulting to 1 per total_cpu_count! (16)
Not sure why this happens.
Ugh... no idea where the "kill 15" comes from. Mark, any idea about the SLURM warning? I don't think that should make a difference, right?
2015:06:21 15:05:23 27731 PilotLauncherWorker-1 saga.SLURMJobService : [WARNING ] number_of_processes not specified in submitted SLURM job description -- defaulting to 1 per total_cpu_count! (16)
This is probably the equivalent of PBS cores:ppn. I'll create a SAGA ticket for that to investigate it.
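For context on that warning: when number_of_processes is left unset, the adaptor falls back to requesting one task per core in total_cpu_count. A hypothetical sketch of that fallback (an illustration, not the actual SAGA adaptor code):

```python
# Hypothetical sketch of the fallback behaviour the warning describes:
# if number_of_processes is unset, request one task per core in
# total_cpu_count. Not the actual SAGA SLURM adaptor code.
def slurm_task_line(total_cpu_count, number_of_processes=None):
    if number_of_processes is None:
        number_of_processes = total_cpu_count  # "defaulting to 1 per total_cpu_count"
    return '#SBATCH --ntasks=%d' % number_of_processes
```

Setting number_of_processes explicitly in the job description would presumably silence the warning.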
The good news: I have just completed a 1000-replica run on Stampede, using the feature/tuu-opt3 branch of repex and the devel branch of RP. The bad news: one CU failed with:
forrtl: severe (24): end-of-file during read, unit 5, file /work/02457/antontre/radical.pilot.sandbox/rp.session.ip-10-184-31-85.treikalis.016618.0001-pilot.0000/unit.004790/ace_ala_nme_remd_785_4.mdin
Image PC Routine Line Source
libintlc.so.5 00002B08BE59DA1E Unknown Unknown Unknown
libintlc.so.5 00002B08BE59C4B6 Unknown Unknown Unknown
libifcore.so.5 00002B08BDE6A01E Unknown Unknown Unknown
libifcore.so.5 00002B08BDDD9B1E Unknown Unknown Unknown
libifcore.so.5 00002B08BDDD901D Unknown Unknown Unknown
libifcore.so.5 00002B08BDE1224C Unknown Unknown Unknown
sander 00000000004DAF7E Unknown Unknown Unknown
sander 00000000004AEF52 Unknown Unknown Unknown
sander 00000000004AED36 Unknown Unknown Unknown
sander 000000000040E39C Unknown Unknown Unknown
libc.so.6 0000003FFA21ED5D Unknown Unknown Unknown
sander 000000000040E299 Unknown Unknown Unknown
Traceback (most recent call last):
File "matrix_calculator_temp_ex.py", line 137, in <module>
shutil.copyfile(src, dst)
File "/usr/lib64/python2.6/shutil.py", line 50, in copyfile
with open(src, 'rb') as fsrc:
IOError: [Errno 2] No such file or directory: u'/work/02457/antontre/radical.pilot.sandbox/rp.session.ip-10-184-31-85.treikalis.016618.0001-pilot.0000/unit.004790/ace_ala_nme_remd_785_4.mdinfo'
I have no idea why this happens.
Just for the record, I have also observed this (US dimension):
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
sander 0000000000585B82 Unknown Unknown Unknown
sander 000000000078B7BE Unknown Unknown Unknown
sander 00000000004FCF95 Unknown Unknown Unknown
sander 00000000004B4AAA Unknown Unknown Unknown
sander 00000000004AED36 Unknown Unknown Unknown
sander 000000000040E39C Unknown Unknown Unknown
libc.so.6 0000003C2461ED5D Unknown Unknown Unknown
sander 000000000040E299 Unknown Unknown Unknown
Killed by signal 15.
I guess some fault tolerance mechanism needs to be implemented...
Not touched for 2 years -> backburner
No CU directories are created on the remote resource; terminal output is:
Full output with RADICAL_PILOT_VERBOSE=debug SAGA_VERBOSE=debug RADICAL_VERBOSE=debug is here
agent.log agent.err agent.out