antonst opened this issue 9 years ago
Hi Antons, can you give me a rough estimate of the total number of CUs that would be created, and of the number of concurrent CUs you would expect at any time? Thanks!
Hi Mark: concurrent CUs = 1000; total CUs per cycle = 6000 (max 1000 CUs per submission, each submission followed by a wait call); total cycles = 4; total cores = 1000.
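Spelled out, those numbers imply the following totals (simple arithmetic on the figures quoted above):

```python
# Quick sanity check of the run dimensions quoted above.
concurrent_cus = 1000   # max CUs in flight (one submission batch, then a wait)
cus_per_cycle  = 6000
cycles         = 4

total_cus = cus_per_cycle * cycles            # 24000 CUs over the whole run
batches   = cus_per_cycle // concurrent_cus   # 6 submit/wait batches per cycle
```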
Thanks Antons. Do you know in which batch this happens? Also, could you please provide details on how to reproduce it? The problem seems clear ("too many open files" is a plausible error) -- but the cause is not, i.e., I am not sure why so many files should be open, so I would need to debug this in some detail...
Andre, you can reproduce it by running examples/amber_pattern_b_3d_tuu (feature/2d-prof branch of repex), using ace_ala_nme_input.json as the input file:
{
    "input.PILOT": {
        "resource": "stampede.tacc.utexas.edu",
        "username": <user>,
        "project": <project>,
        "runtime": "520",
        "cleanup": "False",
        "cores": "1000"
    },
    "input.MD": {
        "number_of_cycles": "3",
        "input_folder": "amber_inp",
        "input_file_basename": "ace_ala_nme_remd",
        "amber_input": "ace_ala_nme.mdin",
        "amber_parameters": "ace_ala_nme.parm7",
        "amber_coordinates": "ace_ala_nme.inpcrd",
        "us_template": "ace_ala_nme_us.RST",
        "replica_mpi": "False",
        "replica_cores": "1",
        "steps_per_cycle": "6000",
        "exchange_off": "True"
    },
    "input.DIM": {
        "temperature_2": {
            "number_of_replicas": "10",
            "min_temperature": "300",
            "max_temperature": "600",
            "exchange_replica_cores": "1",
            "exchange_replica_mpi": "False"
        },
        "umbrella_sampling_1": {
            "number_of_replicas": "10",
            "us_start_param": "0",
            "us_end_param": "360",
            "exchange_replica_cores": "1",
            "exchange_replica_mpi": "False"
        },
        "umbrella_sampling_3": {
            "number_of_replicas": "10",
            "us_start_param": "0",
            "us_end_param": "360",
            "exchange_replica_cores": "1",
            "exchange_replica_mpi": "False"
        }
    }
}
Which are those open files actually?
During the pre-exec of the first MD cycle, the .mdin files are opened and their placeholders are replaced with actual parameters. So 1000 files are open concurrently, which possibly causes this error. I think this is the most likely reason.
Thanks, Antons
The operating system limits both (a) the number of files a single process can open (usually 256 to 4096) and (b) the total number of open files across all processes. I would assume that your CUs run on different nodes, so neither limit should apply (unless the shared filesystem has a limit, which is doubtful, at least not in this range).
But the agent also keeps a number of files open, for streaming of stdout/stderr of the CUs, and to keep state. Those files are distributed over processes, so the per-process limit should also not apply, but it might be that we hit the system limit...
I'll try to reproduce this on stampede. How urgent is this, ie. are you badly stuck because of this? Thanks!
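The two limits Andre mentions can be inspected directly; a quick check (the system-wide read is Linux-specific, hence the guard):

```python
# Inspect the per-process and system-wide open-file limits mentioned above.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print('per-process: soft=%d, hard=%d' % (soft, hard))

# System-wide limit; /proc is Linux-specific, so guard the read.
try:
    with open('/proc/sys/fs/file-max') as f:
        print('system-wide:', f.read().strip())
except IOError:
    pass
```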
Thanks for the timely response, Andre. Shantenu asked me to generate some plots with data from simulations involving 1000 replicas for next week's proposal, so it would be great if this could be resolved in the next couple of days; otherwise I simply will not have time to get those runs through. As a backup, I have currently submitted a run involving 512 replicas.
the timeliness ended when I fell asleep ;)
Anyway, can you please try setting the number of ExecWorker processes in the agent to, say, 5? This should distribute the load of open file handles over multiple processes, as multiple ExecWorker processes (as opposed to threads) then take care of CU spawning.
PS.: I think this needs the devel branch of RP...
I have tried your suggestion, Andre, and my run failed (on 16 cores); here is the terminal output: log_16. The config I have used is:
pilot_description._config = {'number_of_workers' : {'StageinWorker' : 1,
'ExecWorker' : 5,
'StageoutWorker' : 1,
'UpdateWorker' : 1},
'blowup_factor' : {'Agent' : 0,
'stagein_queue' : 0,
'StageinWorker' : 0,
'schedule_queue' : 0,
'Scheduler' : 0,
'execution_queue' : 0,
'ExecWorker' : 0,
'watch_queue' : 0,
'Watcher' : 0,
'stageout_queue' : 0,
'StageoutWorker' : 0,
'update_queue' : 0,
'UpdateWorker' : 0},
'drop_clones' : {'Agent' : 0,
'stagein_queue' : 0,
'StageinWorker' : 0,
'schedule_queue' : 0,
'Scheduler' : 0,
'execution_queue' : 0,
'ExecWorker' : 0,
'watch_queue' : 0,
'Watcher' : 0,
'stageout_queue' : 0,
'StageoutWorker' : 0,
'update_queue' : 0,
'UpdateWorker' : 0}}
I am not exactly sure what happens here, but please try again, leaving the blowup and drop parameters at 1
(no blowup, drop only cloned units).
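For reference, the dict from above with blowup_factor and drop_clones at 1 for every component can be written compactly (a sketch; the `dict()` call avoids dict comprehensions since the traceback later in this thread shows Python 2.6 on the target machine):

```python
# The same agent config as above, but with blowup_factor and drop_clones
# set to 1 for every component, as suggested.
components = ['Agent',
              'stagein_queue',   'StageinWorker',
              'schedule_queue',  'Scheduler',
              'execution_queue', 'ExecWorker',
              'watch_queue',     'Watcher',
              'stageout_queue',  'StageoutWorker',
              'update_queue',    'UpdateWorker']

config = {'number_of_workers': {'StageinWorker' : 1,
                                'ExecWorker'    : 5,
                                'StageoutWorker': 1,
                                'UpdateWorker'  : 1},
          'blowup_factor'    : dict((c, 1) for c in components),
          'drop_clones'      : dict((c, 1) for c in components)}

# then assign as before: pilot_description._config = config
```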
The log file contains:
/work/02457/antontre/radical.pilot.sandbox/rp.session.antons-pc.antons.016607.0002-pilot.0000/unit.000009/matrix_column_1_0.dat
Did that unit execute correctly? Is there anything useful in the STDERR/STDOUT of that unit? I don't see any log entry that the unit actually got executed at all...
For the first set of CUs, STDERR is "Killed by signal 15.". These units are reported as "done", but output staging for these units does not happen at all. I have already tried with 1s instead of 0s for 'blowup_factor' and 'drop_clones', but will do it again.
I also see:
2015:06:21 15:05:23 27731 PilotLauncherWorker-1 saga.SLURMJobService : [WARNING ] number_of_processes not specified in submitted SLURM job description -- defaulting to 1 per total_cpu_count! (16)
Not sure why this happens.
Ugh... no idea where the "kill 15" comes from. Mark, any idea about the SLURM warning? I don't think that should make a difference, right?
2015:06:21 15:05:23 27731 PilotLauncherWorker-1 saga.SLURMJobService : [WARNING ] number_of_processes not specified in submitted SLURM job description -- defaulting to 1 per total_cpu_count! (16)
This is probably the equivalent of PBS cores:ppn. I'll create a SAGA ticket for that to investigate it.
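For context on that warning: when number_of_processes is left unset, the adaptor falls back to requesting one task per core in total_cpu_count. A hypothetical sketch of that fallback (an illustration, not the actual SAGA adaptor code):

```python
# Hypothetical sketch of the fallback behaviour the warning describes:
# if number_of_processes is unset, request one task per core in
# total_cpu_count. Not the actual SAGA SLURM adaptor code.
def slurm_task_line(total_cpu_count, number_of_processes=None):
    if number_of_processes is None:
        number_of_processes = total_cpu_count  # "defaulting to 1 per total_cpu_count"
    return '#SBATCH --ntasks=%d' % number_of_processes
```

Setting number_of_processes explicitly in the job description would presumably silence the warning.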
The good news: I have just completed a 1000-replica run on Stampede, using the feature/tuu-opt3 branch of repex and the devel branch of RP. The bad news: one CU failed with:
forrtl: severe (24): end-of-file during read, unit 5, file /work/02457/antontre/radical.pilot.sandbox/rp.session.ip-10-184-31-85.treikalis.016618.0001-pilot.0000/unit.004790/ace_ala_nme_remd_785_4.mdin
Image PC Routine Line Source
libintlc.so.5 00002B08BE59DA1E Unknown Unknown Unknown
libintlc.so.5 00002B08BE59C4B6 Unknown Unknown Unknown
libifcore.so.5 00002B08BDE6A01E Unknown Unknown Unknown
libifcore.so.5 00002B08BDDD9B1E Unknown Unknown Unknown
libifcore.so.5 00002B08BDDD901D Unknown Unknown Unknown
libifcore.so.5 00002B08BDE1224C Unknown Unknown Unknown
sander 00000000004DAF7E Unknown Unknown Unknown
sander 00000000004AEF52 Unknown Unknown Unknown
sander 00000000004AED36 Unknown Unknown Unknown
sander 000000000040E39C Unknown Unknown Unknown
libc.so.6 0000003FFA21ED5D Unknown Unknown Unknown
sander 000000000040E299 Unknown Unknown Unknown
Traceback (most recent call last):
File "matrix_calculator_temp_ex.py", line 137, in <module>
shutil.copyfile(src, dst)
File "/usr/lib64/python2.6/shutil.py", line 50, in copyfile
with open(src, 'rb') as fsrc:
IOError: [Errno 2] No such file or directory: u'/work/02457/antontre/radical.pilot.sandbox/rp.session.ip-10-184-31-85.treikalis.016618.0001-pilot.0000/unit.004790/ace_ala_nme_remd_785_4.mdinfo'
I have no idea why this happens.
Just for the record, I have also observed this (US dimension):
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
sander 0000000000585B82 Unknown Unknown Unknown
sander 000000000078B7BE Unknown Unknown Unknown
sander 00000000004FCF95 Unknown Unknown Unknown
sander 00000000004B4AAA Unknown Unknown Unknown
sander 00000000004AED36 Unknown Unknown Unknown
sander 000000000040E39C Unknown Unknown Unknown
libc.so.6 0000003C2461ED5D Unknown Unknown Unknown
sander 000000000040E299 Unknown Unknown Unknown
Killed by signal 15.
I guess some fault tolerance mechanism needs to be implemented...
Not touched for 2 years -> backburner
No CU directories are created on the remote resource; terminal output is:
Full output with RADICAL_PILOT_VERBOSE=debug SAGA_VERBOSE=debug RADICAL_VERBOSE=debug is here
agent.log agent.err agent.out