Closed ringleschavez closed 2 years ago
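C.E.C.I.'s Slurm F.A.Q., as quoted in "HPC cluster: select the number of CPUs and threads in SLURM sbatch". Suppose 16 cores are needed:
--ntasks=16 (MPI job, no constraint on how the cores are distributed)
--ntasks=16 (16 independent processes, no communication)
--ntasks=16 and --ntasks-per-node=1 (cores spread across distinct nodes)
or --ntasks=16 and --nodes=16
--ntasks=16 --nodes=16 --exclusive (distinct nodes, no interference from other jobs)
--ntasks=16 --ntasks-per-node=2 (16 processes on 8 nodes, two per node)
--ntasks=16 --ntasks-per-node=16 (16 processes on the same node)
--ntasks=1 --cpus-per-task=16 (one process using 16 cores for multithreading)
--ntasks=4 --cpus-per-task=4 (4 processes using 4 cores each for multithreading)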
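The resource footprint of the option combinations above follows from simple arithmetic: total cores are tasks times CPUs per task, and the node count is either given by --nodes or is the task count divided by tasks per node, rounded up. A minimal Python sketch of that arithmetic, assuming a hypothetical helper named `footprint` (it is not part of Slurm or of the Co-Simulator) and ignoring partition limits and --exclusive:

```python
import math


def footprint(ntasks, cpus_per_task=1, ntasks_per_node=None, nodes=None):
    """Return (total_cores, node_count) implied by a set of sbatch options.

    Hypothetical helper for illustration only; Slurm decides the final
    placement. node_count is None when neither --nodes nor
    --ntasks-per-node is given, i.e. the scheduler is free to choose.
    """
    total_cores = ntasks * cpus_per_task
    if nodes is not None:
        node_count = nodes
    elif ntasks_per_node is not None:
        # Each node holds at most ntasks_per_node tasks, so round up.
        node_count = math.ceil(ntasks / ntasks_per_node)
    else:
        node_count = None
    return total_cores, node_count


if __name__ == "__main__":
    print(footprint(16, ntasks_per_node=2))   # --ntasks=16 --ntasks-per-node=2
    print(footprint(4, cpus_per_task=4))      # --ntasks=4 --cpus-per-task=4
```

Every combination in the list requests 16 cores in total; only the distribution over nodes and processes differs.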
After running ${MULTISCALETVBNEST}/launcher/tests/plans/simple_plan_on_cluster.xml on JUWELS, several ERROR messages appeared in the log. These are not actual failures: Slurm commands such as srun write some informational messages to stderr, and the Co-Simulator reports everything it reads from the stderr buffer as ERROR.
2021-06-21 14:20:54,834 - INFO - common.cosimulator - [Spawner-2:20460] - action_004: PPID=24823,PID=24829,MPI.COMM_WORLD.size=2,MPI.COMM_WORLD.rank=0,MPI.processor_name=jwc00n014.juwels
2021-06-21 14:20:54,834 - ERROR - common.cosimulator - [Spawner-2:20460] - action_004: srun: job 3884042 queued and waiting for resources
2021-06-21 14:20:54,834 - INFO - common.cosimulator - [Spawner-2:20460] - action_004: PPID=24824,PID=24830,MPI.COMM_WORLD.size=2,MPI.COMM_WORLD.rank=1,MPI.processor_name=jwc00n014.juwels
2021-06-21 14:20:54,834 - ERROR - common.cosimulator - [Spawner-2:20460] - action_004: srun: job 3884042 has been allocated resources
2021-06-21 14:20:55,037 - INFO - common.cosimulator - [Spawner-1:20459] - action_006: PPID=16355,PID=16360,MPI.COMM_WORLD.size=2,MPI.COMM_WORLD.rank=0,MPI.processor_name=jwc00n004.juwels
2021-06-21 14:20:55,037 - ERROR - common.cosimulator - [Spawner-1:20459] - action_006: srun: job 3884043 queued and waiting for resources
2021-06-21 14:20:55,037 - INFO - common.cosimulator - [Spawner-1:20459] - action_006: PPID=16356,PID=16362,MPI.COMM_WORLD.size=2,MPI.COMM_WORLD.rank=1,MPI.processor_name=jwc00n004.juwels
2021-06-21 14:20:55,037 - ERROR - common.cosimulator - [Spawner-1:20459] - action_006: srun: job 3884043 has been allocated resources
2021-06-21 14:20:55,039 - INFO - common.cosimulator - [Spawner-2:20460] - action_004: PPID=24824, PID=24830, Cosimulation_outputs/ingleschavez1_outputs_2021-06-21_142021/results/simple_test/24830.output has been generated
2021-06-21 14:20:55,039 - INFO - common.cosimulator - [Spawner-2:20460] - action_004: PPID=24823, PID=24829, Cosimulation_outputs/ingleschavez1_outputs_2021-06-21_142021/results/simple_test/24829.output has been generated
2021-06-21 14:20:55,039 - INFO - common.cosimulator - [Spawner-2:20460] - Action <action_004> finished properly.
2021-06-21 14:20:55,039 - INFO - common.cosimulator - [Spawner-2:20460] - PPID=20458,PID=20460,Spawner-2: the <action_004> action has finished
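The log above shows the pattern: srun status lines arrive on stderr and end up tagged ERROR even though the action finishes properly. One possible way to handle this is to downgrade stderr lines that match known informational Slurm messages. The sketch below is only an illustration: `classify_stderr_line` and `SLURM_INFO_PATTERNS` are hypothetical names, not taken from the Co-Simulator code, and the patterns cover only the two messages seen in this log.

```python
import logging
import re

# Informational srun messages observed in the log above (not exhaustive).
SLURM_INFO_PATTERNS = [
    re.compile(r"^srun: job \d+ queued and waiting for resources$"),
    re.compile(r"^srun: job \d+ has been allocated resources$"),
]


def classify_stderr_line(line):
    """Return the logging level to use for one line read from stderr.

    Known informational Slurm messages map to INFO; anything else read
    from stderr keeps the ERROR level.
    """
    text = line.strip()
    if any(pattern.match(text) for pattern in SLURM_INFO_PATTERNS):
        return logging.INFO
    return logging.ERROR


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger("stderr_demo")  # hypothetical logger name
    for raw in (
        "srun: job 3884042 queued and waiting for resources\n",
        "srun: job 3884042 has been allocated resources\n",
    ):
        logger.log(classify_stderr_line(raw), "action_004: %s", raw.strip())
```

Matching against an explicit allow-list keeps unknown stderr output at ERROR, so real failures are not silently hidden.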
Issue #112 was created from the following task, which had already been dropped:
Summary
Tasks
test_translator_nest_to_tvb.sh
(@sontheimer implementation) by means of the Co-Simulator on the SCsRequirements
Acceptance criteria
NOTE
Running the test_co_sim.sh use case mentioned above on HPC systems, specifically on the FZJ/JSC infrastructure, is followed up in issue #93.