oar-team / batsim

Batsim: Infrastructure simulator for job and I/O scheduling
GNU Lesser General Public License v3.0

Job MSG bug with computation vector values at 0 #23

Closed lbarallon closed 5 years ago

lbarallon commented 7 years ago

When using an MSG profile with batexec, if the communication matrix is empty, every value of the computation vector must be different from 0. Otherwise:

config yaml profile msg

mpoquet commented 7 years ago

Thanks for the report! I can reproduce the error. However the infinite loop condition is not that simple :(.

What I did

I generated all possible 4-node MSG jobs with 0-filled communication matrices and 0|1-filled computation vectors thanks to some Python magic:

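import itertools

# Enumerate every 0/1 computation vector of length 4 (16 combinations).
s = sorted(itertools.product([0, 1], repeat=4))

# Emit one YAML list entry per combination, e.g. j0101 with cpu_json [0,1,0,1].
for a, b, c, d in s:
    print('- {{"name":"j{a}{b}{c}{d}", "cpu_json":"[{a},{b},{c},{d}]"}}'.format(a=a, b=b, c=c, d=d))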
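Each generated entry only carries a job name and a `cpu_json` vector; the experiment file is what turns it into a workload. Here is a minimal sketch of that step, assuming the workload layout of the `zero.json` example posted at the end of this issue. `make_workload` is a hypothetical helper, not part of Batsim's tools, and I have not checked which additional fields (such as `version`) Batsim strictly requires.

```python
import json


def make_workload(cpu, subtime=10, walltime=100):
    """Build a one-job msg_par workload dict with a 0-filled communication matrix.

    Hypothetical helper: field names mirror the zero.json workload in this issue.
    """
    n = len(cpu)
    return {
        "nb_res": n,
        "jobs": [
            {"id": 1, "profile": "1", "res": n,
             "subtime": subtime, "walltime": walltime}
        ],
        "profiles": {
            "1": {
                "type": "msg_par",
                "cpu": list(cpu),
                "com": [0] * (n * n),  # n x n matrix, flattened row by row
            }
        },
    }


if __name__ == "__main__":
    # Example: the j0101 job from the generated list.
    print(json.dumps(make_workload([0, 1, 0, 1]), indent=2))
```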

I then executed all combinations (thanks to this yaml file) with this command:

export BATSIM_DIR=/path/to/batsim/dir
${BATSIM_DIR}/tools/experiments/execute_instances.py issue23.yaml

Expected output:

[...]
2017-05-05 15:57:37,786 INFO: Number of successfully executed instances: 8
2017-05-05 15:57:37,786 WARNING: Number of skipped instances: 8

Then

${BATSIM_DIR}/tools/experiments/execute_instances.py issue23.yaml --post_only
Instances that worked
j0001
j0011
j0101
j0111
j1001
j1011
j1101
j1111

Instances that failed (division by zero)
j0000

Instance IDs that reached timeout
4a393fc2
8bc8dcdd
9b63f3bb
b6b50799
bc9c3608
daacb058
e4710586

###########
# SUMMARY #
###########
nb_instances: 16
nb_worked: 8
nb_fail_div0: 1
nb_timeout: 7
nb_uncovered:0
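
The timed-out instances are listed by ID only. The worker log from the later run in this thread prints the job attached to each ID (for example `8bc8dcdd` is `j0010`). Assuming those IDs are stable across runs, a short sketch makes the pattern visible. The mapping below is copied from that log, and the "last value" check only describes this data; it is not a statement about SimGrid's internals.

```python
# Instance ID -> job name, copied from the worker log later in this thread
# (assumption: instance IDs are stable between runs).
timeout_ids = {
    "4a393fc2": "j1110",
    "8bc8dcdd": "j0010",
    "9b63f3bb": "j1010",
    "b6b50799": "j1000",
    "bc9c3608": "j1100",
    "daacb058": "j0110",
    "e4710586": "j0100",
}
worked = ["j0001", "j0011", "j0101", "j0111",
          "j1001", "j1011", "j1101", "j1111"]
failed_div0 = ["j0000"]


def last_value(name):
    """Return the last computation value encoded in a job name such as 'j0110'."""
    return int(name[-1])


timed_out = sorted(timeout_ids.values())
print("timed out:", timed_out)
print("every timeout ends with 0:", all(last_value(n) == 0 for n in timed_out))
print("every success ends with 1:", all(last_value(n) == 1 for n in worked))
```

On this data, every job whose last computation value is 1 completed, every job whose last value is 0 (other than the all-zero vector) timed out, and the all-zero vector hit the division by zero.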

Conclusions

0-filled communication matrix with:

mpoquet commented 7 years ago

Opened https://github.com/simgrid/simgrid/issues/165

mpoquet commented 7 years ago

https://github.com/simgrid/simgrid/issues/165 fixed thanks to @frs69wq.

I will close the bug once we are able to use the up-to-date SimGrid version (that is, once "SMPI replay multiple on steroids" works).

mquinson commented 7 years ago

Hello,

maybe you can cherry-pick https://github.com/simgrid/simgrid/commit/857e3b2171472a3d6db57944a99f8986aff18b0a ?

mpoquet commented 7 years ago

Hello and thanks! Cherry-picked simgrid/simgrid@857e3b2 in ba2856c1e.

Only the 0-filled case still fails now, with a fairly clean error (see the log below). I think the error will not happen at all with an up-to-date SimGrid.

We plan to go back to the main SG branch soon. I'll try to make a clean SG example of our dynamicity needs this week and I know @Shurakai plans to improve SMPI dynamicity soon.

2017-05-16 01:39:46,667 INFO: Base working directory: /home/carni/proj/test/batissue23
2017-05-16 01:39:46,667 INFO: Base output directory: /home/carni/proj/test/batissue23/out
2017-05-16 01:39:46,667 INFO: Executing command 'command0'
2017-05-16 01:39:46,708 INFO: command0 finished
2017-05-16 01:39:46,708 INFO: Executing command 'command1'
2017-05-16 01:39:46,757 INFO: command1 finished
2017-05-16 01:39:46,840 INFO: Worker (localhost,0) got 5bb525b6 ({'platform': {'name': 'cluster', 'filename': '${batsim_dir}/platforms/cluster512.xml'}, 'job': {'name': 'j1101', 'cpu_json': '[1,1,0,1]'}, 'explicit': False, 'instance_name': 'implicit'})
2017-05-16 01:39:46,840 INFO: Worker (localhost,0) runs 5bb525b6
2017-05-16 01:39:47,028 INFO: Worker (localhost,0) finished 5bb525b6
2017-05-16 01:39:47,049 INFO: Worker (localhost,0) got 4995eedd ({'platform': {'name': 'cluster', 'filename': '${batsim_dir}/platforms/cluster512.xml'}, 'job': {'name': 'j0101', 'cpu_json': '[0,1,0,1]'}, 'explicit': False, 'instance_name': 'implicit'})
2017-05-16 01:39:47,049 INFO: Worker (localhost,0) runs 4995eedd
2017-05-16 01:39:47,237 INFO: Worker (localhost,0) finished 4995eedd
2017-05-16 01:39:47,258 INFO: Worker (localhost,0) got c5a3c99f ({'platform': {'name': 'cluster', 'filename': '${batsim_dir}/platforms/cluster512.xml'}, 'job': {'name': 'j0011', 'cpu_json': '[0,0,1,1]'}, 'explicit': False, 'instance_name': 'implicit'})
2017-05-16 01:39:47,258 INFO: Worker (localhost,0) runs c5a3c99f
2017-05-16 01:39:47,447 INFO: Worker (localhost,0) finished c5a3c99f
2017-05-16 01:39:47,477 INFO: Worker (localhost,0) got 8bc8dcdd ({'platform': {'name': 'cluster', 'filename': '${batsim_dir}/platforms/cluster512.xml'}, 'job': {'name': 'j0010', 'cpu_json': '[0,0,1,0]'}, 'explicit': False, 'instance_name': 'implicit'})
2017-05-16 01:39:47,478 INFO: Worker (localhost,0) runs 8bc8dcdd
2017-05-16 01:39:47,683 INFO: Worker (localhost,0) finished 8bc8dcdd
2017-05-16 01:39:47,704 INFO: Worker (localhost,0) got daacb058 ({'platform': {'name': 'cluster', 'filename': '${batsim_dir}/platforms/cluster512.xml'}, 'job': {'name': 'j0110', 'cpu_json': '[0,1,1,0]'}, 'explicit': False, 'instance_name': 'implicit'})
2017-05-16 01:39:47,704 INFO: Worker (localhost,0) runs daacb058
2017-05-16 01:39:47,893 INFO: Worker (localhost,0) finished daacb058
2017-05-16 01:39:47,918 INFO: Worker (localhost,0) got 778f453d ({'platform': {'name': 'cluster', 'filename': '${batsim_dir}/platforms/cluster512.xml'}, 'job': {'name': 'j1001', 'cpu_json': '[1,0,0,1]'}, 'explicit': False, 'instance_name': 'implicit'})
2017-05-16 01:39:47,918 INFO: Worker (localhost,0) runs 778f453d
2017-05-16 01:39:48,108 INFO: Worker (localhost,0) finished 778f453d
2017-05-16 01:39:48,129 INFO: Worker (localhost,0) got 924f3020 ({'platform': {'name': 'cluster', 'filename': '${batsim_dir}/platforms/cluster512.xml'}, 'job': {'name': 'j0111', 'cpu_json': '[0,1,1,1]'}, 'explicit': False, 'instance_name': 'implicit'})
2017-05-16 01:39:48,129 INFO: Worker (localhost,0) runs 924f3020
2017-05-16 01:39:48,330 INFO: Worker (localhost,0) finished 924f3020
2017-05-16 01:39:48,360 INFO: Worker (localhost,0) got 4a393fc2 ({'platform': {'name': 'cluster', 'filename': '${batsim_dir}/platforms/cluster512.xml'}, 'job': {'name': 'j1110', 'cpu_json': '[1,1,1,0]'}, 'explicit': False, 'instance_name': 'implicit'})
2017-05-16 01:39:48,360 INFO: Worker (localhost,0) runs 4a393fc2
2017-05-16 01:39:48,568 INFO: Worker (localhost,0) finished 4a393fc2
2017-05-16 01:39:48,593 INFO: Worker (localhost,0) got bc9c3608 ({'platform': {'name': 'cluster', 'filename': '${batsim_dir}/platforms/cluster512.xml'}, 'job': {'name': 'j1100', 'cpu_json': '[1,1,0,0]'}, 'explicit': False, 'instance_name': 'implicit'})
2017-05-16 01:39:48,593 INFO: Worker (localhost,0) runs bc9c3608
2017-05-16 01:39:48,798 INFO: Worker (localhost,0) finished bc9c3608
2017-05-16 01:39:48,819 INFO: Worker (localhost,0) got e4710586 ({'platform': {'name': 'cluster', 'filename': '${batsim_dir}/platforms/cluster512.xml'}, 'job': {'name': 'j0100', 'cpu_json': '[0,1,0,0]'}, 'explicit': False, 'instance_name': 'implicit'})
2017-05-16 01:39:48,819 INFO: Worker (localhost,0) runs e4710586
2017-05-16 01:39:49,051 INFO: Worker (localhost,0) finished e4710586
2017-05-16 01:39:49,072 INFO: Worker (localhost,0) got bea187f9 ({'platform': {'name': 'cluster', 'filename': '${batsim_dir}/platforms/cluster512.xml'}, 'job': {'name': 'j0000', 'cpu_json': '[0,0,0,0]'}, 'explicit': False, 'instance_name': 'implicit'})
2017-05-16 01:39:49,072 INFO: Worker (localhost,0) runs bea187f9
2017-05-16 01:39:49,355 ERROR: Worker (localhost,0) finished bea187f9 (returncode=2).
2017-05-16 01:39:49,370 INFO: 

----- begin of instance bea187f9 log -----
2017-05-16 01:39:49,375 ERROR: Instance bea187f9 stdout:
2017-05-16 01:39:49,199 INFO: Variables = {'base_output_directory': '/home/carni/proj/test/batissue23/out', 'base_working_directory': '/home/carni/proj/test/batissue23', 'batsim_dir': '${BATSIM_DIR}', 'bug_dir': '${base_working_directory}', 'instance_id': 'bea187f9', 'instance_number': 10, 'job': {'cpu_json': '[0,0,0,0]', 'name': 'j0000'}, 'platform': {'filename': '${batsim_dir}/platforms/cluster512.xml', 'name': 'cluster'}, 'working_directory': '/home/carni/proj/test/batissue23', 'output_directory': '/home/carni/proj/test/batissue23/out/results/j0000'}
2017-05-16 01:39:49,199 INFO: Working directory: /home/carni/proj/test/batissue23
2017-05-16 01:39:49,199 INFO: Output directory: /home/carni/proj/test/batissue23/out/results/j0000
2017-05-16 01:39:49,200 INFO: Executing command 'command0'
2017-05-16 01:39:49,208 INFO: command0 finished
2017-05-16 01:39:49,210 INFO: Batsim command: "batsim -p ${platform["filename"]} -w ${output_directory}/workload.json -e ${output_directory}/out -m master_host0 --batexec"
2017-05-16 01:39:49,211 INFO: Running Batsim
2017-05-16 01:39:49,326 ERROR: Batsim finished (returncode=134)
2017-05-16 01:39:49,333 ERROR: Batsim stderr:
[0.000000] [batsim/INFO] Workload '8cfb75' corresponds to workload file '/home/carni/proj/test/batissue23/out/results/j0000/workload.json'.
[0.000000] [workload/INFO] Loading JSON workload '/home/carni/proj/test/batissue23/out/results/j0000/workload.json'...
[0.000000] [workload/INFO] JSON workload parsed sucessfully. Read 1 jobs and 1 profiles.
[0.000000] [workload/INFO] Checking workload validity...
[0.000000] [workload/INFO] Workload seems to be valid.
[0.000000] [batsim/INFO] Checking whether SMPI is used or not...
[0.000000] [batsim/INFO] SMPI will NOT be used.
[0.000000] [machines/INFO] Creating the machines from platform file '/home/carni/proj/batsim/platforms/cluster512.xml'...
[0.000000] [machines/INFO] The name of the master host is 'master_host0'
[0.000000] [machines/INFO] The name of the parallel file system host is 'pfs_host'
[0.000000] [xbt_cfg/INFO] Switching to the L07 model to handle parallel tasks.
[0.000000] [machines/INFO] There is not Pfs_Host (parallel filesystem host).
[0.000000] [machines/INFO] The machines have been created successfully. There are 512 computing machines.
[0.000000] [batsim/INFO] Batsim's export prefix is '/home/carni/proj/test/batissue23/out/results/j0000/out'.
[0.000000] [batsim/INFO] The process 'workload_submitter_8cfb75' has been created.
[a0:job8cfb75!1:(2) 0.000000] [jobs_execution/INFO] Creating task 'p 1'j0000''
[a0:job8cfb75!1:(2) 0.000000] [jobs_execution/INFO] Executing task 'p 1'j0000''
[a0:job8cfb75!1:(2) 0.000000] [jobs_execution/INFO] Task 'p 1'j0000'' finished
[a0:job8cfb75!1:(2) 0.000000] [jobs_execution/INFO] Job 8cfb75!1 finished in time
[0.000000] [export/INFO] PajeTracer finalized
[0.000000] /home/carni/proj/simgrid-mpoquet/src/xbt/exception.cpp:47: [xbt_exception/CRITICAL] Uncaught exception boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<std::overflow_error> >: Division by zero.
[0.000000] /home/carni/proj/simgrid-mpoquet/src/xbt/exception.cpp:80: [xbt_exception/CRITICAL] Current backtrace:
[0.000000] /home/carni/proj/simgrid-mpoquet/src/xbt/exception.cpp:82: [xbt_exception/CRITICAL]   -> simgrid::xbt::backtrace() at /home/carni/proj/simgrid-mpoquet/src/xbt/backtrace.cpp:79, 0x7f3536770cbf
[0.000000] /home/carni/proj/simgrid-mpoquet/src/xbt/exception.cpp:82: [xbt_exception/CRITICAL]   -> simgrid::xbt::handler() at /home/carni/proj/simgrid-mpoquet/src/xbt/exception.cpp:101, 0x7f35368b3c08
[0.000000] /home/carni/proj/simgrid-mpoquet/src/xbt/exception.cpp:82: [xbt_exception/CRITICAL]   -> __cxxabiv1::__terminate(void (*)()) at /build/gcc-multilib/src/gcc/libstdc++-v3/libsupc++/eh_terminate.cc:51, 0x7f35347302a6
[0.000000] /home/carni/proj/simgrid-mpoquet/src/xbt/exception.cpp:82: [xbt_exception/CRITICAL]   -> std::terminate() at ??:?, 0x7f35347302f1
[0.000000] /home/carni/proj/simgrid-mpoquet/src/xbt/exception.cpp:82: [xbt_exception/CRITICAL]   -> __cxa_throw at ??:?, 0x7f3534730508
[0.000000] /home/carni/proj/simgrid-mpoquet/src/xbt/exception.cpp:82: [xbt_exception/CRITICAL]   -> void boost::throw_exception<boost::exception_detail::error_info_injector<std::overflow_error> >(boost::exception_detail::error_info_injector<std::overflow_error> const&) at /usr/include/boost/throw_exception.hpp:69 (discriminator 2), 0x56b162
[0.000000] /home/carni/proj/simgrid-mpoquet/src/xbt/exception.cpp:82: [xbt_exception/CRITICAL]   -> void boost::exception_detail::throw_exception_<std::overflow_error>(std::overflow_error const&, char const*, char const*, int) at /usr/include/boost/throw_exception.hpp:86, 0x56b0c9
[0.000000] /home/carni/proj/simgrid-mpoquet/src/xbt/exception.cpp:82: [xbt_exception/CRITICAL]   -> boost::multiprecision::backends::eval_divide(boost::multiprecision::backends::gmp_rational&, boost::multiprecision::backends::gmp_rational const&, boost::multiprecision::backends::gmp_rational const&) at /usr/include/boost/multiprecision/gmp.hpp:2093, 0x56fcbd
[0.000000] /home/carni/proj/simgrid-mpoquet/src/xbt/exception.cpp:82: [xbt_exception/CRITICAL]   -> void boost::multiprecision::number<boost::multiprecision::backends::gmp_rational, (boost::multiprecision::expression_template_option)1>::do_assign<boost::multiprecision::detail::expression<boost::multiprecision::detail::divide_immediates, boost::multiprecision::number<boost::multiprecision::backends::gmp_rational, (boost::multiprecision::expression_template_option)1>, boost::multiprecision::number<boost::multiprecision::backends::gmp_rational, (boost::multiprecision::expression_template_option)1>, void, void> >(boost::multiprecision::detail::expression<boost::multiprecision::detail::divide_immediates, boost::multiprecision::number<boost::multiprecision::backends::gmp_rational, (boost::multiprecision::expression_template_option)1>, boost::multiprecision::number<boost::multiprecision::backends::gmp_rational, (boost::multiprecision::expression_template_option)1>, void, void> const&, boost::multiprecision::detail::divide_immediates const&) at /usr/include/boost/multiprecision/number.hpp:745, 0x56fc4f
/home/carni/proj/test/batissue23/out/results/j0000/batsim_command.sh : ligne 7 : 23654 Abandon                 (core dumped)batsim -p ${platform["filename"]} -w ${output_directory}/workload.json -e ${output_directory}/out -m master_host0 --batexec

2017-05-16 01:39:49,379 INFO: ----- end of instance bea187f9 log -----

2017-05-16 01:39:49,389 INFO: Worker (localhost,0) got aa7d02b6 ({'platform': {'name': 'cluster', 'filename': '${batsim_dir}/platforms/cluster512.xml'}, 'job': {'name': 'j0001', 'cpu_json': '[0,0,0,1]'}, 'explicit': False, 'instance_name': 'implicit'})
2017-05-16 01:39:49,389 INFO: Worker (localhost,0) runs aa7d02b6
2017-05-16 01:39:49,609 INFO: Worker (localhost,0) finished aa7d02b6
2017-05-16 01:39:49,630 INFO: Worker (localhost,0) got 9b63f3bb ({'platform': {'name': 'cluster', 'filename': '${batsim_dir}/platforms/cluster512.xml'}, 'job': {'name': 'j1010', 'cpu_json': '[1,0,1,0]'}, 'explicit': False, 'instance_name': 'implicit'})
2017-05-16 01:39:49,630 INFO: Worker (localhost,0) runs 9b63f3bb
2017-05-16 01:39:49,828 INFO: Worker (localhost,0) finished 9b63f3bb
2017-05-16 01:39:49,849 INFO: Worker (localhost,0) got 4d91af6f ({'platform': {'name': 'cluster', 'filename': '${batsim_dir}/platforms/cluster512.xml'}, 'job': {'name': 'j1011', 'cpu_json': '[1,0,1,1]'}, 'explicit': False, 'instance_name': 'implicit'})
2017-05-16 01:39:49,849 INFO: Worker (localhost,0) runs 4d91af6f
2017-05-16 01:39:50,059 INFO: Worker (localhost,0) finished 4d91af6f
2017-05-16 01:39:50,086 INFO: Worker (localhost,0) got b6b50799 ({'platform': {'name': 'cluster', 'filename': '${batsim_dir}/platforms/cluster512.xml'}, 'job': {'name': 'j1000', 'cpu_json': '[1,0,0,0]'}, 'explicit': False, 'instance_name': 'implicit'})
2017-05-16 01:39:50,087 INFO: Worker (localhost,0) runs b6b50799
2017-05-16 01:39:50,293 INFO: Worker (localhost,0) finished b6b50799
2017-05-16 01:39:50,324 INFO: Worker (localhost,0) got a7a86b6a ({'platform': {'name': 'cluster', 'filename': '${batsim_dir}/platforms/cluster512.xml'}, 'job': {'name': 'j1111', 'cpu_json': '[1,1,1,1]'}, 'explicit': False, 'instance_name': 'implicit'})
2017-05-16 01:39:50,324 INFO: Worker (localhost,0) runs a7a86b6a
2017-05-16 01:39:50,535 INFO: Worker (localhost,0) finished a7a86b6a
2017-05-16 01:39:50,548 INFO: Worker (localhost,0) finished
2017-05-16 01:39:50,553 INFO: Number of successfully executed instances: 15
2017-05-16 01:39:50,553 WARNING: Number of skipped instances: 1
2017-05-16 01:39:50,554 WARNING: Information about these instances can be found in file /home/carni/proj/test/batissue23/out/instances/instances_info.csv

mpoquet commented 5 years ago

Seems to work now (1ac8880). Closing this issue.

Command

batsim -p ./platforms/small_platform.xml -w ./zero.json --batexec

Workload

{
  "command:":"",
  "date":"Tue May  2 11:04:04 2017",
  "description":"workload with profile file for test",
  "jobs":[
    {
      "id":1,
      "profile":"1",
      "res":4,
      "subtime":10,
      "walltime":100
    }
  ],
  "nb_res":4,
  "profiles":{
    "1":{
      "com":[
        0, 0, 0, 0,
        0, 0, 0, 0,
        0, 0, 0, 0,
        0, 0, 0, 0
      ],
      "cpu":[
        0, 0, 0, 0
      ],
      "type":"msg_par"
    }
  },
  "version":0
}
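
To spot workloads that fall into the cases discussed in this issue before running them, something like the following could be used. It is only a sketch: `zero_report` is a hypothetical helper, and it assumes the profile layout shown above (`type`, `com`, `cpu`).

```python
def zero_report(workload):
    """Map each msg_par profile name to (com_all_zero, indices of zero cpu values)."""
    report = {}
    for name, profile in workload.get("profiles", {}).items():
        if profile.get("type") != "msg_par":
            continue  # only parallel-task profiles are concerned here
        com_all_zero = all(v == 0 for v in profile.get("com", []))
        zero_cpu = [i for i, v in enumerate(profile.get("cpu", [])) if v == 0]
        report[name] = (com_all_zero, zero_cpu)
    return report


# The profile section of the workload above, reduced to what the check needs.
example = {
    "profiles": {
        "1": {"type": "msg_par", "com": [0] * 16, "cpu": [0, 0, 0, 0]}
    }
}

if __name__ == "__main__":
    for name, (com_zero, zero_cpu) in zero_report(example).items():
        print(f"profile {name}: com all zero={com_zero}, zero cpu indices={zero_cpu}")
```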

Output


platform_filename: ./platforms/small_platform.xml
[0.000000] [batsim/INFO] Workload 'b951eb' corresponds to workload file '/home/carni/proj/batsim/./zero.json'.
[0.000000] [batsim/INFO] Batsim version: v2.0.0-183-gc4d65c0
[0.000000] [workload/INFO] Loading JSON workload '/home/carni/proj/batsim/./zero.json'...
[0.000000] [workload/INFO] JSON workload parsed sucessfully. Read 1 jobs and 1 profiles.
[0.000000] [workload/INFO] Checking workload validity...
[0.000000] [workload/INFO] Workload seems to be valid.
[0.000000] [batsim/INFO] Checking whether SMPI is used or not...
[0.000000] [batsim/INFO] SMPI will NOT be used.
[0.000000] [xbt_cfg/INFO] Configuration change: Set 'host/model' to 'ptask_L07'
[0.000000] [machines/INFO] Creating the machines from platform file './platforms/small_platform.xml'...
[0.000000] [surf_parse/INFO] You're using a v4.0 XML file (./platforms/small_platform.xml) while the current standard is v4.1 That's fine, the new version is backward compatible. 

Use simgrid_update_xml to update your file automatically to get rid of this warning. This program is installed automatically with SimGrid, or available in the tools/ directory of the source archive.
[0.000000] [xbt_cfg/INFO] Switching to the L07 model to handle parallel tasks.
[0.000000] [machines/INFO] Looking for master host 'master_host'
[0.000000] [machines/INFO] The machines have been created successfully. There are 4 computing machines.
[0.000000] [batsim/INFO] Batsim's export prefix is 'out'.
[0.000000] [batsim/INFO] The process 'workload_submitter_b951eb' has been created.
[Bourassa:jobb951eb!1:(2) 0.000000] [jobs_execution/INFO] Job b951eb!1 finished in time (success)
[Bourassa:jobb951eb!1:(2) 0.000000] /home/carni/proj/batsim/src/jobs_execution.cpp:501: [jobs_execution/WARNING] Job 'b951eb!1' computed in null time. Putting epsilon instead.
[0.000000] [export/INFO] PajeTracer finalized
[0.000000] [export/INFO] jobs=1, finished=1, success=1, killed=0, success_rate=1.000000
[0.000000] [export/INFO] makespan=0.000010, scheduling_time=0.000000, mean_waiting_time=-10.000000, mean_turnaround_time=-9.999990, mean_slowdown=-999999.000000, max_waiting_time=0.000000, max_turnaround_time=0.000000, max_slowdown=0.000000
[0.000000] [export/INFO] mean_machines_running=0.000010, max_machines_running=0.000010
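
The negative metrics in the export lines are consistent with the job running at simulated time 0 although its `subtime` is 10, and with the null runtime being replaced by an epsilon of 0.00001 (the `makespan` value). A quick check, assuming the usual definitions `waiting = start - submit`, `turnaround = finish - submit` and `slowdown = turnaround / runtime`; these formulas are my assumptions and are not taken from Batsim's source.

```python
submit = 10.0      # "subtime" of the job in the workload
start = 0.0        # the job runs at t=0 according to the log
runtime = 0.00001  # epsilon substituted for the null runtime (makespan=0.000010)
finish = start + runtime

waiting = start - submit        # compare with mean_waiting_time in the log
turnaround = finish - submit    # compare with mean_turnaround_time in the log
slowdown = turnaround / runtime # compare with mean_slowdown in the log

print(f"waiting={waiting:.6f} turnaround={turnaround:.6f} slowdown={slowdown:.6f}")
```

So the values are not a new bug in the MSG execution itself: they follow from executing a job before its submission time and dividing by the epsilon runtime.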