oar-team / batsim

Batsim: Infrastructure simulator for job and I/O scheduling
GNU Lesser General Public License v3.0

[bug] Segfault after job completion when the gateway is wrong #57

Closed TheElectronWill closed 3 years ago

TheElectronWill commented 3 years ago
  1. Take a simple platform with a cluster of nodes + a master node.
  2. Create a zoneRoute between the two. By mistake, swap gw_src and gw_dst.
  3. Launch batsim with some jobs.

=> Result: batsim crashes with a segmentation fault once a job finishes. I would expect at least a warning when the platform is loaded or when the route is first used, which would make such mistakes easier to find.

platform:

<?xml version='1.0'?>
<!DOCTYPE platform SYSTEM "https://simgrid.org/simgrid.dtd">
<platform version="4.1">
  <zone id="world" routing="Full">
    <!-- compute nodes -->
    <cluster id="cluster_crossbar" router_id="router_cb"
       prefix="node" radical="0-4" suffix=""
       speed="1Gf" bw="125MBps" lat="50us"  bb_bw="2.25GBps" bb_lat="500us">
    </cluster>

    <!-- master node -->
    <cluster id="cluster_master" router_id="router_master"
      prefix="master" radical="0-0" suffix="" speed="1Gf" bw="125MBps" lat="50us">
      <prop id="role" value="master"/>
    </cluster>

    <link id="backbone" bandwidth="1.25Gbps" latency="50us"/>
    <zoneRoute src="cluster_crossbar" dst="cluster_master" gw_src="router_master" gw_dst="router_cb"> <!-- !!! -->
      <link_ctn id="backbone"/>
    </zoneRoute>
  </zone>
</platform>

workload:

{
    "nb_res": 4,
    "jobs": [
        {
            "id": 1,
            "subtime": 1,
            "walltime": 100,
            "res": 4,
            "profile": "delay"
        }
    ],
    "profiles": {
        "hg_10": {
            "type": "parallel_homogeneous",
            "cpu": 1000000000.0,
            "com": 0
        },
        "delay": {
            "type": "delay",
            "delay": 20
        }
    }
}

Versions

Logs

[node0:job_w0!1:(5) 21.005200] [jobs_execution/INFO] Job 'w0!1' finished in time (success)
[node0:job_w0!1:(5) 21.005200] ../src/ipp.cpp:24: [ipp/DEBUG] message from 'job_w0!1' to 'server' of type 'JOB_COMPLETED' with data 0x15371b0
Segmentation fault.

There is no stack trace.

Possible fixes
It would be nice to detect wrong gateways and emit an error/warning, either when loading the configuration or when using the gateway for the first time.
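
As an illustration of the load-time check suggested above, here is a minimal sketch in Python. It is not SimGrid's or Batsim's implementation: the function name `find_gateway_mismatches` is hypothetical, and it assumes that every zone referenced by a `zoneRoute` is a `cluster` that declares its router explicitly through `router_id`, as in the platform above. Zones without an explicit `router_id` are skipped rather than guessed at.

```python
import xml.etree.ElementTree as ET


def find_gateway_mismatches(platform_xml: str) -> list[str]:
    """Return one message per zoneRoute gateway that does not match its zone.

    Hypothetical helper: only clusters with an explicit router_id are checked.
    """
    root = ET.fromstring(platform_xml)

    # Map each cluster id to the router it declares via router_id.
    router_of = {
        cluster.get("id"): cluster.get("router_id")
        for cluster in root.iter("cluster")
        if cluster.get("router_id") is not None
    }

    problems = []
    for route in root.iter("zoneRoute"):
        # gw_src must be the router of src, gw_dst the router of dst.
        for zone_attr, gw_attr in (("src", "gw_src"), ("dst", "gw_dst")):
            zone = route.get(zone_attr)
            gateway = route.get(gw_attr)
            expected = router_of.get(zone)
            if expected is not None and gateway != expected:
                problems.append(
                    f"zoneRoute {route.get('src')} -> {route.get('dst')}: "
                    f"{gw_attr}='{gateway}' but zone '{zone}' "
                    f"declares router '{expected}'"
                )
    return problems
```

Run on the platform above, this would flag both `gw_src` and `gw_dst` of the `cluster_crossbar` to `cluster_master` route. With the two gateways swapped back, it returns an empty list.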

mpoquet commented 3 years ago

Hello, and thank you for reporting! This looks like a SimGrid issue to us. We will investigate and keep you posted.

mpoquet commented 3 years ago

Reported to SimGrid: https://framagit.org/simgrid/simgrid/-/issues/71

mpoquet commented 3 years ago

This is now solved in SimGrid, which detects the problem when the platform is loaded. The problem should disappear in Batsim after upgrading SimGrid to commit fe620eda26 or more recent, so I am closing this issue. Thanks again for your report :).

TheElectronWill commented 3 years ago

You're welcome! Thanks for the quick reply.