oar-team / batsim

Batsim: Infrastructure simulator for job and I/O scheduling
GNU Lesser General Public License v3.0

Wrong computation time for multicore execution after a sleep #66

Closed (Mema5 closed this issue 2 years ago)

Mema5 commented 2 years ago

Hello. Today I ran into an issue with my experiments. I don't know yet whether the bug is on my side, in Batsim, or in SimGrid.

Bug description

I have two identical jobs with 4 parallel tasks each, submitted at t=0 and t=5000.

Expected behavior: job0 and job1 should have the same execution time. Observed behavior: job1 takes exactly 4 times longer than job0, a factor that matches its number of parallel executors, as if machine0 were no longer running multicore after rebooting.

Versions

Logs

batsim.log with verbosity debug for this experiment.

mpoquet commented 2 years ago

Hmm, I can confirm from reading the Batsim logs that I see the same behavior you described.

Potential issues:

Can you try to write a small SimGrid code to detect if the error comes from SimGrid or Batsim? You can for example use mwe.cpp from SimGrid's issue 37 as a base.

Mema5 commented 2 years ago

Hi @mpoquet, and thanks for your answer. My platform file and workload file are pretty straightforward; I pasted them at the end of this post.

I will try to identify whether the problem comes from SimGrid or Batsim and follow up.

Workload file:

{
    "description": "Test binpacking.",
    "nb_res": 12, 
    "jobs": [
        {"id": "0", "profile": "blast_vm_large", "res": 4, "subtime": 0},
        {"id": "1", "profile": "blast_vm_large", "res": 4, "subtime": 5000}
    ],
    "profiles": {
        "blast_vm_large": {"com": 0.0, "cpu": 1.657216e14, "type": "parallel_homogeneous_total"}
    }
}

Platform file:

<?xml version='1.0'?>
<!DOCTYPE platform SYSTEM "http://simgrid.gforge.inria.fr/simgrid/simgrid.dtd">
<platform version="4.1">
<config id="General">
        <prop id="contexts/stack-size" value="16"></prop>
        <prop id="contexts/guard-size" value="0"></prop>

</config>

<zone id="toy_g5k"  routing="Full">
    <host id="master_host" speed="100Mf">
        <prop id="wattage_per_state" value="100:100:200" />
        <prop id="wattage_off" value="10" />
    </host>

        <host id="taurus_0" core="12" pstate="0" speed="11.77Gf, 1e-9Mf, 0.166666666666667f, 0.006666666666667f">
            <prop id="wattage_per_state" value="100:100:217, 9.75:9.75:9.75, 100:100:100, 125:125:125" />
            <prop id="wattage_off" value="10" />
            <prop id="sleep_pstates" value="1:2:3" />
        </host>
        <host id="taurus_1" core="12" pstate="0" speed="11.77Gf, 1e-9Mf, 0.166666666666667f, 0.006666666666667f">
            <prop id="wattage_per_state" value="100:100:217, 9.75:9.75:9.75, 100:100:100, 125:125:125" />
            <prop id="wattage_off" value="10" />
            <prop id="sleep_pstates" value="1:2:3" />
        </host>
</zone>
</platform>
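
As a rough sanity check on the numbers in the two files above, here is a minimal Python sketch of the runtime one would expect for each job. It is not Batsim or SimGrid code. It assumes that a `parallel_homogeneous_total` profile splits its `cpu` amount evenly across the job's 4 executors, and that each executor gets its own core of `taurus_0` at pstate 0 (11.77 Gf). The thread does not spell out how executors map to cores, so both points are assumptions, and the function names are made up for illustration.

```python
# Back-of-the-envelope check; not Batsim or SimGrid code.
# Values copied from the workload and platform files above.
TOTAL_FLOPS = 1.657216e14  # "cpu" of profile blast_vm_large
NB_EXECUTORS = 4           # "res" of each job
CORE_SPEED = 11.77e9       # pstate 0 of taurus_0 in flop/s (11.77Gf)


def expected_runtime(total_flops: float, nb_executors: int, core_speed: float) -> float:
    """Runtime in seconds, assuming an even split and one dedicated core per executor."""
    return total_flops / nb_executors / core_speed


def serialized_runtime(total_flops: float, core_speed: float) -> float:
    """Runtime in seconds if all executors shared one core (hypothetical faulty case)."""
    return total_flops / core_speed


multicore = expected_runtime(TOTAL_FLOPS, NB_EXECUTORS, CORE_SPEED)
single_core = serialized_runtime(TOTAL_FLOPS, CORE_SPEED)

print(f"one core per executor: {multicore:.1f} s")
print(f"all executors on one core: {single_core:.1f} s")
print(f"ratio: {single_core / multicore:.1f}")
```

Under these assumptions the ratio between the two cases is 4, the same factor as reported between job1 and job0, which is consistent with the 4 executors no longer running in parallel.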

Mema5 commented 2 years ago

Good intuition, @mpoquet: it comes from SimGrid.

I submitted the issue on SimGrid's repo: issue #95.

Mommessc commented 2 years ago

It seems the SimGrid issue was fixed. Can we close this one too?

mpoquet commented 2 years ago

I also think that SimGrid#95 fixed this issue, but the following should be done before marking this issue as closed.

NB: A new SimGrid version should be released very soon.

Mema5 commented 2 years ago

I can confirm that this bug is fixed with Batsim 4.1.0 (the latest Batsim release on NUR-Kapack) and SimGrid 3.31.0.