oar-team / batsim

Batsim: Infrastructure simulator for job and I/O scheduling
GNU Lesser General Public License v3.0

[bug] Bad consumed power when job is killed #51

Closed lccasagrande closed 3 years ago

lccasagrande commented 5 years ago

Description

I've noticed that the consumed power by a killed job is incorrect in some situations.

It appears that this only happens when there is more than one host and the walltime is lower than the runtime. When walltime >= runtime it's ok, but when walltime < runtime the consumed power is wrong.

Setting:

How to reproduce:

I've uploaded the workload and the platform I've used. To check this behavior I recommend three tests:

  1. Execute Batsim and Batsched with the attached workload and platform. The resulting power will be incorrect (1059W). It should be 3 * 203 * 2 = 1218W because the walltime is 3 seconds and it's lower than the runtime, which is 4 seconds.
  2. Change the job walltime to 4 (same as runtime) and execute again. The power will be correct (4 * 203 * 2 = 1624W).
  3. Change the platform to have just one host and change the workload accordingly. The power will be correct (3 * 203 * 1 = 609W).
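
The expected values in the three tests follow one formula: duration times per-host power times number of hosts, where the duration is capped by the walltime. A minimal sketch of that arithmetic, assuming 203 is the power of one computing host (inferred from the numbers above); `expected_energy` is a hypothetical helper, not part of Batsim:

```python
def expected_energy(runtime_s, walltime_s, watts_per_host, n_hosts):
    """Value expected if every allocated host computes until the job ends."""
    # The job ends at its natural runtime or when the walltime kills it,
    # whichever comes first.
    duration_s = min(runtime_s, walltime_s)
    return duration_s * watts_per_host * n_hosts


# Assumed: 203 W per computing host, as implied by the figures in the report.
test_1 = expected_energy(runtime_s=4, walltime_s=3, watts_per_host=203, n_hosts=2)
test_2 = expected_energy(runtime_s=4, walltime_s=4, watts_per_host=203, n_hosts=2)
test_3 = expected_energy(runtime_s=4, walltime_s=3, watts_per_host=203, n_hosts=1)
print(test_1, test_2, test_3)
```

Only test 1 disagrees with what Batsim reports (1059 instead of 1218).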

batsim_files.zip

Possible fixes

This seems to be a problem in SimGrid, but I'm not sure. Do you have any idea? I can investigate it this weekend.

mpoquet commented 5 years ago

Hello and thanks for reporting this issue :). I am very busy this week and I will not be able to look at it, but I'll do it next week!

Mommessc commented 5 years ago

Hi,

I changed the values of watt_per_state to see what happens, using 100.0:5000.0 for pstate 1 of machine 0, and 1.0:50.0 for pstate 1 of machine 1. With a walltime of 4, the total energy consumed is 4 * (5000 + 50) as expected. With a walltime of 3, the computation becomes 3 * (5000 + 1). It seems that only the first machine is computing here (and adding more hosts to the platform and requesting more resources for the job gives the same result: 1 host computing and the others idle).
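
The two totals can be reproduced with a small model in which each host draws either its idle or its computing power. This is only a sketch of the arithmetic in the comment above; `total_energy` and the `(idle, computing)` tuples are hypothetical names, and the pairs are read as idle:computing watts for pstate 1.

```python
def total_energy(duration_s, hosts, computing):
    """Sum per-host power over `duration_s` seconds.

    hosts: list of (idle_watts, computing_watts) pairs.
    computing: set of indices of hosts that are actually computing.
    """
    power = sum(
        full if index in computing else idle
        for index, (idle, full) in enumerate(hosts)
    )
    return duration_s * power


# watt_per_state values from the experiment: machine 0, then machine 1.
HOSTS = [(100.0, 5000.0), (1.0, 50.0)]

walltime_4 = total_energy(4, HOSTS, computing={0, 1})   # both hosts computing
walltime_3 = total_energy(3, HOSTS, computing={0})      # observed: host 1 idle
walltime_3_expected = total_energy(3, HOSTS, computing={0, 1})
print(walltime_4, walltime_3, walltime_3_expected)
```

Choosing very different idle and computing values per host is what makes the idle second machine visible in the total.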

Another point to note is that with a walltime of 4 the job is marked as COMPLETED_WALLTIME_REACHED anyway, but the energy consumption is correct.

Investigations to be continued...

mpoquet commented 5 years ago

Oops, the week was a bit long, sorry about this =/.

I can reproduce the issue. I highly suspect that https://github.com/simgrid/simgrid/issues/189 is related with the current issue. If I remember correctly, the energy consumption of parallel tasks is wrong because the load of single-core machines is not computed correctly (or not used correctly). I think we should fix https://github.com/simgrid/simgrid/issues/189 first, as it may fix the walltime-specific issue as a side effect.

mpoquet commented 5 years ago

We started to work on fixing https://github.com/simgrid/simgrid/issues/189.
From @Mommessc's first tries, the current issue seems unrelated to https://github.com/simgrid/simgrid/issues/189.

augu5te commented 4 years ago

Now that https://github.com/simgrid/simgrid/issues/189 is closed, is this issue fixed as well?

lccasagrande commented 4 years ago

I tested it today and I'm still getting wrong energy/power values.

Setting

Test Result

Here are the results I got from the consumed energy CSV exported by Batsim.

time        energy      event_type  wattmin  epower
7.501e-05   0.022503    s           406      300
3.00008     1059.02     e           406      353

The last epower value (353) should've been 406W, which is the power consumed when two hosts are computing. It seems that only one machine is computing while the other one remains idle.
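
The 353 figure can be cross-checked from the two CSV rows: the energy difference divided by the time difference gives the average power over the job. The sketch below also splits the totals per host, assuming two identical hosts (so 300 / 2 idle and 406 / 2 computing); that split is an inference from the numbers, not something the CSV states, and `average_power` is a hypothetical helper.

```python
def average_power(t0_s, e0_j, t1_s, e1_j):
    """Average power in watts between two (time, energy) samples."""
    return (e1_j - e0_j) / (t1_s - t0_s)


# The two rows of the consumed energy CSV shown above.
avg_w = average_power(7.501e-05, 0.022503, 3.00008, 1059.02)

# Assumption: two identical hosts, so per-host power is half of each total.
IDLE_W = 300 / 2        # epower with both hosts idle
COMPUTING_W = 406 / 2   # expected epower with both hosts computing

one_host_computing_w = COMPUTING_W + IDLE_W
both_hosts_computing_w = 2 * COMPUTING_W
print(round(avg_w, 2), one_host_computing_w, both_hosts_computing_w)
```

Under that assumption the measured average matches one host computing plus one host idle, not two hosts computing.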

Mommessc commented 4 years ago

I think this is not related to energy/power consumption but is a more general problem in how SimGrid handles an execution that may or may not time out. I will create an issue on the SimGrid GitHub with a MWE.

Mommessc commented 3 years ago

@lccasagrande FYI after almost two years a fix was just made on SimGrid master to solve this problem.