Closed lccasagrande closed 3 years ago
Hello and thanks for reporting this issue :). I am very busy this week and I will not be able to look at it, but I'll do it next week!
Hi,
I changed the values of `watt_per_state` to see what happens, using `100.0:5000.0` for pstate 1 of machine 0 and `1.0:50.0` for pstate 1 of machine 1.
With a walltime of 4, the total energy consumed is 4 × (5000 + 50), as expected.
With a walltime of 3, the total becomes 3 × (5000 + 1). It seems that only the first machine is computing here. Adding more hosts to the platform and requesting more resources for the job gives the same result: one host computing and the others idle.
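The arithmetic above can be checked with a small sketch. It assumes each `watt_per_state` value reads as `idle:computing` watts, which is inferred from the numbers in this comment rather than taken from the SimGrid documentation. The `energy` helper is hypothetical and is not part of Batsim or SimGrid.

```python
# watt_per_state values quoted above, read as (idle_watts, computing_watts).
# Assumption: the "a:b" format means idle:computing, as the arithmetic
# in the comment suggests.
HOSTS = [(100.0, 5000.0), (1.0, 50.0)]


def energy(duration, computing_flags):
    """Energy in joules if each host stays idle or computing for `duration` seconds."""
    total_power = sum(
        computing if busy else idle
        for (idle, computing), busy in zip(HOSTS, computing_flags)
    )
    return duration * total_power


# Walltime 4: both hosts computing, matches the reported value.
both_computing_4s = energy(4, [True, True])

# Walltime 3: the reported value matches host 1 being idle...
reported_3s = energy(3, [True, False])

# ...whereas both hosts computing would give this instead.
expected_3s = energy(3, [True, True])

print(both_computing_4s, reported_3s, expected_3s)
```

Under that reading, the walltime-3 run is 147 J short of what two computing hosts would consume, which is exactly 3 s times the 49 W gap between idle and computing on machine 1.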
Another point to note is that with a walltime of 4 the job is still marked as `COMPLETED_WALLTIME_REACHED`, but the energy consumption is correct.
Investigations to be continued...
Oops, the week was a bit long, sorry about this =/.
I can reproduce the issue. I strongly suspect that https://github.com/simgrid/simgrid/issues/189 is related to the current issue. If I remember correctly, the energy consumption of parallel tasks is wrong because the load of single-core machines is not computed correctly (or not used correctly). I think we should fix https://github.com/simgrid/simgrid/issues/189 first, as it may fix the walltime-specific issue as a side effect.
We started to work on fixing https://github.com/simgrid/simgrid/issues/189.
From @Mommessc's first attempts, the current issue seems unrelated to https://github.com/simgrid/simgrid/issues/189.
Now that https://github.com/simgrid/simgrid/issues/189 is closed, is this issue fixed as well?
I tested it today and I'm still getting wrong energy/power values.
Here are the results I got from the consumed energy CSV exported by Batsim.
time | energy | event_type | wattmin | epower |
---|---|---|---|---|
7.501e-05 | 0.022503 | s | 406 | 300 |
3.00008 | 1059.02 | e | 406 | 353 |
The last `epower` value (353) should have been 406 W, which is the power consumed when two hosts are computing. It seems that only one machine is computing while the other one remains idle.
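As a sanity check, the `epower` column appears consistent with the average power between consecutive rows, that is, the energy difference divided by the time difference. The sketch below recomputes it from the two rows above. The helper name, the starting point of 0 J at t = 0, and the assumption that the two hosts are identical are all illustrative guesses, not something confirmed by Batsim.

```python
# (time in s, cumulative energy in J) taken from the CSV rows above.
rows = [(7.501e-05, 0.022503), (3.00008, 1059.02)]


def average_power(t0, e0, t1, e1):
    """Average power in watts between two cumulative energy readings."""
    return (e1 - e0) / (t1 - t0)


# Assumption: the simulation starts at t = 0 with 0 J consumed.
before_job = average_power(0.0, 0.0, *rows[0])
during_job = average_power(*rows[0], *rows[1])

# Assumption: two identical hosts, so per-host power is half of each total.
idle_per_host = 300.0 / 2        # both hosts idle       -> 300 W
computing_per_host = 406.0 / 2   # both hosts computing  -> 406 W
one_idle_one_computing = idle_per_host + computing_per_host

print(round(before_job, 3), round(during_job, 3), one_idle_one_computing)
```

If the identical-hosts assumption holds, 353 W is exactly what one idle host plus one computing host would draw, which matches the observation that only one machine seems to be computing.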
I think this is not specific to energy/power consumption but is a more general problem with how SimGrid handles an execution depending on whether or not it will time out. I will create an issue on the SimGrid GitHub with a MWE.
@lccasagrande FYI after almost two years a fix was just made on SimGrid master to solve this problem.
Description
I've noticed that the power consumed by a killed job is incorrect in some situations.
It appears that this only happens when there is more than one host and the walltime is lower than the runtime. When walltime >= runtime the result is correct, but when walltime < runtime the consumed power is wrong.
Setting:
How to reproduce:
I've uploaded the workload and the platform I've used. To check this behavior I recommend three tests:
batsim_files.zip
Possible fixes
This seems to be a problem in SimGrid, but I'm not sure. Do you have any idea? I can investigate it this weekend.