oar-team / batsim

Batsim: Infrastructure simulator for job and I/O scheduling
GNU Lesser General Public License v3.0

[bug] wrong job consumed energy when --enable-compute-sharing is activated #65

Closed by Mema5 2 years ago

Mema5 commented 2 years ago

Bug description

When Batsim is run with the --enable-compute-sharing and --energy options, and the scheduler executes several jobs at the same time on the same host, the energy consumption values reported for each job in the output _jobs.csv are wrong.

In fact, SimGrid only allows energy consumption to be monitored at the granularity of a host. The value reported in _jobs.csv for each job is the energy consumed by the entire host over the whole time the job was executing on that host, even if other jobs were sharing compute resources with it. For example, in this _jobs.csv, 3 jobs with 4 parallel tasks each are executed on the same host, a 12-core machine with a maximum power of 217 W (<prop id="wattage_per_state" value="100:100:217"/>). Every job is reported to consume 763840 J, which corresponds to execution_time * max_power = 3520 s * 217 W.
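To make the double counting explicit (using the numbers from this example, and assuming the three jobs fully overlap on the host):

$$3 \times (3520\ \mathrm{s} \times 217\ \mathrm{W}) = 3 \times 763\,840\ \mathrm{J} = 2\,291\,520\ \mathrm{J}$$

so the per-job values sum to three times the energy the host can actually have consumed over that window.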

The total energy (as reported in _schedule.csv for example) is correct.

Versions

Possible fixes

We would need to share the energy cost equitably among the jobs, taking their execution time and requested_number_of_resources into account, as in the sketch below.
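A minimal sketch of that idea (all types and names here are hypothetical, chosen for illustration, and do not come from Batsim's code base): cut the timeline at every job start/end so that the set of running jobs is constant within each segment, then charge each job a share of the segment's host energy proportional to its requested_number_of_resources.

```cpp
// Hypothetical sketch: share a host's measured energy among the jobs
// that ran on it, weighting each job by requested_number_of_resources.
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <vector>

struct Job {
    int id;
    double start, end;        // execution interval [start, end) in seconds
    int requested_resources;  // requested_number_of_resources
    double energy = 0.0;      // equitable share in joules (output)
};

// Assumes a constant host power draw (in watts); the real fix would use
// the host energy actually measured by SimGrid per time interval.
void share_energy(std::vector<Job>& jobs, double host_power_watts)
{
    // Every job start/end is a breakpoint: within a segment, the set of
    // running jobs is constant.
    std::vector<double> points;
    for (const auto& j : jobs) {
        points.push_back(j.start);
        points.push_back(j.end);
    }
    std::sort(points.begin(), points.end());
    points.erase(std::unique(points.begin(), points.end()), points.end());

    for (std::size_t i = 0; i + 1 < points.size(); ++i) {
        const double t0 = points[i], t1 = points[i + 1];
        const double segment_energy = host_power_watts * (t1 - t0);

        int total_requested = 0;
        for (const auto& j : jobs)
            if (j.start <= t0 && t1 <= j.end)
                total_requested += j.requested_resources;
        if (total_requested == 0)
            continue;  // no job running during this segment: nothing to charge

        // Charge each running job its proportional share of the segment.
        for (auto& j : jobs)
            if (j.start <= t0 && t1 <= j.end)
                j.energy += segment_energy * j.requested_resources / total_requested;
    }
}

int main()
{
    // The example from the report: 3 jobs of 4 tasks each, all running
    // for 3520 s on one 12-core host drawing 217 W.
    std::vector<Job> jobs = {{1, 0, 3520, 4}, {2, 0, 3520, 4}, {3, 0, 3520, 4}};
    share_energy(jobs, 217.0);
    for (const auto& j : jobs)
        std::cout << "job " << j.id << ": " << j.energy << " J\n";  // ~254613 J each
}
```

With the three identical jobs from the report, each would then be charged 763840 / 3 ≈ 254613 J instead of the full 763840 J, and the per-job values would again sum to the host total.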

mpoquet commented 2 years ago

Hi! Indeed, the energy consumption of jobs does not make much sense. It was added for convenience's sake after a request by David for this article.

Your fix proposal looks like an improvement over the current code (which just returns the energy consumed by the hosts during the job's execution), but I am not sure that trying to do anything smart with the raw energy values is a good idea (details in the next paragraph). I think this raw value should be kept as-is (though the field should perhaps be renamed to something clearer?), but if you want to implement your fix as another field of _jobs.csv I'll accept the PR :).

I think that computing a useful per-job value would be interesting, but that it requires a lot of modeling/validation work per se. This is very tricky, as it should resemble a "whole life cost analysis" for each job. On a real platform, many shared resources are used during a job's lifecycle, and amortizing and sharing the costs of these resources seems hard, even if we only consider the final part of a job (from the user request to the application being executed, not any data or development needed before the user request). I have in mind the computing and network resources used during the job execution of course, but also any system that enables putting data on and retrieving data from the platform, the cooling systems... Currently, the energy consumption of the network is not enabled in Batsim (that should change in the next release), and the simulation of I/O is optional and can be used in so many ways (the user implements their own model and controls the simulation via an external decision component, not in Batsim) that it seems impossible to include these in a job energy cost. Thermal simulation is possible with Batsim, but using it in this context would require a lot of validation work IMHO.

The consumed_energy field is documented as "The total amount of energy (in joules) consumed by the machines during the execution of the job", which does match the computed value. Maybe a big warning about the usage of this value would be a good addition?
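For illustration, such a warning could read along these lines (this wording is hypothetical, not the text that was eventually merged):

> Warning: when --enable-compute-sharing is activated, consumed_energy reports the energy consumed by the whole host(s) during the job's execution window. Jobs that share a host over the same window each report the full host energy, so these per-job values must not be summed.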

Mema5 commented 2 years ago

I agree with you that attributing an energy value to a job is a tricky question. That said, I might do it eventually for my work, and in that case I'll propose it as a PR.

For now, I propose this addition to the doc, which should hopefully remove any ambiguity. I could not find a better (and short) name for the consumed_energy field.

mpoquet commented 2 years ago

Cherry-picked as 005ea64, thanks :).

I'll close this issue for now, feel free to reopen it if you plan to add the feature soon.