golemfactory / yapapi

Python high-level API for Golem.
https://yapapi.readthedocs.io/en/stable/
GNU Lesser General Public License v3.0

LeastExpensive Market Strategy not using number of threads of providers #1147

Open RoyKas opened 1 year ago

RoyKas commented 1 year ago

The section of leastexpensive.py below does not use the provider's number of threads when computing the expected usage, which results in an incorrect selection of providers. Both the job duration and the consumed CPU time are set to the same 'expected_usage', which is 60 seconds by default. This is only correct when the provider uses a single thread: for a higher number of threads, the consumed CPU time will be the corresponding multiple of the job duration. Either the job duration should be divided by the number of threads, or the consumed CPU time should be multiplied by it. The first option favors providers with a higher number of threads, the second favors providers with a lower number of threads. If both are undesirable, a third option is to adjust both by the square root of the number of threads, or some other in-between solution (see the sketch after the snippet below).

    expected_usage = []

    for resource in linear.usage_vector:
        if linear.price_for[resource] > self._max_price_for[resource]:
            self._logger.debug(
                "Rejected offer %s: price for '%s' higher than price cap %f.",
                offer.id,
                resource,
                self._max_price_for[resource],
            )
            return SCORE_REJECTED

        if linear.price_for[resource] < 0:
            self._logger.debug("Rejected offer %s: negative price for '%s'", offer.id, resource)
            return SCORE_REJECTED

        expected_usage.append(self._expected_time_secs)  # same value appended for both duration and cpu.seconds
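
For illustration, a minimal sketch of the three adjustments described above - assuming the standard golem.usage.duration_sec / golem.usage.cpu_sec counter names and a thread count taken from the offer; the helper itself is hypothetical, not part of yapapi:

    import math

    def thread_adjusted_usage(resource, expected_time_secs, threads, mode="sqrt"):
        """Illustrative only: scale the expected usage for one counter by the
        provider's thread count."""
        if resource == "golem.usage.duration_sec":   # wall-clock duration counter
            if mode == "duration":                   # option 1: favors more threads
                return expected_time_secs / threads
            if mode == "sqrt":                       # option 3: in-between
                return expected_time_secs / math.sqrt(threads)
        elif resource == "golem.usage.cpu_sec":      # cpu-time counter
            if mode == "cpu":                        # option 2: favors fewer threads
                return expected_time_secs * threads
            if mode == "sqrt":
                return expected_time_secs * math.sqrt(threads)
        return expected_time_secs
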
krunch3r76 commented 1 year ago

first and foremost, thanks for flagging this - such observations are crucial for the community to evolve and refine the project.

the essence of CPU time is that it aggregates the compute cycles used, regardless of the number of cores at play. even when multiple cores work in parallel, their individual compute efforts, when tallied, add up to the same total CPU time - put plainly, the same total CPU time is simply split across the cores (see the worked example below the references). these references provide more detail:

[1] CPU time formula [2] CPU Second definition
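
a tiny worked example of that aggregation, with made-up numbers - the same 60 cpu-seconds of work spread over different core counts:

    work_cpu_seconds = 60  # total compute the job needs, in cpu-seconds

    for cores in (1, 2, 4, 8):
        duration = work_cpu_seconds / cores  # ideal wall-clock time at full load
        print(f"{cores} core(s): duration = {duration:g}s, cpu time = {work_cpu_seconds}s")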

i'm just another contributor like you, aiming to bring more clarity and efficiency to the project. your insights are invaluable, and i appreciate the time you took to articulate this.

shadeofblue commented 1 year ago

@RoyKas well, the default strategy is by necessity a simplification, and it does take the liberty of erring on one side or the other in the specific cases you described...

The assumption here is that, since it's hard to arrive at an uncomplicated solution that covers all bases - as you have observed yourself - we're providing something that tries to give a more-or-less sensible score without differentiating between single-threaded and multi-threaded tasks...

As an app developer, you're free to implement a strategy that will give a better score for a scenario that you know you will be running...
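
For instance, a rough sketch of such a custom strategy wrapping the stock one - the golem.inf.cpu.threads lookup and the score adjustment are assumptions for illustration, not a recommendation:

    from yapapi import Golem
    from yapapi.strategy import LeastExpensiveLinearPayuMS

    class ThreadAwareStrategy(LeastExpensiveLinearPayuMS):
        """Sketch: dampen the stock score for providers advertising more
        threads, whose cpu.seconds accrue faster than wall-clock time."""

        async def score_offer(self, offer) -> float:
            score = await super().score_offer(offer)
            if score < 0:
                return score  # rejected by the base strategy
            threads = offer.props.get("golem.inf.cpu.threads", 1)
            # crude in-between adjustment, akin to the square-root option above
            return score / threads ** 0.5

    # usage: pass the strategy when creating the engine
    # async with Golem(budget=1.0, strategy=ThreadAwareStrategy()) as golem:
    #     ...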

krunch3r76 commented 1 year ago

expected_time_secs is fundamental to the scoring function when comparing offers, as it essentially serves as the sole scaling factor for both cpu time and duration. to understand the role of expected_time_secs, let's examine the relevant code snippets:

in the code:

https://github.com/golemfactory/yapapi/blob/b89b1323bf156466d884dc938937e36ccbdc8d75/yapapi/strategy/least_expensive.py#L66-L84

expected_usage is populated with the value of expected_time_secs for each of the linear_coeffs - duration and cpu, in either order. this list is then passed to the calculate_cost method of the ComLinear class:

https://github.com/golemfactory/yapapi/blob/b89b1323bf156466d884dc938937e36ccbdc8d75/yapapi/props/com.py#L79-L81

here, each of the first two coefficients from linear_coeffs is multiplied by its corresponding value in expected_usage - by default, both cpu-seconds and duration are scaled 60 to 1, while the start coefficient is counted 1 to 1 - and the results are summed.
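
concretely, with the default expected_time_secs = 60 and some made-up prices, the scored cost works out as:

    # hypothetical linear prices (GLM): per duration-second, per cpu-second, start
    price_duration, price_cpu, price_start = 0.002, 0.005, 0.1

    expected_usage = [60, 60]  # the same expected_time_secs for duration and cpu

    cost = price_duration * expected_usage[0] + price_cpu * expected_usage[1] + price_start
    print(cost)  # 0.002*60 + 0.005*60 + 0.1 = 0.52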

after tracing the logic, i see your point: with more threads the job duration shrinks while cpu time does not, so the expected time would be relatively less for duration than for cpu time. however, this strategy does not distinguish the two. as @shadeofblue said, a custom strategy, modeled after this one, would need to be developed.

and even when scoring 1-core providers, clock speed may differ - so even multithreaded comparisons might need to factor in clock speed. would 8 threads at 3GHz finish any faster than 4 threads at 6GHz, for example...

thanks for reading along, great discussion!

RoyKas commented 1 year ago

The point I'm trying to make is that the usage vector is populated with _expected_time_secs both for duration and for cpu.seconds, and therefore it implies a single-core/single-threaded processor. For processors with more cores/threads, the pricing for duration and for cpu.seconds will be weighed differently: cpu.seconds will be approximately an integer multiple of the duration, assuming the task fully loads the processor (see the example below). I see that a solution will not be straightforward, but the code should make it clear to requestors that this default market strategy uses this simplification.
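
To make that concrete, a quick comparison for a hypothetical 8-thread provider kept fully loaded for 60 wall-clock seconds, reusing the made-up prices from the earlier example:

    price_duration, price_cpu, price_start = 0.002, 0.005, 0.1
    threads, duration = 8, 60                 # wall-clock seconds of the job

    scored = price_duration * 60 + price_cpu * 60 + price_start              # 0.52
    actual = price_duration * duration + price_cpu * threads * duration + price_start
    print(scored, actual)                     # 0.52 vs 2.62: cpu.seconds dominate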

krunch3r76 commented 1 year ago

good point. the docstring could make it explicit that the parameter expected_time_secs is a scaling factor for both env/hr and cpu/hr. perhaps if someone, like you or myself, were to add that to the docstring and submit a pull request, the issue could be closed as satisfactorily resolved once merged - after review comments and subsequent revisions, perhaps?
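
one possible wording for such a note (a sketch only, not the actual docstring):

    class LeastExpensiveLinearPayuMS(MarketStrategy):
        """Least-expensive strategy, scoring offers by their expected linear cost.

        Note: expected_time_secs is used as the expected usage for *both* the
        duration counter (env/hr) and the cpu-time counter (cpu/hr), which
        implicitly assumes a single-threaded workload; on a provider with N
        threads under full load, cpu.seconds accrue roughly N times faster
        than wall-clock duration.
        """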

RoyKas commented 11 months ago

I haven't done any such code changes or pull requests before - would you be willing to take this on yourself?

krunch3r76 commented 11 months ago

> I haven't done any such code changes or pull requests before - would you be willing to take this on yourself?

good to know that would resolve the issue! i will see what i can come up with in the near future! 😊