bbudescu opened this issue 1 year ago (status: Open)
The problem here is that, I imagine, the optimizer might get confused by the reporting time being included in the cost of evaluation and start making bad decisions: it tries to get away with less costly evaluations, which aren't actually going to be less costly anyway (e.g., the same trial will show a different cost depending on whether it runs as the first or, say, the 20,000th trial).
Exposing `cost_attr` should be OK. Please feel free to make a PR.
The initializer of the `BlendSearch` class takes a `cost_attr` parameter that lets the user specify which of the reported metrics should be treated as the cost during optimization. When calling `tune.run`, however, `cost_attr` is always assigned the default value of `"auto"`, which falls back to the reported `"time_total_s"`.
I know, one obvious solution would be to instantiate the `BlendSearch` class with the desired custom value for the `cost_attr` param before calling into `tune.run`, and just pass the instance as the `search_alg` param (a rough sketch of this is below). However, that's not as convenient, especially since there is a tiny bit of preprocessing happening within `tune.run` on the params before they are passed to `BlendSearch.__init__` (like assigning some default values). That preprocessing would need to be copied into the `tune.run` caller's scope, which might get out of sync with the upstream code.
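For reference, here's a rough sketch of that workaround. The objective function, the search space, and the `eval_time_s` metric name are made up for illustration, the exact `tune.run` arguments may need adjusting depending on the FLAML version, and this skips whatever preprocessing `tune.run` normally applies before constructing `BlendSearch`:

```python
import time
from flaml import BlendSearch, tune

def my_objective(config):
    """Toy evaluation function; the real one would train/evaluate a model."""
    start = time.monotonic()
    time.sleep(0.01)                          # stand-in for the actual evaluation work
    loss = (config["x"] - 2) ** 2
    eval_time_s = time.monotonic() - start    # measured before tune.report is called,
                                              # so reporting overhead is not included
    tune.report(loss=loss, eval_time_s=eval_time_s)

search_space = {"x": tune.uniform(-10, 10)}

# Build BlendSearch directly so cost_attr can point at the self-measured metric
# instead of the default "time_total_s".
search = BlendSearch(
    metric="loss",
    mode="min",
    space=search_space,
    cost_attr="eval_time_s",
)

analysis = tune.run(
    my_objective,
    search_alg=search,   # bypasses tune.run's own searcher construction
    num_samples=100,
)
```

Measuring the duration inside the evaluation function, before `tune.report` is called, is what keeps the reporting overhead out of the cost signal.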
As a sidenote, here's the reason I want to use a custom `cost_attr`. My cost is the time required to evaluate the configuration, so `time_total_s` should have been precisely what I needed. However, it turns out `time_total_s` also includes the time required to call `tune.report`. Interestingly, after running a lot of trials (20k-30k or so), this duration gradually increases to the point where it overwhelmingly dominates the occupancy of the Ray processes running the evaluations: a Ray worker ends up spending something like 10 times more time waiting for its result to get reported than actually running the evaluation. I haven't measured and pinpointed this exactly, but I suspect it happens because, as the number of trial results grows, it takes more and more time to fit the TPE model on the ever larger amount of data, and `ray.report` blocks until the fitting of previous results is finished.

One way to address this would be to update the model less frequently than upon every single trial result being received, e.g., only update it once for all the results that have accumulated in a queue since the last model update. Even better, the frequency of TPE model fitting could be decreased until the worker waiting time reaches something reasonable, say, under 10% of the total time (a toy sketch of this idea follows below).
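To make the idea concrete, here is a standalone toy sketch (not FLAML code; `fit_model`, the 10% fraction, and the class name are all placeholders): new results are buffered, and the surrogate is refitted only once the backlog is large enough relative to the data it was already fitted on, so the time spent fitting per reported result stays roughly bounded.

```python
class ThrottledSurrogate:
    """Toy sketch: refit the surrogate model in batches rather than per result."""

    def __init__(self, fit_model, min_batch=1, refit_fraction=0.1):
        self._fit_model = fit_model        # callable: list of (config, objective) -> model
        self._history = []                 # every observation seen so far
        self._pending = 0                  # observations added since the last refit
        self._min_batch = min_batch
        self._refit_fraction = refit_fraction
        self.model = None

    def observe(self, config, objective):
        self._history.append((config, objective))
        self._pending += 1
        # Refit only when the backlog is "large enough" relative to what the model
        # has already been fitted on; as the history grows, refits become rarer,
        # so the fitting time per reported result stays roughly bounded.
        threshold = max(self._min_batch, int(self._refit_fraction * len(self._history)))
        if self._pending >= threshold:
            self.model = self._fit_model(self._history)
            self._pending = 0
```

A real version of this would have to live inside the searcher where the TPE fit currently happens, and could adapt the threshold based on the measured fit time to hit the "under 10% of total time" target.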
Perhaps there are other ways to address this, or perhaps it is inherent to the TPE algorithm itself. I guess a good first step would be to profile the guiding process running on the head node, because I'm not even sure that's where the bottleneck occurs; it might be somewhere else altogether, e.g., within the local search code (CFO). If it is indeed the TPE that's slowing things down, another option would be to make the TPE fitting use multiple threads (e.g., the ones waiting for a new suggestion to be made), or maybe even a GPU.
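As for the profiling, assuming the suggestion/fitting logic runs in the driver process that calls `tune.run`, even a plain `cProfile` wrapper around the call should show whether the cumulative time goes into the TPE fit, the CFO local search, or somewhere else entirely; if the searcher actually runs in a separate Ray process, attaching a sampling profiler such as py-spy to that process would be the equivalent. A minimal sketch:

```python
import cProfile
import pstats

from flaml import tune

# my_objective and search: as in the earlier sketch (use a fresh BlendSearch
# instance rather than one that has already been run).
profiler = cProfile.Profile()
profiler.enable()
analysis = tune.run(my_objective, search_alg=search, num_samples=1000)
profiler.disable()

# Sort by cumulative time to see which part of the search stack dominates.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(30)
```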