bpkroth opened 1 year ago
Some additional notes:
In the case of throughput- or latency-based benchmarks, it's not totally clear how to detect that a specific trial is doing worse than a previous one, since a trial that starts out slowly could theoretically speed up later in its run.
But for raw time-based benchmarks, what we could do is track the worst value seen so far and abort any trial that exceeds it. To do that, we'd need some additional metadata indicating that the benchmark is in fact seeking to minimize wallclock time.
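A minimal sketch of that idea (class and method names here are hypothetical, not actual mlos_bench APIs):

```python
import time


class WallclockAbortMonitor:
    """Hypothetical sketch: abort a running trial once its elapsed
    wallclock time exceeds the worst completed trial seen so far."""

    def __init__(self) -> None:
        self._worst_seen: float | None = None  # worst completed wallclock time

    def record_completed(self, wallclock_secs: float) -> None:
        """Record the wallclock time of a successfully completed trial."""
        if self._worst_seen is None or wallclock_secs > self._worst_seen:
            self._worst_seen = wallclock_secs

    def should_abort(self, trial_start: float) -> bool:
        """Poll periodically: abort once the running trial is already worse."""
        if self._worst_seen is None:
            return False  # no completed baseline to compare against yet
        return (time.time() - trial_start) > self._worst_seen
```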
What's tricky is how we incorporate metrics from aborted trials. Imagine, for instance, that you wanted to explain why some params/trials were bad: by aborting those trials early, you give up on gathering that data.
Moreover, we can't actually store a real time value for that trial, since we abort it early. Instead, we need to store it in the DB as "ABORTED" or some such and then fabricate a value for it each time we train the optimizer, likely $W+\epsilon$, where $W$ is the worst value seen up until that point (i.e., found by serially examining the historical trial data).
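Here's a minimal sketch of that imputation pass (the column name, the `EPSILON` value, and the use of pandas are all assumptions, not the actual mlos_bench schema):

```python
import pandas as pd

EPSILON = 1.0  # hypothetical penalty margin beyond the worst value seen


def impute_aborted(trials: pd.DataFrame) -> pd.DataFrame:
    """Replace "ABORTED" scores with W + EPSILON, where W is the worst
    (largest) wallclock time among the trials completed before that point,
    found by a serial pass over the historical trial data."""
    worst_so_far = None
    scores = []
    for value in trials["score"]:  # assumes rows are in submission order
        if value == "ABORTED":
            # No trial should abort before a completed baseline exists,
            # but guard against it anyway.
            scores.append(float("nan") if worst_so_far is None
                          else worst_so_far + EPSILON)
        else:
            value = float(value)
            worst_so_far = value if worst_so_far is None else max(worst_so_far, value)
            scores.append(value)
    result = trials.copy()
    result["score"] = scores
    return result
```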
Per discussions, we need:

- an `abort` command for the `run` phase
- a `status` or `telemetry` phase that includes commands used to asynchronously poll the status of a `run` phase (or should it also support other phases?) in order to feed in-progress metrics back into the system and allow specifying one of those metrics as the abort criterion (maybe just an implicit elapsed time, but probably not, since sometimes the db needs to be reloaded, for instance, and other times it doesn't, so the `run` phase overall may take longer on occasion even when the actual benchmark portion doesn't); see the sketch after this list
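As a rough sketch of how those phases might fit together (a hypothetical interface, not the actual mlos_bench `Environment` API):

```python
from abc import ABC, abstractmethod
from typing import Any


class TrialEnvironment(ABC):
    """Hypothetical phase interface sketch -- not an actual mlos_bench API."""

    @abstractmethod
    def run(self, params: dict[str, Any]) -> None:
        """Start the benchmark for the given params (possibly long-running)."""

    @abstractmethod
    def status(self) -> dict[str, float]:
        """Asynchronously poll in-progress metrics (e.g., elapsed time,
        current throughput) without waiting for run() to finish."""

    @abstractmethod
    def abort(self) -> None:
        """Stop the in-flight run; the trial gets recorded as ABORTED."""
```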
Further generalizing this via async `telemetry` collection during the process might be nice too.
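For instance, an async polling loop along these lines (reusing the hypothetical `TrialEnvironment` and `WallclockAbortMonitor` sketches above; the polling interval is made up) could log in-progress telemetry and trigger the abort check:

```python
import asyncio
import time


async def poll_telemetry(env: TrialEnvironment,
                         monitor: WallclockAbortMonitor,
                         telemetry_log: list[dict[str, float]],
                         interval_secs: float = 5.0) -> None:
    """Hypothetical async loop: poll status, log telemetry, maybe abort."""
    trial_start = time.time()
    while True:
        # Pull in-progress metrics from the status/telemetry phase so they
        # survive for later analysis even if the trial is aborted.
        telemetry_log.append(env.status())
        if monitor.should_abort(trial_start):
            env.abort()  # trial gets recorded as ABORTED in the DB
            return
        await asyncio.sleep(interval_secs)
```

In a real scheduler, this task would presumably be cancelled when `run()` completes normally, so the loop only needs to handle the abort path itself.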