Right now, if a trial fails, we simply continue.
This seems reasonable for benchmark environment failures; however, for lower-level failures (e.g., VM, OS) that the leaf environments rely upon, it can hide config errors that are only visible at runtime and should be addressed.
For instance, an ARM template error will simply loop with 400 errors until the max_iterations count is reached. This isn't helpful.
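To make the failure mode concrete, here is a minimal sketch of the kind of loop behavior described, assuming a simplified trial loop; `Status`, `suggest_config`, `run_trial`, and `MAX_ITERATIONS` are illustrative stand-ins, not the framework's actual API:

```python
from enum import Enum
from typing import Optional, Tuple

class Status(Enum):
    SUCCEEDED = "succeeded"
    FAILED = "failed"

def suggest_config() -> dict:
    """Illustrative stand-in for the optimizer's next suggested config."""
    return {}

def run_trial(config: dict) -> Tuple[Status, Optional[float]]:
    """Illustrative stand-in for deploying the environment and running one trial."""
    # Pretend the ARM template deploy keeps returning HTTP 400.
    return Status.FAILED, None

MAX_ITERATIONS = 100

for _ in range(MAX_ITERATIONS):
    status, score = run_trial(suggest_config())
    if status is Status.FAILED:
        # No distinction between "this config made the benchmark fail" and
        # "the VM never deployed": we simply continue, so a fatal infra
        # error loops until MAX_ITERATIONS is exhausted.
        continue
    # (register `score` with the optimizer here)
```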
A couple of thoughts:
1. Mark certain environments as "tolerating" errors in their scripts (perhaps up to some limited number).
   - The main candidates I can think of for this are the actual benchmark runs.
   - This could be a way of identifying which environments to care about without relying on strange heuristics for the leaf nodes.
   - For instance, there are often LocalEnv scripts that re-parse result files after the actual benchmark runs. If those fail, it's likely due to an error in the script, and we should abort early in order to catch and fix it.
   - On the other hand, if the benchmark itself fails, it may be due to a problem with the config, which we should inform the optimizer about. Even there, though, we may want to cap the number of tolerated failures before requiring manual inspection, since repeated failures could still indicate a script error.
2. Special-case certain environments as requiring an immediate abort. For instance, VMEnv and OSEnv don't really do anything regarding actual application measurements and could probably be assumed to need to fail immediately upon a deploy error. (A rough sketch of both ideas follows the list.)
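Combining both ideas, something like the following could drive the continue-vs-abort decision; the `FailPolicy` knobs, the `on_trial_failure` hook, and the environment names used for dispatch are hypothetical illustrations, not existing config options:

```python
from dataclasses import dataclass
from typing import Dict

@dataclass
class FailPolicy:
    """Per-environment failure handling (hypothetical config knobs)."""
    tolerate: bool = False   # report the failure to the optimizer and keep going?
    max_failures: int = 0    # tolerated failures before requiring manual inspection

# Idea 1: the benchmark environment tolerates a limited number of failures.
# Idea 2: infra environments (VMEnv, OSEnv) and result-parsing LocalEnv scripts
#         abort immediately, since a failure there is likely a script/deploy error.
POLICIES: Dict[str, FailPolicy] = {
    "BenchmarkEnv": FailPolicy(tolerate=True, max_failures=10),
    "LocalEnv":     FailPolicy(tolerate=False),
    "VMEnv":        FailPolicy(tolerate=False),
    "OSEnv":        FailPolicy(tolerate=False),
}

failure_counts: Dict[str, int] = {}

def on_trial_failure(env_name: str) -> None:
    """Decide whether to continue the loop or abort after a failed trial."""
    policy = POLICIES.get(env_name, FailPolicy())
    failure_counts[env_name] = failure_counts.get(env_name, 0) + 1
    if not policy.tolerate:
        raise RuntimeError(
            f"{env_name} failed; aborting immediately (likely a script or deploy error)")
    if failure_counts[env_name] > policy.max_failures:
        raise RuntimeError(
            f"{env_name} failed {failure_counts[env_name]} times; stopping for manual inspection")
    # Otherwise: register the failure with the optimizer and continue.
```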