mlcommons / modelgauge

Make it easy to automatically and uniformly measure the behavior of many AI Systems.
https://mlcommons.org/ai-safety/
Apache License 2.0

Improve separation of concerns in tests #570

Open rogthefrog opened 3 weeks ago

rogthefrog commented 3 weeks ago

Continue work started in this PR:

https://github.com/mlcommons/modelgauge/pull/560

Goal: decouple test definitions from test operations.

This is step 2 of potentially 6. I think this will enable faster iterations and integration of different models and evaluation functions, and I'd be OK leaving it like this if that's all we need. But if we do have additional needs, here are the steps.

Step 1 was making the private annotators' public interface more consistent across annotators, to make it easier and less risky to add new ones to modelgauge as they were being developed and iterated on in the private repo. This was done loosely: there's no actual API contract (ABCs or interfaces) the annotators need to implement, just enough consistency to let engineering integrate them into modelgauge quickly and safely, and to let non-engineers adhere to it without requiring more advanced Python features like ABCs or refactoring their existing annotators.
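To make the "loose convention" concrete, here is a minimal sketch of what such an annotator might look like. There is no ABC enforcing anything; the names (`Config`, `annotate`, the return shape) are illustrative examples of a shared surface, not modelgauge's actual annotator contract.

```python
# Hypothetical sketch of a convention-following annotator. Nothing here
# is enforced by an ABC or interface; it just keeps the same shape as
# its siblings so engineering can integrate it without reading internals.

class ExampleAnnotator:
    """Follows the convention: ships a Config-like object plus annotate()."""

    class Config:
        # Convention: every annotator module exposes its settings here.
        model_name = "example-annotator-v1"
        requires_secrets = ["example_api_key"]

    def annotate(self, text: str) -> dict:
        # Convention: same entry point and return shape on every annotator.
        # (Toy logic, purely for illustration.)
        return {"is_safe": "unsafe" not in text}


ann = ExampleAnnotator()
print(ann.annotate("a harmless prompt"))  # {'is_safe': True}
```

Because the contract is only a convention, a new annotator that drifts from it still imports and runs; it just becomes harder to integrate, which is the trade-off step 1 accepted.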

Step 2 (this) is to hide the internals of annotators and expose a simple interface to client code like safe tests, so that engineering can program to that interface. This keeps our client code cleaner, more readable, more testable, and easier to reason about. More interestingly, it allows people to create their own arbitrary groups of annotators and ensemble evaluation functions without engineering having to do much work to add them. They just create an AnnotatorSet with whatever combination of annotators and ensemble evaluation function they want, including their configuration, and we can pull it in and use it in tests. So if you have annotators A, B, C, D, and E, you can run A,B,C together, A,B,C,D,E together, A,B,E together, etc. with minimal effort.
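The AnnotatorSet idea above can be sketched in a few lines. This is a toy illustration of the shape, assuming a set bundles a list of annotators with an ensemble evaluation function; the class and method names are hypothetical, not modelgauge's actual API.

```python
# Hypothetical sketch: an AnnotatorSet bundles annotators with an
# ensemble evaluation function behind one small interface, so client
# code (e.g. a safe test) never touches annotator internals.

class AnnotatorSet:
    def __init__(self, annotators, evaluator):
        self.annotators = annotators  # e.g. [A, B, C]
        self.evaluator = evaluator    # ensemble function over their outputs

    def evaluate(self, item):
        responses = [a.annotate(item) for a in self.annotators]
        return self.evaluator(responses)


# Toy annotators and a toy majority-vote ensemble function:
class KeywordAnnotator:
    def __init__(self, keyword):
        self.keyword = keyword

    def annotate(self, text):
        return self.keyword in text


def majority_vote(responses):
    return sum(responses) > len(responses) / 2


# Any combination of annotators can be bundled with minimal effort:
safe_set = AnnotatorSet(
    [KeywordAnnotator("attack"), KeywordAnnotator("weapon"), KeywordAnnotator("harm")],
    majority_vote,
)
print(safe_set.evaluate("how to attack and harm"))  # True (2 of 3 vote unsafe)
```

Swapping in the A,B,E or A,B,C,D,E grouping from the text is just a different list passed to the constructor, with no changes to the client code that calls `evaluate()`.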

Possible future steps if we find they are needed:

- a more generic way to inject the ensemble evaluation function into the AnnotatorSet, like a Strategy pattern. Currently it's hardcoded to EnsembleAnnotator.simple_join_evaluator_responses because that's what we have. This PR makes it relatively easy to support either 1) an arbitrary method in the EnsembleAnnotator class (not just majority vote) or 2) a different EnsembleAnnotator object altogether.
- enforce the structure of annotators in code (e.g. provide ABCs) rather than by convention ("hey so-and-so, make sure your client module includes a Config-like object to help us integrate it"). Some classes already have ABCs to enforce this, but the ensemble model wrappers that use them don't.
- a declarative way to create annotator bundles. This would increase complexity and I don't know if there would be much benefit, but conceptually it's the logical termination of this work. E.g. instead of writing EnsembleAnnotatorSet and all its annotator loading and scoring logic in Python, we could expose a way for others to create arbitrary bundles declaratively in yaml or json or whatever, specifying their package names and evaluation functions as fully qualified names, authentication parameters in a requirements section (kind of like wants in systemd), etc.

Another thing we may consider:

instead of model client code managing third-party configuration explicitly (secrets, hostnames, endpoints, etc.), we could conceivably use mixins for families of models. E.g. instead of

```python
MISTRAL_8x22B_CONFIG.llm_config.api_key = self.configuration["together_api_key"]
LLAMA_3_70B_CONFIG.llm_config.api_key = self.configuration["together_api_key"]
```

we could make the Mistral and Llama3 clients implement a Together mixin, and likewise have a HuggingFace mixin for services hitting models on HF, a VLLM mixin for models we host, etc.
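A minimal sketch of that mixin idea, assuming each client carries a `configuration` dict as in the snippet above. The names (`TogetherMixin`, `apply_provider_config`, `Llama3Client`) are hypothetical, invented for illustration.

```python
from types import SimpleNamespace  # stand-in for a real config object

class TogetherMixin:
    """Hypothetical mixin: supplies Together-specific settings to any
    client that mixes it in, replacing the per-client boilerplate of
    CONFIG.llm_config.api_key = self.configuration["together_api_key"]."""

    def apply_provider_config(self, config):
        # Every Together-backed client shares this one secret lookup.
        config.llm_config.api_key = self.configuration["together_api_key"]
        return config


class Llama3Client(TogetherMixin):
    """A model client picks up Together wiring just by inheriting the mixin."""

    def __init__(self, configuration):
        self.configuration = configuration


client = Llama3Client({"together_api_key": "secret-123"})
config = SimpleNamespace(llm_config=SimpleNamespace(api_key=None))
client.apply_provider_config(config)
print(config.llm_config.api_key)  # secret-123
```

A HuggingFace or vLLM mixin would follow the same shape, keeping provider-specific plumbing (endpoints, auth) out of each individual model client.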

Is it possible we could improve things here by separating concerns? One of my ongoing struggles with the Tests is that they do all of these, and maybe more:

- express the formal definition of a benchmark component and its subcomponents that needs to be stable for reproducibility reasons
- instantiate and manage those subcomponents
- perform a variety of evaluation-time actions
- manage and call (or trigger the calling of) a variety of external APIs, hiding the operational details

For a while, I've found the secrets stuff especially annoying, which is why ModelBench has the ModelGaugeSut, a wrapper that allows us to refer to a ModelGauge SUT without having to fully instantiate it. But over time, I'm finding the operational issues at least as much of a headache. For example, if many SUTs are calling the same Together API, to get maximum speed I need to centrally manage the connections to respect rate limits. Or looking at the HuggingFace stuff, having to manage instances and possibly wait many minutes for them shouldn't be hidden away in a variety of Test subcomponents; that's something better handled at a high level. That will become even more of an issue once we're using our own annotator VMs in earnest.
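The "centrally manage the connections" point could look something like the sketch below: one process-wide limiter that every SUT calling the same provider goes through, instead of each SUT throttling itself inside a Test subcomponent. This is a toy illustration of the idea, not anything in modelgauge today; the class name and rate are invented.

```python
import threading
import time

class SharedRateLimiter:
    """Hypothetical sketch: a single limiter shared by all SUTs that hit
    the same external API (e.g. Together), enforcing a global request rate
    rather than per-SUT throttling buried in Test subcomponents."""

    def __init__(self, requests_per_second):
        self.min_interval = 1.0 / requests_per_second
        self.lock = threading.Lock()
        self.next_allowed = 0.0  # monotonic timestamp of the next free slot

    def acquire(self):
        # Reserve the next slot under the lock, then sleep outside it so
        # other callers can queue up their own slots concurrently.
        with self.lock:
            now = time.monotonic()
            wait = max(0.0, self.next_allowed - now)
            self.next_allowed = max(now, self.next_allowed) + self.min_interval
        if wait > 0:
            time.sleep(wait)


# Every SUT's API call would pass through the one shared limiter:
limiter = SharedRateLimiter(requests_per_second=10)
for _ in range(3):
    limiter.acquire()  # spaced at least 0.1s apart across all callers
```

The same high-level object could also own connection pooling or, for the HuggingFace case, the "wait many minutes for an instance" step, so that slow operational concerns live in one visible place rather than inside each Test.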

I'm not sure what the right breakdown is, but I'm hoping we can keep the Test objects themselves to doing the defining and the straightforward calculations. And it would be great if we could make things more composable, too. Although we do eventually want a locked-down set of Test objects to go with a locked-down set of Benchmarks, for now we want to iterate quickly.

rogthefrog commented 2 weeks ago

@bkorycki is working on this!