fteufel opened this issue 3 months ago
This is a great feature request! Something we've been thinking about for a while now and I think you did a great job at summarizing the possibilities we have here.
> There's probably no way to make all of this happen in the polaris source.
You're right!
One note: I think we could still improve the metric system, for example by extending it with serializable, modular preprocessing steps. That way you wouldn't create a `flattened_mcc` metric; instead you would create a `flatten` action and an `mcc` action, and then we'd come up with a system to save a pipeline of such actions as a metric in a benchmark.
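To make that concrete, here is a rough sketch of what such a spec could look like. Nothing in it is an existing Polaris API; the registries and the `evaluate` helper are purely illustrative.

```python
# Hypothetical sketch, not an existing Polaris API: preprocessing "actions"
# and scorers are registered by name, and a metric is stored in a benchmark
# as a small, serializable spec that chains them together.
import numpy as np
from sklearn.metrics import matthews_corrcoef

ACTIONS = {
    # Each action transforms (y_true, y_pred) and returns the new pair.
    "flatten": lambda y_true, y_pred: (np.concatenate(y_true), np.concatenate(y_pred)),
}
SCORERS = {
    "mcc": matthews_corrcoef,
}

def evaluate(spec: dict, y_true, y_pred) -> float:
    """Run a serialized metric spec, e.g. {"actions": ["flatten"], "scorer": "mcc"}."""
    for name in spec.get("actions", []):
        y_true, y_pred = ACTIONS[name](y_true, y_pred)
    return SCORERS[spec["scorer"]](y_true, y_pred)
```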
Ultimately, however, as the scope of Polaris grows there will always be niche, domain-specific or task-specific metrics that we likely cannot all include in the Polaris source code.
> Could a way forward be to allow "metrics as code", implemented following a specified API, to be provided with a benchmark optionally in a .py file?
Yes, this is definitely an interesting possibility, but it's a challenging feature. For such challenging features, we would like to collect some user feedback before we start on implementing them to better understand the requirements. I like that you mention 🤗 ! Such an established product is a good source of inspiration! I'll look into that!
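For reference, the opt-in on the Hugging Face side currently looks roughly like this (the model id below is a placeholder, not a real repository):

```python
# Hugging Face's existing pattern: custom code only runs if the user opts in.
from transformers import AutoModel

# Without trust_remote_code=True, transformers refuses to execute the custom
# modeling code shipped with the repository.
model = AutoModel.from_pretrained("some-org/custom-model", trust_remote_code=True)
```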
**Is your feature request related to a problem? Please describe.**
The metrics currently provided by polaris are decent for standard classification/regression tasks, but there are many problems that might require more sophisticated methods for quantifying performance.
As an example, I have a task where each sample is an array and the labels are an array of the same length, so for each sample every array position carries a label.
For quantifying performance, we want to measure how many positions we got right over the whole dataset, so I would do something like this:
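(A minimal sketch, assuming scikit-learn's `matthews_corrcoef` as the scorer; the helper name `flattened_mcc` is illustrative.)

```python
# Sketch of a "flattened" metric: concatenate the per-sample label arrays
# across the whole dataset and score all positions with a single MCC.
import numpy as np
from sklearn.metrics import matthews_corrcoef

def flattened_mcc(y_true: list[np.ndarray], y_pred: list[np.ndarray]) -> float:
    y_true_flat = np.concatenate(y_true)
    y_pred_flat = np.concatenate(y_pred)
    return matthews_corrcoef(y_true_flat, y_pred_flat)
```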
Right now, as far as I can tell there is no way to specify such a thing. And I can imagine there are numerous cases that are much more involved than this simple flattening step.
**Describe the solution you'd like**
There's probably no way to make all of this happen in the polaris source. Submitting a PR each time and having ad hoc things like `flattened_mcc` in the library doesn't sound like a good idea. Could a way forward be to allow "metrics as code", implemented following a specified API, to be provided optionally with a benchmark in a .py file? Executing third-party code is of course dangerous, but it's doable: e.g. huggingface handles this by forcing the user to manually set `trust_remote_code=True` to use custom models from their hub.
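As a rough illustration of what such an API could look like (everything below is hypothetical, including the `Metric` interface and the `trust_custom_metrics` flag):

```python
# custom_metrics.py -- hypothetical file shipped alongside a benchmark.
# Nothing here is an existing Polaris API; it only sketches the idea of
# "metrics as code" implementing a small, documented interface.
import numpy as np
from sklearn.metrics import matthews_corrcoef

class Metric:
    """Hypothetical interface a custom metric would have to implement."""
    name: str

    def score(self, y_true, y_pred) -> float:
        raise NotImplementedError

class FlattenedMCC(Metric):
    name = "flattened_mcc"

    def score(self, y_true, y_pred) -> float:
        return matthews_corrcoef(np.concatenate(y_true), np.concatenate(y_pred))

# A benchmark loader could then refuse to import this file unless the user
# opts in, mirroring Hugging Face's trust_remote_code=True pattern, e.g. via
# a (hypothetical) flag like: po.load_benchmark(..., trust_custom_metrics=True)
```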
**Describe alternatives you've considered**
Alternatively, just don't allow such things in polaris, and communicate somewhere that only tasks using the available metrics are supported.