fteufel opened this issue 3 months ago
This is a great feature request! Something we've been thinking about for a while now and I think you did a great job at summarizing the possibilities we have here.
> There's probably no way to make all of this happen in the polaris source.
You're right!
One note: I think we could still improve the metric system, for example by extending it with serializable, modular preprocessing steps. That way you wouldn't create a `flattened_mcc` metric; instead you would create a `flatten` action and an `mcc` action, and then we'd come up with a system to save a pipeline of such actions as a metric in a benchmark.
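To make that concrete, here is a rough sketch of what such a spec could look like. Nothing in it is an existing Polaris API; the registries and the `evaluate` helper are purely illustrative.

```python
# Hypothetical sketch, not an existing Polaris API: preprocessing "actions"
# and scorers are registered by name, and a metric is stored in a benchmark
# as a small, serializable spec that chains them together.
import numpy as np
from sklearn.metrics import matthews_corrcoef

ACTIONS = {
    # Each action transforms (y_true, y_pred) and returns the new pair.
    "flatten": lambda y_true, y_pred: (np.concatenate(y_true), np.concatenate(y_pred)),
}
SCORERS = {
    "mcc": matthews_corrcoef,
}

def evaluate(spec: dict, y_true, y_pred) -> float:
    """Run a serialized metric spec, e.g. {"actions": ["flatten"], "scorer": "mcc"}."""
    for name in spec.get("actions", []):
        y_true, y_pred = ACTIONS[name](y_true, y_pred)
    return SCORERS[spec["scorer"]](y_true, y_pred)
```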
Ultimately, however, as the scope of Polaris grows there will always be niche, domain-specific or task-specific metrics that we likely cannot all include in the Polaris source code.
> Could a way forward be to allow "metrics as code", implemented following a specified API, to be provided with a benchmark optionally in a .py file?
Yes, this is definitely an interesting possibility, but it's a challenging feature. For such challenging features, we would like to collect some user feedback before we start on implementing them to better understand the requirements. I like that you mention 🤗 ! Such an established product is a good source of inspiration! I'll look into that!
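For reference, the opt-in on the Hugging Face side currently looks roughly like this (the model id below is a placeholder, not a real repository):

```python
# Hugging Face's existing pattern: custom code only runs if the user opts in.
from transformers import AutoModel

# Without trust_remote_code=True, transformers refuses to execute the custom
# modeling code shipped with the repository.
model = AutoModel.from_pretrained("some-org/custom-model", trust_remote_code=True)
```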
**Is your feature request related to a problem? Please describe.**
The metrics currently provided by polaris are decent for standard classification/regression tasks, but there are many problems that might require more sophisticated methods for quantifying performance.
As an example, I have a task where each sample is an array and the labels are an array of the same length, so for each sample every array position carries a label.
For quantifying performance, we want to measure how many positions we got right over the whole dataset, so I would do something like this:
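(A minimal sketch, assuming scikit-learn's `matthews_corrcoef` as the scorer; the helper name `flattened_mcc` is illustrative.)

```python
# Sketch of a "flattened" metric: concatenate the per-sample label arrays
# across the whole dataset and score all positions with a single MCC.
import numpy as np
from sklearn.metrics import matthews_corrcoef

def flattened_mcc(y_true: list[np.ndarray], y_pred: list[np.ndarray]) -> float:
    y_true_flat = np.concatenate(y_true)
    y_pred_flat = np.concatenate(y_pred)
    return matthews_corrcoef(y_true_flat, y_pred_flat)
```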
Right now, as far as I can tell there is no way to specify such a thing. And I can imagine there are numerous cases that are much more involved than this simple flattening step.
**Describe the solution you'd like**
There's probably no way to make all of this happen in the polaris source. Submitting a PR each time and having ad hoc things like `flattened_mcc` in the library doesn't sound like a good idea. Could a way forward be to allow "metrics as code", implemented following a specified API, to be provided optionally with a benchmark in a .py file? Executing third-party code is of course dangerous, but it's doable: e.g. huggingface handles this by forcing the user to manually set `trust_remote_code=True` to use custom models from their hub.
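As a rough illustration of what such an API could look like (everything below is hypothetical, including the `Metric` interface and the `trust_custom_metrics` flag):

```python
# custom_metrics.py -- hypothetical file shipped alongside a benchmark.
# Nothing here is an existing Polaris API; it only sketches the idea of
# "metrics as code" implementing a small, documented interface.
import numpy as np
from sklearn.metrics import matthews_corrcoef

class Metric:
    """Hypothetical interface a custom metric would have to implement."""
    name: str

    def score(self, y_true, y_pred) -> float:
        raise NotImplementedError

class FlattenedMCC(Metric):
    name = "flattened_mcc"

    def score(self, y_true, y_pred) -> float:
        return matthews_corrcoef(np.concatenate(y_true), np.concatenate(y_pred))

# A benchmark loader could then refuse to import this file unless the user
# opts in, mirroring Hugging Face's trust_remote_code=True pattern, e.g. via
# a (hypothetical) flag like: po.load_benchmark(..., trust_custom_metrics=True)
```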
**Describe alternatives you've considered**
Alternatively, just don't allow such things in polaris, and communicate somewhere that only tasks using the available metrics are supported.