lvwerra opened this issue 2 years ago
As per our meeting today, we proposed to have a standardized structure for inputs, in dictionary form. An initial proposal of that structure could be:

{
    "references": ...,
    "predictions": ...,
}

with `references` and `predictions` being of any format (strings, images, numbers, vectors, etc.). I was looking at examples of computer vision metrics and this should work for those as well.
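For illustration, a minimal sketch of what a call following this structure could look like (using the current `evaluate.load` / `compute` API purely as an example):

import evaluate

# Illustrative only: the two standardized top-level fields proposed above.
inputs = {
    "references": [0, 1, 1, 0],    # could also be strings, images, vectors, ...
    "predictions": [0, 1, 0, 0],
}

accuracy = evaluate.load("accuracy")
print(accuracy.compute(**inputs))  # e.g. {'accuracy': 0.75}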
Edge cases:

- metrics that take `sources` as an additional input
- metrics that need an `average` parameter for multiclass labels, but we could define a default if needed

I think we could have additional, optional, `source` and `average` inputs, but I don't really know what to do for perplexity :sweat_smile: (I mean, in any case, the metrics will not function without these arguments, but I guess waiting for them to crash isn't the best solution)
CC @lvwerra @lhoestq @apsdehal
Just to distinguish the two cases: `sources` is part of the input data to the metric (one source per sample), while `average` is a mandatory parameter of the metric in the multilabel setting.

I think it's fine if we allow certain metrics to have extra inputs like `sources` - but it must be possible to know that in advance; maybe we can provide a property on the `Metric` object that lists the expected inputs, and extra inputs should have standard names as well.

It's also fine to require certain metric parameters depending on the task, like for F1 - we can define that when we start to map metrics to tasks.

Regarding perplexity, I guess it will depend on how it's used:

- Is it a tool to evaluate a model based on its predictions? Does it have to fit the other metrics' format to be used in automated evaluations? In this case it should probably take a model output as metric input, maybe the logits as "predictions"?
- Is it a measurement tool for data? In this case the current implementation with "input_texts" is the most convenient.
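A rough sketch of what such an "expected inputs" property could look like; the `input_names` attribute is purely hypothetical here (only `compute` and its arguments exist today):

import evaluate

comet = evaluate.load("comet")   # a translation metric that also needs sources
comet.input_names                # hypothetical -> ["sources", "predictions", "references"]

f1 = evaluate.load("f1")
f1.input_names                   # hypothetical -> ["predictions", "references"]

# `average` stays a parameter rather than an input, passed at compute time:
f1.compute(predictions=[0, 2, 1], references=[0, 1, 2], average="macro")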
Sorry, indeed, I should have distinguished inputs and parameters :)
Also regarding perplexity:

The currently implemented version aims at being a data measurement tool, if I understand correctly - you provide text and a model. Although one could evaluate a model in that setting, it would be less efficient as it requires an additional forward pass. So we could move this version of perplexity to a new folder (e.g. `measurements`) and load it with `load_measurement("perplexity")` in the future. Since these tools will measure different modalities, should we call the input `input_data` to make it agnostic?
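A minimal sketch of what that could look like, assuming the proposed `load_measurement` entry point and `input_data` argument (neither exists yet; the parameter names are illustrative):

# Hypothetical API following the proposal above, not the current library:
perplexity = load_measurement("perplexity")
result = perplexity.compute(
    input_data=["The quick brown fox jumps over the lazy dog."],  # modality-agnostic input field
    model_id="gpt2",                                              # assumed parameter naming the model to score with
)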
Sure, that could work! I mean, having two separate versions of the code seems a bit redundant, but I agree that the way it's implemented now makes it stand out from the other metrics :)
IIRC we discussed for the postprocessing step that a more flexible alternative is to add a method to the class such that we could do:
>>> metric = load_metric("bleu")
>>> metric.output_transform(lambda x: x["bleu"])
...
>>> metric.compute()
7.4 # instead of default {"bleu": 7.4, "precisions": 0.4, "brevity_penalty": ...}
Is that what you had in mind @lhoestq?
@sashavor did you see many metrics that violated the proposed format? I think most of them are in that format since it used to be enforced? Then the next thing is to go one level deeper and look at what's in `references` and `predictions` in the individual metrics to make them uniform, too?
The only metrics that violate the proposed format are in the Edge cases in my comment above -- all the other metrics that we currently have are compatible with it.

But yes, the next step would be standardizing the `references` and `predictions` format to... a dictionary as well?
IIRC we discussed for the postprocessing step that a more flexible alternative is to add a method to the class such that we could do:

metric = load_metric("bleu")
metric.output_transform(lambda x: x["bleu"])
...
metric.compute()
7.4 # instead of default {"bleu": 7.4, "precisions": 0.4, "brevity_penalty": ...}

Is that what you had in mind @lhoestq?
Yup, we can even name the method `.postprocess(...)` ^^
But yes, the next step would be standardizing the references and predictions format to... a dictionary as well?
Why not keep lists of strings or floats directly? Or use dictionaries if we feel like having explicit named fields like `texts`, `tokenized_texts`, `labels` could help - but maybe it would make it less convenient to use.
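To make the two options concrete (field names taken from the comment above; the dictionary layout is hypothetical):

# Option 1: plain lists of strings or floats
predictions = ["a cat sits on the mat"]
references = ["the cat sits on the mat"]

# Option 2: dictionaries with explicit named fields (hypothetical layout)
predictions = [{"texts": "a cat sits on the mat", "tokenized_texts": ["a", "cat", "sits", "on", "the", "mat"]}]
references = [{"texts": "the cat sits on the mat", "labels": 1}]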
As per @yjernite 's proposal, maybe it makes sense to have two categories of metrics: the ones with references and the ones without references (like perplexity and stuff like n-gram diversity). Do you think that makes sense, @lvwerra and @lhoestq ?
Would these be the metrics that you call measurements? Could we set `references=None` in these cases so they don't throw an error if a reference is passed? That way they would be easily compatible with the other metrics like BLEU or ROUGE.
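A tiny sketch of that idea (argument names are illustrative; the current perplexity implementation takes different arguments):

# Hypothetical behavior: referenceless metrics accept, and simply ignore, references,
# so they can be called the same way as BLEU or ROUGE.
perplexity = load_metric("perplexity")
perplexity.compute(
    predictions=["The quick brown fox jumps over the lazy dog."],
    references=None,   # accepted but ignored
)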
Sure, that could work! I think @yjernite preferred the term "referenceless metrics" because they actually measure model performance, but if we implement them with `references=None` by default, that would work!
I like referenceless metrics / metrics for unsupervised learning.
In terms of implementation if possible I would remove any mention of "references" to not make things confusing
So you would have a different input structure for the referenceless metrics?
Yes I think it's fine
For perplexity, to fit it into a generic structure and also cover other use cases, it would make sense to have a `preprocess` step? Something like:

perplexity = load_metric("perplexity")
model = SomeAutoModel()

def transform(x):
    x.update({"logits": model.compute_logits(x["references"])})
    return x

perplexity.set_preprocess_transform(transform)
perplexity.compute(some_reference_list)
A little late to the conversation, but for additional context:
"reference-less" metrics are particularly relevant for NLG evaluation, and probably going to become more common in the near future. Perplexity is one case, n-gram diversity another, and approaches like BOLD try to measure model biases without a reference. They are used to measure model performance, so should really live with the other metrics rather than as data measurements.
I'd recommend looking at the GEM-metrics repo if you haven't already as Sebastian has put a lot of thoughts into metrics for NLG there (and I think he'd also be glad if we could make that repo obsolete so he doesn't have to keep maintaining it :stuck_out_tongue_winking_eye: )
In particular: https://github.com/GEM-benchmark/GEM-metrics/blob/main/gem_metrics/metric.py
From a design perspective, I guess one option would be to have the model prediction as a required input, plus a flexible "extra information" field that may or may not include references, and could also hold the source for text simplification metrics or the supporting paragraph for metrics like QuestEval (which measures whether an answer provided by a QA system is supported by the paragraph).
wdyt?
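A rough sketch of that layout, purely for illustration (the `extra` field name and the metric names are assumptions):

# Hypothetical "prediction + extra information" design:
questeval = load_metric("questeval")   # hypothetical metric name
questeval.compute(
    predictions=["Paris"],                                            # model prediction: always required
    extra={"source_paragraph": ["Paris is the capital of France."]},  # task-specific supporting information
)

simplification = load_metric("some_simplification_metric")   # hypothetical
simplification.compute(
    predictions=["About 95 species are known."],
    extra={
        "sources": ["About 95 species are currently accepted."],   # source for simplification metrics
        "references": [["About 95 species are known."]],           # references, when available
    },
)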
We didn't talk about standardizing outputs much until now -- all metrics currently output `dict`s, except `cer` and `wer` (which output a `float`), and `mauve` (outputs a `types.SimpleNamespace`, whatever that is).
@lhoestq, you mentioned that it's better to output floats instead of dicts, for compatibility with some kind of framework?
Keras takes a list of "keras metrics" (i.e. callables with input tensors `y_true`, `y_pred` and one float as output) for evaluation. But it seems that using something other than Keras internals to compute the metrics is not recommended: https://github.com/keras-team/keras/issues/9640

Alternatively one can use "keras callbacks". There is a nice example that shows how to use callbacks to run a scikit-learn metric in Keras: https://keras.io/examples/keras_recipes/sklearn_metric_callbacks/ - though it's some boilerplate code, maybe we can do something about it.
I think we can use metrics that return a dictionary with keras callbacks - this way Keras users can use any metric in our catalog. Keras callbacks can also be used if we return only a float.
Though since our metrics don't subclass the keras callback class, we might need to add a way to convert our metrics to keras callbacks.
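For reference, a hedged sketch of such a conversion, in the spirit of the scikit-learn callback recipe linked above (the class name and details are illustrative, not a planned API):

import numpy as np
import tensorflow as tf
import evaluate

class EvaluateMetricCallback(tf.keras.callbacks.Callback):
    """Runs an `evaluate` metric on a validation set at the end of each epoch."""

    def __init__(self, metric_name, x_val, y_val):
        super().__init__()
        self.metric = evaluate.load(metric_name)
        self.x_val, self.y_val = x_val, y_val

    def on_epoch_end(self, epoch, logs=None):
        # score the model's validation predictions with the evaluate metric
        preds = np.argmax(self.model.predict(self.x_val, verbose=0), axis=-1)
        result = self.metric.compute(predictions=preds, references=self.y_val)
        if logs is not None:
            logs.update(result)  # the dict output merges straight into Keras' logs
        print(f"epoch {epoch}: {result}")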
Returning a single float is still interesting in some cases. For example it makes things less redundant: a user has to write

load_metric("accuracy").compute()["accuracy"]

to get the result, while it's obvious they want the "accuracy" field of the output. Similarly, if they use `evaluate` and write a report in YAML, they would have to write things like

results:

Therefore I would actually explore if it's feasible to split the metrics so that they return a single value when possible.

The metrics that don't return a single value can be considered "multi-output" metrics and have to be treated differently in reports (maybe using a nested approach as presented in the YAML example above).
Would love your comments and ideas on this subject !
I have been thinking about the scalar vs. dict question. Having a dict across all metrics, at least internally, is nice as it allows us to treat them all the same way, and we can also combine metrics by merging dicts. At the same time, we could check if the return is just a dict with one value and, if that's the case, just return its value.
metric = evaluate.load("accuracy")
metric.compute(...)
>>> 0.6

metric = evaluate.load("accuracy", force_dict=True)
metric.compute(...)
>>> {"accuracy": 0.6}
What do you think @lhoestq?
Regarding Keras, I'll think a bit more about how to do that smoothly.
I think the issue was combining metrics too, no?
I.e. if I run a bunch of metrics and I want to transmit the results somewhere, it's easier to already have them in a dict.
Why not make the `force_dict` version the default one and then add a `keras` (or `scalar`) flag as an alternative?
I think @lhoestq's point was that for many metrics the user might just want a single value but always needs to access the dict: `load_metric("accuracy").compute()["accuracy"]`.

When we combine metrics we can also check whether the return type is a dictionary or a scalar and make sure we combine them correctly.
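A minimal sketch of that check when combining (the function name and behavior are illustrative, not the actual combine implementation):

# Illustrative only: normalize scalar and dict outputs when merging results.
def merge_results(named_results):
    merged = {}
    for metric_name, result in named_results:
        if isinstance(result, dict):
            merged.update(result)            # multi-output metric: keep its own keys
        else:
            merged[metric_name] = result     # scalar metric: key it by the metric name
    return merged

print(merge_results([("accuracy", 0.6), ("seqeval", {"overall_f1": 0.8})]))
# {'accuracy': 0.6, 'overall_f1': 0.8}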
Ok, that works too!
I think I would expect this code to run:
acc = load("accuracy")
f1 = load("f1")
metrics = combine_metrics([acc, f1])
output = metrics.compute(...)
print(output["accuracy"])
print(output["f1"])
without having to specify `force_dict`, no?
Maybe `combine_metrics` can accept single-output metrics as well as multi-output metrics. For a single-value metric it would take the metric name as the key, and for multi-output metrics it would take the output keys as is.

Another option would be to always take the metric name as key.
For example, for `accuracy` and `seqeval`, instead of having

output["accuracy"], output["macro avg"]

we could have

output["accuracy"], output["seqeval"]["macro avg"]
Let's discuss this tomorrow together if you want ?
Any updates on this? I'm running into the headache that, indeed, I need to write my own metric-specific edge cases depending on the expected input. If there is a consensus about how to move forward with this, I can work on it relatively elaborately!
Also note that existing metrics may need to be checked. Some of them force type transformations themselves, which I think we want to avoid or make part of the library rather than a metric requirement.
We're faced with this lack of standardisation too. In our package we're working with metrics in a generic fashion, and not having a uniform output format across the metrics is quite frustrating. We can hack our way out of it by checking the dtype of the outputs, but it would be much preferable if they all returned the same format.

Given that some metrics inherently output multiple values (such as `squad_v2`), I'd argue that a dict would make the most sense. This would be a simple (but breaking) change: changing `cer`, `wer` and `mauve` to return dicts.
I can create a quick PR fix if that would be alright, @lvwerra?
We tried to make dicts for most metrics, but it's possible we missed those. Yes, feel free to open a PR, but we should also check that some of the popular libraries (e.g. `transformers`) using `evaluate` will not break if we merge this.
Two things we can do:

- check how these metrics are used in the wild (e.g. search for `evaluate.load("cer")` on GitHub)
- check `evaluate.__version__` and only return a dict for the new release, so existing installations don't suddenly break.

Do you want to have a stab at it?
@lvwerra Hello. Can someone help with https://github.com/huggingface/evaluate/issues/516 ?
I am trying to combine two metrics within an ASR task, and it seems that my problem is related to the topic discussed here. I am using a notebook from @sanchit-gandhi to fine-tune Whisper, but if I use the approach of combining the WER and CER metrics, everything breaks down: the `Trainer` does not get what it needs from the `compute_metrics` function. It seems that the problem is that the metrics return a value rather than a dictionary....
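For what it's worth, a hedged sketch of one possible workaround under that assumption: compute WER and CER separately and build the dictionary that `compute_metrics` must return (the `processor` object and decoding details are assumptions borrowed from the fine-tuning notebook):

import evaluate

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

def compute_metrics(pred):
    # decode predictions and labels to text; `processor` comes from the fine-tuning notebook
    pred_str = processor.batch_decode(pred.predictions, skip_special_tokens=True)
    label_str = processor.batch_decode(pred.label_ids, skip_special_tokens=True)
    # build the dict the Trainer expects, regardless of whether each metric returns a bare float
    return {
        "wer": wer_metric.compute(predictions=pred_str, references=label_str),
        "cer": cer_metric.compute(predictions=pred_str, references=label_str),
    }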
Currently there are several different input/output formats possible in `Metrics`. We should standardize them as much as possible, respecting the following principle:

For the output standardization: probably a dictionary structure, even if nested, would be ok. Also, a dedicated output class could be considered, like in `transformers` models, but this is probably not necessary here. To make it compatible with e.g. `keras` we could add a postprocess function at initialization, similar to `transform` in `datasets`.

There are three options we could implement: