huggingface / evaluate

🤗 Evaluate: A library for easily evaluating machine learning models and datasets.
https://huggingface.co/docs/evaluate
Apache License 2.0

Add `Evaluator` class to easily evaluate a combination of (model, dataset, metric) #23

Closed lvwerra closed 2 years ago

lvwerra commented 2 years ago

Similar to the Trainer class in transformers, it would be nice to easily evaluate a model on a dataset given a metric. We could use the Trainer, but it comes with a lot of unused extra functionality and is transformers-centric. Alternatively, we could build an Evaluator as follows:

from evaluate import Evaluator
from evaluate import load_metric
from datasets import load_dataset
from transformers import pipeline

metric = load_metric("bleu")
dataset = load_dataset("wmt19", language_pair=("de", "en"))
pipe = pipeline("translation", model="opus-mt-de-en"))

# WMT specific transform
dataset = dataset.map(lambda x: {"source": x["translation"]["de"], "target": x["translation"]["en"]}) 

evaluator = Evaluator(
    model=pipe,
    dataset=dataset,
    metric=metric,
    dataset_mapping={"model_input": "source", "references": "target"}
)

evaluator.evaluate()
>>> {"bleu": 12.4}

The dataset_mapping maps the dataset columns to the inputs for the model and metric. Using the pipeline API as the standard interface for the Evaluator, this could easily be extended to any other framework: the user would just need to set up a pipeline-like class whose inputs and outputs follow the same format and which implements a __call__ method.

The advantage of starting with the pipeline API is that in transformers it already implements a lot of quality-of-life functionality such as batching and GPU support, and it also abstracts away the pre- and post-processing.
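
As a concrete illustration of that requirement, a wrapper for a non-transformers model could be as thin as the following sketch (the class and method names here are made up, not an existing API):

class MyTranslationPipeline:
    """Hypothetical wrapper: any framework can be plugged in as long as the
    outputs match the transformers pipeline format."""

    def __init__(self, model):
        self.model = model

    def __call__(self, inputs):
        # Pre-process, predict, post-process, and return the same output format
        # as the transformers translation pipeline: a list of dicts.
        # `self.model.translate` stands in for whatever framework-specific call
        # produces a translation.
        return [{"translation_text": self.model.translate(text)} for text in inputs]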

In #16 it is mentioned that statistical significance testing would be a desired feature. The above example could be extended to enable this:

evaluator.evaluate(n_runs=42)
>>> [{"bleu": 12.4}, {"bleu": 8.3}, ...]

Where under the hood the random seed is changed between the runs.
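
Under the hood this could be as simple as reseeding between runs; a rough sketch (set_seed from transformers is just one possible way to reseed, and the helper name here is made up):

from transformers import set_seed

def evaluate_n_runs(evaluator, n_runs=42):
    # Change the seed before each run and collect one result dict per run,
    # which can then be fed into a significance test.
    results = []
    for seed in range(n_runs):
        set_seed(seed)
        results.append(evaluator.evaluate())
    return results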

cc @douwekiela @osanseviero @nrajani @lhoestq

lhoestq commented 2 years ago

Cool! I think of this class as a nice bonus, similar to transformers, where everything can also be used separately in your own training loop.

Maybe let's see how the evaluation-as-a-service project deals with connecting model/dataset/metric, and derive the internals of the Evaluator from there once it becomes more mature? cc @lewtun

douwekiela commented 2 years ago

I think one way to validate whether we like this idea is to implement some of the intended use-cases:

  1. Significance testing via bootstrap sampling as in @lvwerra 's example
  2. Latency/efficiency metrics (e.g., examples/s? hardware matters here though, how do we deal with that?)
  3. Uncertainty estimates (e.g. variational dropout?)

My question would be, how do we add these in a way that's (mostly) model/dataset/metric-agnostic?
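
One way item 1 could stay (mostly) agnostic is to bootstrap over cached predictions and references, so the resampling never touches the model or the dataset format; a rough sketch (function name and signature invented here):

import random

def bootstrap_metric(metric, predictions, references, n_resamples=1000, seed=0):
    # Resample (prediction, reference) pairs with replacement and recompute the
    # metric each time; the spread of the scores gives a confidence estimate.
    rng = random.Random(seed)
    n = len(predictions)
    scores = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(
            metric.compute(
                predictions=[predictions[i] for i in idx],
                references=[references[i] for i in idx],
            )
        )
    return scores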

lvwerra commented 2 years ago

If I understand correctly 1 & 3 could be implemented essentially the same way: run the same evaluation several times with different seeds and do some aggregation at the end. For 3) one would just need to make sure dropout is activated, right?
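
For the dropout part, a minimal sketch (assuming a PyTorch-backed transformers pipeline and the evaluator from the first example; not a settled design):

pipe.model.train()                                   # keep dropout active at inference time
results = [evaluator.evaluate() for _ in range(10)]  # Monte-Carlo-style repeated runs
pipe.model.eval()                                    # restore deterministic inference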

Maybe we can talk to the optimization team that did some experiments on 2 on different hardware/setups. Maybe @lewtun knows something about it?

lewtun commented 2 years ago

Maybe we can talk to the optimization team that did some experiments on 2 on different hardware/setups. Maybe @lewtun knows something about it?

This is WIP as part of https://github.com/huggingface/optimum/issues/128, but we don't have anything concrete available yet. Usually what we do is fix the hardware to a specific instance (e.g. a specific Intel CPU VM on AWS) and then compute latencies at various percentiles like p50, p90 and p95.
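
For reference, a bare-bones version of that measurement could look like the sketch below (pipe and the inputs are placeholders; this is not Optimum's actual benchmarking code):

import time

import numpy as np

inputs = ["example input one", "example input two"]  # placeholder inputs
latencies = []
for text in inputs:
    start = time.perf_counter()
    pipe(text)  # time one forward pass on fixed hardware
    latencies.append(time.perf_counter() - start)

p50, p90, p95 = np.percentile(latencies, [50, 90, 95])
print(f"p50={p50 * 1e3:.1f} ms, p90={p90 * 1e3:.1f} ms, p95={p95 * 1e3:.1f} ms")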

lvwerra commented 2 years ago

Here is a draft of what an Evaluator could look like:

class Evaluator:
    def __init__(self, pipe, data, metric, label_map=None):
        self.pipe = pipe            # a transformers pipeline (or any callable with the same interface)
        self.data = data            # a dataset with "inputs" and "references" columns
        self.metric = metric        # a loaded metric exposing compute()
        self.label_map = label_map  # maps the pipeline's label strings to the dataset's label ids

    def compute(self):
        # Run the pipeline over all inputs, convert the predicted label strings
        # to ids, and score them against the references.
        predictions = self.pipe(self.data["inputs"], truncation=True)
        predictions = [self.label_map[element["label"]] for element in predictions]
        result = self.metric.compute(predictions=predictions, references=self.data["references"])
        return result

Then you could evaluate a (model, dataset, metric) combo with:

from transformers import pipeline
from datasets import load_dataset
from evaluate import load_metric

pipe = pipeline("text-classification")
metric = load_metric("accuracy")

ds = load_dataset("imdb")
ds = ds["test"].shuffle().select(range(32)) # just for speed
ds = ds.rename_columns({"text": "inputs", "label": "references"})

evaluator = Evaluator(pipe, ds, metric, label_map={"NEGATIVE": 0, "POSITIVE": 1})

evaluator.compute()
>>> {'accuracy': 0.90625}

There is maybe a better way to deal with the input/output handling between pipeline, dataset, and metric. What do you think @douwekiela?

douwekiela commented 2 years ago

Yeah this makes a ton of sense to me! My only qualm would be that it'd be even more elegant if I wouldn't have to rename the columns and pass the label_map (can these not be set in dataset metadata?).

lewtun commented 2 years ago

Yeah this makes a ton of sense to me! My only qualm would be that it'd be even more elegant if I wouldn't have to rename the columns and pass the label_map (can these not be set in dataset metadata?).

We do have a column mapping field called col_mapping in the dataset metadata that is currently used for AutoTrain - I think this could be reused here and would only require the Evaluator to follow that convention (basically inputs are mapped to text and labels are mapped to target).

Regarding label_map, this is tricky because the model owner can choose arbitrary names for the model's classes (e.g. NEGATIVE, neg, LABEL_0, etc). If you have ideas, I'd love to hear them because we're also facing this issue on the evaluation side: some evaluations crash because of the label mismatch 🙀
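
One partial mitigation (just an idea, not an agreed convention): when the model config's label2id is filled in, the Evaluator could fall back on it instead of requiring an explicit label_map:

label2id = pipe.model.config.label2id  # e.g. {"NEGATIVE": 0, "POSITIVE": 1}
outputs = pipe(ds["inputs"], truncation=True)
# This only works when the model owner set label2id consistently with the dataset;
# for arbitrary names (LABEL_0, neg, ...) a user-supplied mapping is still needed.
predictions = [label2id[output["label"]] for output in outputs]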

lvwerra commented 2 years ago

We do have a column mapping field called col_mapping in the dataset metadata that is currently used for AutoTrain - I think this could be reused here and would only require the Evaluator to follow that convention (basically inputs are mapped to text and labels are mapped to target).

That would mean we need to load the dataset's metadata from the Hub, since that information is not in the Dataset object itself, right? That means downloading the README.md from the Hub and then using huggingface_hub.load_metadata. We could pass just the dataset name to Evaluator and do the loading and parsing under the hood, but that would not work so well for custom datasets.

Alternatively, we could be very transparent about what the inputs should be. Maybe we can do something as follows:

evaluator = Evaluator('text-classification')

print(evaluator)
>>> Expected data format: 
>>> name: "inputs", type: str, description: "input texts"
>>> name: "labels", type: Union[string, int], description: labels

evaluator.compute(pipe, ds, metric, label_map={"NEGATIVE": 0, "POSITIVE": 1})

My feeling is that this feature will frequently be used on custom/local/in-house datasets anyway, where we don't have access to meta information. For datasets on the Hub we could also provide a helper function (e.g. infer_input_map(dataset_name)) that does the steps mentioned above, such that:

ds = ds.rename_columns(evaluate.infer_input_map("imdb"))

lewtun commented 2 years ago

Good point about custom datasets and I like your idea about explicitly showing the expected inputs / outputs in the repr!

For Hub datasets, you don't have to download any files as you can ping the datasets API directly, e.g.

import requests
from typing import Dict, Union

def get_metadata(dataset_name: str) -> Union[Dict, None]:
    # Query the Hub's datasets API and return the `train-eval-index` metadata if it exists.
    data = requests.get(f"https://huggingface.co/api/datasets/{dataset_name}").json()
    if data["cardData"] is not None and "train-eval-index" in data["cardData"].keys():
        return data["cardData"]["train-eval-index"]
    else:
        return None

metadata = get_metadata("imdb")

The only "problem" is that we've defined our column mappings to align with AutoTrain, and that taxonomy might not be as convenient / flexible for what you're trying to do.

julien-c commented 2 years ago

data = requests.get(f"https://huggingface.co/api/datasets/{dataset_name}").json()

or just huggingface_hub.dataset_info(dataset_name)
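
A minimal sketch of that approach (the attribute holding the card metadata may be named differently across huggingface_hub versions):

from huggingface_hub import dataset_info

info = dataset_info("imdb")
# Depending on the huggingface_hub version, the card metadata lives on
# `cardData` or `card_data`.
card_data = getattr(info, "cardData", None) or getattr(info, "card_data", None)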

lewtun commented 2 years ago

or just huggingface_hub.dataset_info(dataset_name)

I did not know this trick either 🤯 !

fxmarty commented 2 years ago

Hello, I would be a user of a similar feature in https://github.com/huggingface/autoquantize (an API to launch evaluation of Optimum quantized models vs. the transformers baseline). For the moment I have written my own evaluation scripts using pipelines.

Just wanted to point out that using pipelines for evaluation does not work out of the box; see for example https://github.com/huggingface/transformers/issues/17305 and https://github.com/huggingface/transformers/issues/17139. At least I haven't found a way to make it work in a task-independent way. If somebody is working on this, I would be glad to discuss.

My approach is the following: https://github.com/fxmarty/optimum/tree/runs-only/optimum/utils/preprocessing

See as well https://github.com/huggingface/optimum/pull/194 .

ping @mfuntowicz as well