Cool! I think of this class as a nice bonus, similarly to `transformers`: everything can be used separately in your own `for` loop for training.

Maybe let's see how the evaluation-as-a-service project deals with the way to connect model/dataset/metric and derive the internals of the `Evaluator` from there when it becomes more mature? cc @lewtun
I think one way to validate whether we like this idea is to implement some of the intended use-cases:
My question would be, how do we add these in a way that's (mostly) model/dataset/metric-agnostic?
If I understand correctly, 1 & 3 could be implemented essentially the same way: run the same evaluation several times with different seeds and do some aggregation at the end. For 3) one would just need to make sure dropout is activated, right?
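To make that concrete, here is a rough sketch (not the proposed API, just an illustration) of repeating an evaluation under different seeds and aggregating the scores, assuming an `Evaluator` object like the draft further down in this thread and `transformers.set_seed`:

```python
import numpy as np
from transformers import set_seed

def evaluate_over_seeds(evaluator, seeds=(0, 1, 2, 3, 4)):
    # Run the same evaluation once per seed; the seed controls shuffling,
    # sampling and (if left active during inference) dropout.
    runs = []
    for seed in seeds:
        set_seed(seed)
        runs.append(evaluator.compute())  # e.g. {"accuracy": 0.91}
    # Aggregate each metric key across runs into mean/std.
    keys = runs[0].keys()
    return {key: {"mean": float(np.mean([r[key] for r in runs])),
                  "std": float(np.std([r[key] for r in runs]))} for key in keys}
```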
Maybe we can talk to the optimization team that did some experiments on 2 on different hardware/setups. Maybe @lewtun knows something about it?
> Maybe we can talk to the optimization team that did some experiments on 2 on different hardware/setups. Maybe @lewtun knows something about it?
This is WIP as part of https://github.com/huggingface/optimum/issues/128 but we don't have anything concrete available yet. Usually what we do is fix the hardware to a specific instance (e.g. a specific Intel CPU VM on AWS) and then compute latencies at various percentiles like p50, p90 and p95
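As an illustration of that kind of latency reporting (a generic sketch, not taken from optimum), one could time repeated pipeline calls and summarize them with percentiles:

```python
import time
import numpy as np

def latency_percentiles(pipe, inputs, n_runs=100):
    # Time individual pipeline calls and report common latency percentiles in ms.
    latencies = []
    for _ in range(n_runs):
        start = time.perf_counter()
        pipe(inputs)
        latencies.append((time.perf_counter() - start) * 1000)
    return {f"p{p}": float(np.percentile(latencies, p)) for p in (50, 90, 95)}
```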
Here is a draft of what an `Evaluator` could look like:
```python
class Evaluator:
    def __init__(self, pipe, data, metric, label_map=None):
        self.pipe = pipe
        self.data = data
        self.metric = metric
        self.label_map = label_map

    def compute(self):
        # Run the pipeline on the inputs, map its string labels to the
        # dataset's integer labels, and pass everything to the metric.
        predictions = self.pipe(self.data["inputs"], truncation=True)
        predictions = [self.label_map[element["label"]] for element in predictions]
        result = self.metric.compute(predictions=predictions, references=self.data["references"])
        return result
```
Then you could evaluate a (model, dataset, metric) combo with:
```python
from transformers import pipeline
from datasets import load_dataset
from evaluate import load_metric

pipe = pipeline("text-classification")
metric = load_metric("accuracy")

ds = load_dataset("imdb")
ds = ds["test"].shuffle().select(range(32))  # just for speed
ds = ds.rename_columns({"text": "inputs", "label": "references"})

evaluator = Evaluator(pipe, ds, metric, label_map={"NEGATIVE": 0, "POSITIVE": 1})
evaluator.compute()
>>> {'accuracy': 0.90625}
```
There is maybe a better way to deal with the input/output between pipeline, dataset, and metric. What do you think @douwekiela?
Yeah this makes a ton of sense to me! My only qualm would be that it'd be even more elegant if I wouldn't have to rename the columns and pass the label_map (can these not be set in dataset metadata?).
> Yeah this makes a ton of sense to me! My only qualm would be that it'd be even more elegant if I wouldn't have to rename the columns and pass the label_map (can these not be set in dataset metadata?).
We do have a column mapping field called `col_mapping` in the dataset metadata that is currently used for AutoTrain - I think this could be reused here and would only require the `Evaluator` to follow that convention (basically inputs are mapped to `text` and labels are mapped to `target`).

Regarding `label_map`, this is tricky because the model owner can choose arbitrary names for the model's classes (e.g. `NEGATIVE`, `neg`, `LABEL_0`, etc.). If you have ideas, I'd love to hear them because we're also facing this issue on the evaluation side: some evaluations crash because of the label mismatch 🙀
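One possibility (just a sketch, not something agreed in this thread): when the dataset column is a `ClassLabel`, try to align the model's `id2label` names with the dataset's class names by simple normalization. The helper name and matching heuristics below are made up for illustration:

```python
def build_label_map(pipe, class_label):
    # class_label is a datasets.ClassLabel feature, e.g. names=["neg", "pos"].
    dataset_names = {name.lower(): i for i, name in enumerate(class_label.names)}
    label_map = {}
    for idx, model_name in pipe.model.config.id2label.items():
        key = model_name.lower()
        if key in dataset_names:            # exact match, e.g. "neg" == "neg"
            label_map[model_name] = dataset_names[key]
        elif key.startswith("label_"):      # fall back on LABEL_<i> -> i
            label_map[model_name] = int(key.split("_")[-1])
        else:
            # last resort: substring match ("negative" vs "neg"),
            # otherwise keep the model's own index.
            matches = [i for name, i in dataset_names.items() if name in key or key in name]
            label_map[model_name] = matches[0] if matches else idx
    return label_map
```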
> We do have a column mapping field called `col_mapping` in the dataset metadata that is currently used for AutoTrain - I think this could be reused here and would only require the `Evaluator` to follow that convention (basically inputs are mapped to `text` and labels are mapped to `target`).
That would mean we'd need to load the dataset's metadata from the Hub, as that information is not in the `Dataset` object itself, right? That means downloading the `README.md` from the Hub and then using `huggingface_hub.load_metadata`. We could pass just the dataset name to `Evaluator` and do the loading and parsing under the hood, but that would not work so well for custom datasets.

Alternatively, we could be very transparent about what the inputs should be. Maybe we can do something as follows:
```python
evaluator = Evaluator('text-classification')
print(evaluator)
>>> Expected data format:
>>> name: "inputs", type: str, description: "input texts"
>>> name: "labels", type: Union[string, int], description: labels

evaluator.compute(pipe, ds, metric, label_map={"NEGATIVE": 0, "POSITIVE": 1})
```
My feeling is that this feature will frequently be used on custom/local/in-house datasets anyway, where we don't have access to meta information. For datasets on the Hub we could also provide a helper function (e.g. `infer_input_map(dataset_name)`) that does the steps mentioned above, such that:

```python
ds = ds.rename_columns(evaluate.infer_input_map("imdb"))
```
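A rough sketch of what such a hypothetical `infer_input_map` helper could do, assuming it reads the `col_mapping` from the `train-eval-index` card metadata (structure and AutoTrain names `text`/`target` are assumptions here) and inverts it into the column names the `Evaluator` draft above expects:

```python
import requests

def infer_input_map(dataset_name):
    # Hypothetical helper: fetch the dataset card metadata from the Hub API.
    data = requests.get(f"https://huggingface.co/api/datasets/{dataset_name}").json()
    # Assumed structure: col_mapping maps dataset columns to AutoTrain names.
    col_mapping = data["cardData"]["train-eval-index"][0]["col_mapping"]
    # Translate the AutoTrain names into the Evaluator's expected columns.
    canonical = {"text": "inputs", "target": "references"}
    return {column: canonical[name] for column, name in col_mapping.items()}
```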
Good point about custom datasets and I like your idea about explicitly showing the expected inputs / outputs in the `repr`!
For Hub datasets, you don't have to download any files as you can ping the `datasets` API directly, e.g.
```python
import requests
from typing import Dict, Union

def get_metadata(dataset_name: str) -> Union[Dict, None]:
    data = requests.get(f"https://huggingface.co/api/datasets/{dataset_name}").json()
    if data["cardData"] is not None and "train-eval-index" in data["cardData"].keys():
        return data["cardData"]["train-eval-index"]
    else:
        return None

metadata = get_metadata("imdb")
```
The only "problem" is that we've defined our column mappings to align with AutoTrain, and that taxonomy might not be as convenient / flexible for what you're trying to do.
> `data = requests.get(f"https://huggingface.co/api/datasets/{dataset_name}").json()`

or just `huggingface_hub.dataset_info(dataset_name)`
> or just `huggingface_hub.dataset_info(dataset_name)`
I did not know this trick either 🤯 !
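For reference, a minimal sketch of that trick (the exact attribute name for the card metadata varies across `huggingface_hub` versions, so this hedges with `getattr`):

```python
from huggingface_hub import dataset_info

info = dataset_info("imdb")  # a single API call, no file download
# The parsed YAML header of the dataset card is exposed as `cardData` in older
# versions and `card_data` in newer ones.
card = getattr(info, "card_data", None) or getattr(info, "cardData", None)
print(card)
```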
Hello, I would be a user of a similar feature in https://github.com/huggingface/autoquantize (an API to launch evaluation of Optimum quantized models vs. the transformers baseline). For the moment I have written my own evaluation scripts using pipelines.

Just wanted to point out that using pipelines for evaluation does not work out of the box, see for example https://github.com/huggingface/transformers/issues/17305 and https://github.com/huggingface/transformers/issues/17139. At least I haven't found a way to make it work in a task-independent way. I would be glad to discuss if somebody is working on this.
My approach is the following: https://github.com/fxmarty/optimum/tree/runs-only/optimum/utils/preprocessing
See as well https://github.com/huggingface/optimum/pull/194 .
ping @mfuntowicz as well
Similar to the `Trainer` class in `transformers`, it would be nice to easily evaluate a model on a dataset given a metric. We could use the `Trainer` but it comes with a lot of unused extra stuff and is `transformers`-centric. Alternatively, we could build an `Evaluator` as follows:
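(The code snippet from the original issue is not preserved in this export; the following is only a rough, hypothetical sketch of the idea, built around the `dataset_mapping` argument described just below.)

```python
# Hypothetical sketch, not the original snippet from the issue.
class Evaluator:
    def __init__(self, pipe, dataset, metric, dataset_mapping):
        self.pipe = pipe
        self.dataset = dataset
        self.metric = metric
        # e.g. {"inputs": "text", "references": "label"}: which dataset column
        # feeds the pipeline and which one feeds the metric.
        self.dataset_mapping = dataset_mapping

    def compute(self):
        # Label post-processing (e.g. a label_map) is omitted for brevity.
        predictions = self.pipe(self.dataset[self.dataset_mapping["inputs"]])
        references = self.dataset[self.dataset_mapping["references"]]
        return self.metric.compute(predictions=predictions, references=references)
```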
The `dataset_mapping` maps the `dataset` columns to inputs for the model and metric. Using the `pipeline` API as the standard for the `Evaluator`, this could easily be extended to any other framework. The user would just need to set up a `pipeline` class, the main requirement being that inputs and outputs follow the same format and that the class implements a `__call__` method.

The advantage of starting with the `pipeline` API is that in `transformers` it already implements a lot of quality-of-life functionality such as batching and GPU support. It also abstracts away the pre/post-processing.

In #16 it is mentioned that statistical significance testing would be a desired feature. The above example could be extended to enable this:
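(The original snippet is again not preserved; a minimal, hypothetical illustration of what the caller's side of such an extension could look like:)

```python
# Hypothetical: n_runs and the aggregated return format are assumptions,
# not an agreed-upon API.
results = evaluator.compute(n_runs=10)
# results would then contain, per metric, an aggregate such as a mean and
# standard deviation over the 10 runs.
```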
Where under the hood the random seed is changed between the runs.
cc @douwekiela @osanseviero @nrajani @lhoestq