Closed: faneshion closed this issue 4 months ago
The dataset from huggingface is a good choice to replace the DataPack.
URL: https://github.com/huggingface/datasets
Currently, we require each metric to implement the API: compute(self, dataset: Dataset) -> (score, Dataset). The column names of the dataset object should be among ["questions", "contexts", "gt_contexts", "answers", "gt_answers"]. An example of such a dataset is as follows:
>>> from datasets import Dataset
>>> data = {
...     "questions": ["what is snoopy", "where is beijing"],
...     "contexts": ["snoopy one", "snoopy two"],
...     "gt_contexts": [{"id": 1, "text": "snoopy 1", "label": 1}, {"id": 2, "text": "snoopy 2", "label": 2}],
...     "answers": ["a1", "a2"],
...     "gt_answers": ["a11", "aa2"]
... }
>>> dataset = Dataset.from_dict(data)
>>> len(dataset)
2
>>> dataset
Dataset({
    features: ['questions', 'contexts', 'gt_contexts', 'answers', 'gt_answers'],
    num_rows: 2
})
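For illustration, a metric conforming to the proposed compute(self, dataset) -> (score, Dataset) API might look like the following minimal sketch. The ExactMatch class and its scoring rule are assumptions for demonstration, not part of the actual codebase:

```python
# Hypothetical metric sketch; "ExactMatch" is an assumed name, not a
# metric from the library.
class ExactMatch:
    def compute(self, dataset):
        # The dataset is expected to expose the "answers" and
        # "gt_answers" columns; datasets.Dataset supports this
        # column-style indexing, as does a plain dict of lists.
        matches = [a == g for a, g in zip(dataset["answers"], dataset["gt_answers"])]
        score = sum(matches) / len(matches)
        # Return the score together with the (possibly augmented) dataset.
        return score, dataset
```

Because the metric only reads columns by name, it works on any mapping-like object with the required keys, which also keeps it easy to unit-test.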
It is worth noting that each column can be extended to hold more complicated data structures.
Alternatively, we can abstract the dataset by defining a new data structure, DataPack. Intuitively, a DataPack consists of five parts: question, answer, gt_answer, contexts, and gt_contexts. Currently, we leave search_query and gt_search_query as future work.

In this way, we can add many in-place functions to process a DataPack. Some basic usage is as follows:
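As a rough illustration only (the class layout and method names such as apply_on_answers are assumptions, not the actual DataPack API), such a structure with in-place processing functions could be sketched as:

```python
# Hypothetical DataPack sketch; field and method names are assumptions.
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class DataPack:
    # The five parts described above; search_query / gt_search_query
    # are deferred as future work.
    questions: List[str] = field(default_factory=list)
    contexts: List[str] = field(default_factory=list)
    gt_contexts: List[Dict] = field(default_factory=list)
    answers: List[str] = field(default_factory=list)
    gt_answers: List[str] = field(default_factory=list)

    def apply_on_answers(self, fn: Callable[[str], str]) -> "DataPack":
        # In-place transformation of the answers column; returns self
        # so that calls can be chained.
        self.answers = [fn(a) for a in self.answers]
        return self
```

Usage would then be along the lines of DataPack(questions=["what is snoopy"], answers=["  a1  "]).apply_on_answers(str.strip), mirroring the chained, in-place style that datasets.Dataset.map offers.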