gomate-community / rageval

Evaluation tools for Retrieval-augmented Generation (RAG) methods.
Apache License 2.0

Define the format of input for all metrics #46

Closed faneshion closed 4 months ago

faneshion commented 4 months ago

Abstract the dataset by defining a new data structure, DataPack. Intuitively, a DataPack consists of five parts: question, answer, gt_answer, contexts, and gt_contexts. For now, we leave search_query and gt_search_query as future work.

Examples:

        >>> import pandas as pd
        >>> question = [
        ...     ['qid1', 'question 1'],
        ...     ['qid2', 'question 2']
        ... ]
        >>> answer = [
        ...     ['aid1', 'answer 1'],
        ...     ['aid2', 'answer 2']
        ... ]
        >>> question = pd.DataFrame(question)
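        >>> answer = pd.DataFrame(answer)
        >>> # gt_answer, contexts, and gt_contexts are assumed to be
        >>> # DataFrames built in the same way as question and answer.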
        >>> dp = DataPack(
        ...     question=question,
        ...     answer=answer,
        ...     gt_answer=gt_answer,
        ...     contexts=contexts,
        ...     gt_contexts=gt_contexts,
        ... )
        >>> len(dp)
        2

This way, we can add many in-place functions for processing a DataPack. Some basic usage is as follows:

        >>> import rageval as rl
        >>> data_pack = rl.datasets.toy.load_data()
        >>> # preprocess_func is assumed to be any user-defined callable.
        >>> data_pack.apply_on_question(preprocess_func)
        >>> data_pack.drop_label(inplace=True)
        >>> data_pack.has_label
        False
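
To make the in-place functions above more concrete, here is a minimal sketch of what such a DataPack could look like. It is hypothetical and not the rageval implementation: it stores each part as plain [id, text] rows instead of pandas DataFrames to stay dependency-free, and it assumes that "label" means the ground-truth parts (gt_answer and gt_contexts).

```python
from typing import Callable, List, Optional

Rows = List[List[str]]  # each row is [id, text], as in the example above


class DataPack:
    """Hypothetical sketch of the proposed DataPack (not the rageval code)."""

    def __init__(
        self,
        question: Rows,
        answer: Rows,
        gt_answer: Optional[Rows] = None,
        contexts: Optional[Rows] = None,
        gt_contexts: Optional[Rows] = None,
    ) -> None:
        self.question = question
        self.answer = answer
        self.gt_answer = gt_answer
        self.contexts = contexts
        self.gt_contexts = gt_contexts

    def __len__(self) -> int:
        # The pack has one entry per question.
        return len(self.question)

    @property
    def has_label(self) -> bool:
        # Assumption: a pack is "labelled" if any ground-truth part is present.
        return self.gt_answer is not None or self.gt_contexts is not None

    def apply_on_question(self, func: Callable[[str], str]) -> None:
        # Rewrite the text of each question in place; ids are left untouched.
        for row in self.question:
            row[1] = func(row[1])

    def drop_label(self, inplace: bool = False) -> Optional["DataPack"]:
        # Remove the ground-truth parts, in place or on a new pack.
        if inplace:
            self.gt_answer = None
            self.gt_contexts = None
            return None
        return DataPack(self.question, self.answer, contexts=self.contexts)


dp = DataPack(
    question=[["qid1", "Question 1"], ["qid2", "Question 2"]],
    answer=[["aid1", "answer 1"], ["aid2", "answer 2"]],
    gt_answer=[["aid1", "gold 1"], ["aid2", "gold 2"]],
)
dp.apply_on_question(str.lower)  # lower-case every question text
dp.drop_label(inplace=True)      # afterwards dp.has_label is False
```

Making has_label a property matches the usage above, where it is read without parentheses.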

faneshion commented 4 months ago

The Dataset class from the Hugging Face datasets library is a good choice to replace DataPack.

URL: https://github.com/huggingface/datasets

For now, we require each metric to implement the API: def compute(self, dataset: Dataset) -> (score, Dataset). The column names of the dataset object should be drawn from ["questions", "contexts", "gt_contexts", "answers", "gt_answers"]. An example dataset is as follows:

>>> from datasets import Dataset
>>> data = {
...     "questions": ["what is snoopy", "where is beijing"],
...     "contexts": ["snoopy one", "snoopy two"],
...     "gt_contexts": [{"id": 1, "text": "snoopy 1", "label": 1}, {"id": 2, "text": "snoopy 2", "label": 2}],
...     "answers": ["a1", "a2"],
...     "gt_answers": ["a11", "aa2"]
... }
>>> dataset = Dataset.from_dict(data)
>>> len(dataset)
2
>>> dataset
Dataset({
    features: ['questions', 'contexts', 'gt_contexts', 'answers', 'gt_answers'],
    num_rows: 2
})
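
As an illustration of the compute API described above, here is a minimal sketch of a metric. The ExactMatch class, its name attribute, and the extra score column are hypothetical and not part of rageval. To keep the sketch dependency-free, it uses a plain dict of columns (the same shape that is passed to Dataset.from_dict) as a stand-in for the Dataset object.

```python
from typing import Dict, List, Tuple

# Stand-in for a Dataset: column name -> list of row values.
Columns = Dict[str, List]


class ExactMatch:
    """Hypothetical metric: fraction of answers equal to their ground truth."""

    name = "exact_match"

    def compute(self, dataset: Columns) -> Tuple[float, Columns]:
        # Score each row: 1.0 when the answer matches the ground truth exactly.
        row_scores = [
            float(answer == gt_answer)
            for answer, gt_answer in zip(dataset["answers"], dataset["gt_answers"])
        ]
        # Aggregate score over the whole dataset (0.0 for an empty dataset).
        score = sum(row_scores) / len(row_scores) if row_scores else 0.0
        # Return a new mapping with the per-row scores as an extra column.
        scored = dict(dataset)
        scored[self.name] = row_scores
        return score, scored


data = {
    "questions": ["what is snoopy", "where is beijing"],
    "answers": ["a1", "a2"],
    "gt_answers": ["a11", "a2"],
}
score, scored = ExactMatch().compute(data)
print(score)  # 0.5
```

With a real Dataset, the columns can be read by name in the same way, and the per-row scores could be attached with Dataset.add_column instead of copying a dict.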

It is worth noting that each column can be extended to more complex data structures.
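
For example, a column could hold a list of values per row. The sketch below shows one possible layout, together with a hypothetical check_columns helper (not part of rageval) that validates the column names against the list above and checks that all columns have the same number of rows. The nested layout is an assumption for illustration, not a format the project has settled on.

```python
from typing import Dict, List

ALLOWED_COLUMNS = {"questions", "contexts", "gt_contexts", "answers", "gt_answers"}


def check_columns(data: Dict[str, List]) -> int:
    """Validate column names and row counts; return the number of rows."""
    unknown = set(data) - ALLOWED_COLUMNS
    if unknown:
        raise ValueError(f"unexpected columns: {sorted(unknown)}")
    lengths = {len(values) for values in data.values()}
    if len(lengths) > 1:
        raise ValueError("all columns must have the same number of rows")
    return lengths.pop() if lengths else 0


data = {
    "questions": ["what is snoopy", "where is beijing"],
    # Several retrieved passages per question instead of a single string.
    "contexts": [["snoopy one", "snoopy two"], ["beijing one"]],
    "answers": ["a1", "a2"],
    # Several acceptable ground-truth answers per question.
    "gt_answers": [["a11", "a12"], ["aa2"]],
}
num_rows = check_columns(data)
```

A dict like this should also be accepted by Dataset.from_dict, which infers nested list features from the values.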