Define the format of input for all metrics

gomate-community / rageval

Evaluation tools for Retrieval-augmented Generation (RAG) methods.

Apache License 2.0


Abstract the dataset by defining a new data structure, DataPack. Intuitively, a DataPack consists of five parts: question, answer, gt_answer, contexts, and gt_contexts. For now, search_query and gt_search_query are left as future work.

Examples:

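        >>> import pandas as pd
        >>> question = [
        ...     ['qid1', 'question 1'],
        ...     ['qid2', 'question 2']
        ... ]
        >>> answer = [
        ...     ['aid1', 'answer 1'],
        ...     ['aid2', 'answer 2']
        ... ]
        >>> question = pd.DataFrame(question)
        >>> answer = pd.DataFrame(answer)
        >>> # gt_answer, contexts, and gt_contexts are assumed to be built the same way (omitted here)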
        >>> dp = DataPack(
        ...     question=question,
        ...     answer=answer,
        ...     gt_answer=gt_answer,
        ...     contexts=contexts,
        ...     gt_contexts=gt_contexts,
        ... )
        >>> len(dp)
        2

This way, we can add many in-place functions for processing a DataPack. Some basic usage is as follows:

        >>> import rageval as rl
        >>> data_pack = rl.datasets.toy.load_data()
        >>> data_pack.apply_on_question(preprocess_func)
        >>> data_pack.drop_label(inplace=True)
        >>> data_pack.has_label
        False
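
To make the proposal concrete, here is a minimal sketch of what such a DataPack could look like. It is not rageval's actual implementation: the class body is hypothetical, plain lists of [id, text] pairs stand in for the pandas DataFrames to keep the sketch dependency-free, and the meanings of has_label (any ground-truth part present) and apply_on_question (map a function over question texts) are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical stand-in for the proposed DataPack; not rageval's actual class.
# Each part is a list of [id, text] pairs instead of a pandas DataFrame.
Rows = List[List[str]]


@dataclass
class DataPack:
    question: Rows
    answer: Rows
    gt_answer: Optional[Rows] = None      # ground-truth part, may be absent
    contexts: Optional[Rows] = None
    gt_contexts: Optional[Rows] = None    # ground-truth part, may be absent

    def __len__(self) -> int:
        # Assumption: the size of a pack is its number of questions.
        return len(self.question)

    @property
    def has_label(self) -> bool:
        # Assumption: a pack is "labeled" when any ground-truth part is present.
        return self.gt_answer is not None or self.gt_contexts is not None

    def apply_on_question(self, func: Callable[[str], str]) -> None:
        # In place: rewrite the text of every question and keep the ids.
        self.question = [[qid, func(text)] for qid, text in self.question]

    def drop_label(self, inplace: bool = False) -> Optional["DataPack"]:
        # Remove the ground-truth parts, either in place or on a copy.
        if inplace:
            self.gt_answer = None
            self.gt_contexts = None
            return None
        return DataPack(
            question=list(self.question),
            answer=list(self.answer),
            contexts=self.contexts,
        )


dp = DataPack(
    question=[["qid1", "Question 1"], ["qid2", "Question 2"]],
    answer=[["aid1", "answer 1"], ["aid2", "answer 2"]],
    gt_answer=[["aid1", "gold 1"], ["aid2", "gold 2"]],
)
dp.apply_on_question(str.lower)   # preprocess every question in place
dp.drop_label(inplace=True)       # strip ground-truth parts in place
print(len(dp), dp.has_label, dp.question[0])
```

Keeping the mutating methods on the container means each metric can assume one input type and one set of preprocessing hooks.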

        >>> from datasets import Dataset
        >>> data = {
        ...     "questions": ["what is snoopy", "where is beijing"],
        ...     "contexts": ["snoopy one", "snoopy two"],
        ...     "gt_contexts": [
        ...         {"id": 1, "text": "snoopy 1", "label": 1},
        ...         {"id": 2, "text": "snoopy 2", "label": 2}
        ...     ],
        ...     "answers": ["a1", "a2"],
        ...     "gt_answers": ["a11", "aa2"]
        ... }
        >>> dataset = Dataset.from_dict(data)
        >>> len(dataset)
        2
        >>> dataset
        Dataset({
            features: ['questions', 'contexts', 'gt_contexts', 'answers', 'gt_answers'],
            num_rows: 2
        })
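
If the dictionary format above were adopted as the common input, a small check run before any metric could catch malformed inputs early. The following is only a sketch: validate_metric_input and REQUIRED_COLUMNS are hypothetical names, and treating all five columns as required is an assumption rather than something the proposal states.

```python
from typing import Dict, List

# Column names copied from the example above; treating all five as required
# is an assumption of this sketch.
REQUIRED_COLUMNS = ["questions", "contexts", "gt_contexts", "answers", "gt_answers"]


def validate_metric_input(data: Dict[str, List]) -> int:
    """Return the number of rows, or raise ValueError if the dict is malformed."""
    missing = [name for name in REQUIRED_COLUMNS if name not in data]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    # Every column must describe the same number of examples.
    lengths = {name: len(data[name]) for name in REQUIRED_COLUMNS}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"columns differ in length: {lengths}")
    return lengths[REQUIRED_COLUMNS[0]]


data = {
    "questions": ["what is snoopy", "where is beijing"],
    "contexts": ["snoopy one", "snoopy two"],
    "gt_contexts": [
        {"id": 1, "text": "snoopy 1", "label": 1},
        {"id": 2, "text": "snoopy 2", "label": 2},
    ],
    "answers": ["a1", "a2"],
    "gt_answers": ["a11", "aa2"],
}
num_rows = validate_metric_input(data)
print(num_rows)
```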
