
[FEATURE]: Technical Design of the Evaluation Module #3714

Closed TongLi3701 closed 9 months ago

TongLi3701 commented 1 year ago

Describe the feature

Technical Design of the Evaluation Module

Data Format

Questions

Each question record should have the following fields:

Note: if the input has a gold-standard answer, the output field can be empty and the gold standard goes in target. Otherwise, we generate answers from GPT-3.5 as the output, and the target field is left empty.

While evaluating the performance, if target is empty, use the value from output instead.
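
For illustration, a question record could look like the sketch below. Only output and target are named explicitly above; the other field names (id, category, instruction, input) are assumptions, not a fixed schema:

{
    "id": 0,
    "category": "Open QA",
    "instruction": "...",
    "input": "...",
    "output": "",
    "target": "..."
}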

Answers

The JSON file contains one list. Each element in the list is an answer record for one question. An answer record has the following fields:
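
Similarly, an answer file could be a single JSON list of records like the following, where the field names are again illustrative assumptions rather than the final schema:

[
    {
        "id": 0,
        "category": "Open QA",
        "instruction": "...",
        "input": "...",
        "output": "the model's generated answer",
        "target": "..."
    }
]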

Evaluation

Configuration

We assume that all answers are generated and saved following the internal data structure.

Configuration file for the evaluator module: config_eval.json.

This file controls how we evaluate the performance of the model.

{
    "language": "eng",
    "category": {
        "role play": {
            "GPT-3.5": ["fluency", "coherence", "consistency", "relevance"],
            "GPT-4": ["fluency", "coherence", "consistency", "relevance"],
            "Metrics": ["BLEU", "ROUGE", "F1 score", "Distinct", "MAUVE"]
        },
        "Multi-turn conversation": {
            "GPT-3.5": ["fluency", "coherence", "consistency", "relevance"],
            "GPT-4": ["fluency", "coherence", "consistency", "relevance"],
            "Metrics": ["BLEU", "ROUGE", "F1 score", "Distinct", "MAUVE"]  
        },
        "Open QA": {
            "GPT-3.5": ["fluency", "coherence", "consistency", "relevance"],
            "GPT-4": ["fluency", "coherence", "consistency", "relevance"],
            "Metrics": ["BLEU", "ROUGE", "F1 score", "Distinct", "MAUVE"]
        }
    }
}

The value for GPT-3.5 and GPT-4 can be an empty list, and the value for Metrics can also be empty. For example, for classification tasks, you only need to put Precision, Recall, and F1 score.
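
For instance, a classification category could be configured with empty reviewer lists and only the classification metrics (the category name below is illustrative):

"classification": {
    "GPT-3.5": [],
    "GPT-4": [],
    "Metrics": ["Precision", "Recall", "F1 score"]
}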

We currently support eng and ch.

Code Architecture

evaluator.py: Main class for the evaluator.

from typing import Dict


class Evaluator(object):
    def __init__(self, params: Dict) -> None:
        self.params = params
        self.stats = dict()

    def battle(self, answers1: Dict, answers2: Dict) -> None:
        """
        Comparison between two models using GPT-4 as the reviewer.
        """
        pass

    def evaluate(self, answers: Dict) -> None:
        """
        A comprehensive evaluation of the answers from the model.
        The function evaluates the model's performance from different perspectives
        using GPT-3.5, GPT-4, and off-the-shelf evaluation metrics.

        The metrics are decided by the config file.
        """
        pass

    def save(self, path: str) -> None:
        """
        Save the evaluation results to the given path as a JSON file.
        """
        pass

Results will be saved as a JSON file. Please save all files in a separate folder.
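
A minimal usage sketch of the class, assuming the config and answer files described above (all paths below are illustrative placeholders):

import json

from evaluator import Evaluator

# Load the evaluation config and the generated answers (placeholder paths).
with open("config_eval.json") as f:
    config = json.load(f)
with open("answers.json") as f:
    answers = json.load(f)

evaluator = Evaluator(config)
evaluator.evaluate(answers)
evaluator.save("results/evaluation.json")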

metrics.py: the file that contains all metric functions. One function defines one metric.

from typing import Dict, List


def rouge_score(preds: List, target: List) -> Dict:
    rouge_scores = {"rouge1": 0, "rouge2": 0, "rougeL": 0}

    # calculate scores

    return rouge_scores
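
One possible way to fill this in is sketched below; using the rouge-score package from PyPI as the backend and averaging F-measures over all prediction/target pairs are assumptions, not requirements of this design:

from typing import Dict, List

from rouge_score import rouge_scorer  # assumed backend: the `rouge-score` PyPI package


def rouge_score(preds: List[str], target: List[str]) -> Dict[str, float]:
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    totals = {"rouge1": 0.0, "rouge2": 0.0, "rougeL": 0.0}

    # Accumulate the F-measure of each ROUGE variant for every (prediction, target) pair.
    for pred, tgt in zip(preds, target):
        scores = scorer.score(tgt, pred)
        for key in totals:
            totals[key] += scores[key].fmeasure

    # Report the average over all pairs (0 if the input lists are empty).
    n = max(len(preds), 1)
    return {key: value / n for key, value in totals.items()}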

eval.py: the driver script that initialises the evaluator.

import argparse

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    # load config
    # initialize evaluator

If two answer files are provided, we should use battle; otherwise, we will call evaluate.
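
A rough sketch of that dispatch logic; the argument names below are hypothetical placeholders, not the final CLI:

import argparse
import json

from evaluator import Evaluator


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    # Hypothetical arguments; the real CLI may differ.
    parser.add_argument("--config", default="config_eval.json")
    parser.add_argument("--answer-file", required=True)
    parser.add_argument("--answer-file-2", default=None)
    parser.add_argument("--save-path", default="results/evaluation.json")
    args = parser.parse_args()

    with open(args.config) as f:
        config = json.load(f)
    evaluator = Evaluator(config)

    with open(args.answer_file) as f:
        answers1 = json.load(f)

    if args.answer_file_2:
        # Two answer files: pairwise comparison with GPT-4 as the reviewer.
        with open(args.answer_file_2) as f:
            answers2 = json.load(f)
        evaluator.battle(answers1, answers2)
    else:
        # Single answer file: comprehensive evaluation driven by the config.
        evaluator.evaluate(answers1)

    evaluator.save(args.save_path)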

Existing functions for generating answers can be moved to a separate folder. Please see below for the folder structure:

eval
   - eval.py
   - metrics.py
   - gpt_evaluate.py
   - evaluator.py
   - utils.py
   - results
   - generate_answers
      - generate_gpt35_answers.py
      - ...
TongLi3701 commented 1 year ago

@Camille7777 @chengeharrison

Please have a look at the technical design of the new evaluation module. If you have any questions or suggestions, please let me know.

FYI @ver217

TongLi3701 commented 1 year ago

@Camille7777 @chengeharrison

Please merge your code into this development branch first: https://github.com/hpcaitech/ColossalAI/tree/dev/evaluation