@Camille7777 @chengeharrison
Please have a look at the technical design of the new evaluation module. If you have any questions or suggestions, please let me know.
FYI @ver217
@Camille7777 @chengeharrison
Please merge your code into this development branch first: https://github.com/hpcaitech/ColossalAI/tree/dev/evaluation
Describe the feature
Technical Design of the Evaluation Module
Data Format
Questions
Each question record should have the following fields:
- `id` (int, compulsory): The ID of the instruction.
- `instruction` (str, compulsory): The instruction for the LLM.
- `category` (str, compulsory): The category of the instruction.
- `input` (str, optional): The additional context of the instruction.
- `output` (str, optional): The sample output of the instruction (default: GPT-3.5).
- `target` (str, optional): The target answer for the instruction.

Note: if the `input` has a gold standard, the `output` can be empty. Otherwise, we generate answers from GPT-3.5 as the `output`, and the `target` field is empty. While evaluating the performance, if the `target` is empty, use the value from `output`.
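A question record might then look like the sketch below. This is illustrative only: the field names come from the list above, but the concrete values, the file name, and the assumption that the question file is one JSON list (like the answer file) are mine.

```python
import json

# Illustrative question record -- all values are made up for demonstration.
# This instruction has a gold standard, so `target` is set and `output` is empty.
question = {
    "id": 1,                                 # int, compulsory
    "instruction": "Translate the sentence into French.",  # str, compulsory
    "category": "translation",               # str, compulsory
    "input": "Good morning!",                # str, optional: additional context
    "output": "",                            # str, optional: sample output (default: GPT-3.5)
    "target": "Bonjour !",                   # str, optional: gold-standard answer
}

# Assumption: the question file is one JSON list of such records.
with open("questions.json", "w", encoding="utf-8") as f:
    json.dump([question], f, ensure_ascii=False, indent=2)
```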
Answers
The JSON file contains one list. Each element in the list is an answer record to one question. An answer record has the following fields:
- `id` (int, compulsory): The ID of the instruction.
- `instruction` (str, compulsory): The instruction for the LLM.
- `category` (str, compulsory): The category of the instruction.
- `input` (str, optional): The additional context of the instruction.
- `output` (str, compulsory): The output from the LLM.
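A matching answer record, again with made-up values, could look like this:

```python
# Illustrative answer record -- values are made up; the answer file itself
# is one JSON list of such records.
answer = {
    "id": 1,                                 # int, compulsory: matches the question ID
    "instruction": "Translate the sentence into French.",  # str, compulsory
    "category": "translation",               # str, compulsory
    "input": "Good morning!",                # str, optional
    "output": "Bonjour !",                   # str, compulsory: the LLM's answer
}
```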
Evaluation
Configuration
We assume that all answers are generated and saved following the internal data structure.
Configuration file for the evaluator module: `config_eval.json`. This file controls how we evaluate the performance of the model.
The value for `GPT-3.5` and `GPT-4` can be an empty list, and the value for `Metrics` can also be empty. For example, for classification tasks, you only need to put `Precision`, `Recall` and `F1 score`. We support `eng` and `ch` now.
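For concreteness, a configuration might look like the sketch below. Only the key names `GPT-3.5`, `GPT-4`, `Metrics` and the language codes `eng`/`ch` come from this design; the `language` key, the flat layout, and the list contents are assumptions.

```python
# A sketch of config_eval.json (shown as a Python dict); the exact schema is
# not fixed by this design note, so treat the key placement as an assumption.
config_eval = {
    "language": "eng",                     # supported languages: "eng" or "ch"
    "GPT-3.5": [],                         # categories judged by GPT-3.5; may be empty
    "GPT-4": [],                           # categories judged by GPT-4; may be empty
    "Metrics": ["Precision", "Recall", "F1 score"],  # e.g. for classification tasks
}
```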
Code Architecture
- `evaluator.py`: Main class for the evaluator. Results will be saved as a JSON file. Please save all files in a separate folder.
- `metrics.py`: The function file that contains all metrics. One function defines one metric.
- `eval.py`: Driver function that initialises the evaluator.

If two answer files are provided, we should use `battle`; otherwise, we will call `evaluate`.
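A minimal sketch of how `eval.py` might dispatch between the two modes. The `Evaluator` stub, its method signatures, and the file names are hypothetical; only the `battle`/`evaluate` split comes from this design.

```python
import json

# Hypothetical stand-in for the main class in evaluator.py; the real
# interface is not fixed by this design note.
class Evaluator:
    def __init__(self, config):
        self.config = config

    def battle(self, answers_a, answers_b):
        print(f"Battle: {len(answers_a)} vs {len(answers_b)} answer records")

    def evaluate(self, answers):
        print(f"Evaluate: scoring {len(answers)} answer records")

def main(config_path, answer_files):
    with open(config_path, encoding="utf-8") as f:
        evaluator = Evaluator(json.load(f))

    # Each answer file is one JSON list of answer records.
    answers = []
    for path in answer_files:
        with open(path, encoding="utf-8") as f:
            answers.append(json.load(f))

    if len(answers) == 2:
        evaluator.battle(answers[0], answers[1])  # two files: head-to-head battle
    else:
        evaluator.evaluate(answers[0])            # one file: metric-based evaluation

if __name__ == "__main__":
    main("config_eval.json", ["answers_model_a.json"])  # hypothetical file names
```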
Existing functions for generating answers can be moved to a separate folder. Please see below for the folder structure: