Current state

For LLM evaluation, the user specifies an `LLMMetric` config in a YAML file, possibly implements/extends an existing LLM metric, and updates the `LLMMetric` factory in `factgenie/evaluate.py`. Note that all metrics defined by the `llm-eval/*.yaml` configs are loaded and offered. So most of the LLM annotation campaign is defined in code, with the single exception of `error_categories`, which must be entered via a web browser dialog and must match the categories specified in the YAML config for the LLM prompt.
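For illustration, a hypothetical `llm-eval/*.yaml` metric config might look roughly like the sketch below. The field names (`type`, `model`, `prompt_template`) are assumptions for this example, not the actual factgenie schema; the point is that the error categories currently live only inside the prompt text.

```yaml
# Hypothetical llm-eval/my_metric.yaml -- field names are illustrative assumptions,
# not the actual factgenie schema.
type: openai_metric        # which LLMMetric subclass the factory in factgenie/evaluate.py should build
model: gpt-4o
prompt_template: |
  Annotate factual errors in the following text.
  Use exactly these categories: Incorrect number, Incorrect entity, Contradiction, Other.
  Return the annotated spans as JSON.
# Note: the categories above exist only inside the prompt string, so the same list
# must be re-entered (and kept in sync) in the web browser dialog.
```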
Proposal
[ ] Rename `error_categories` to `annotation_span_categories`
[ ] Allow specifying the `annotation_span_categories` in the YAML metric configs in `llm-eval/your_metric.yaml` (see the sketch after this list)
[ ] Allow creating a human evaluation campaign based on an existing llm-eval campaign, loading the same annotation span categories from there
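As a sketch of the second item, the proposed `annotation_span_categories` could be declared directly in the metric config so that both the LLM prompt and the web UI read them from one place. The exact schema shown here (e.g. `name`/`color` pairs) is an assumption, not a settled design.

```yaml
# Hypothetical llm-eval/your_metric.yaml after the proposal -- the
# annotation_span_categories schema below is an assumption.
type: openai_metric
model: gpt-4o
annotation_span_categories:
  - name: Incorrect number
    color: "#e74c3c"
  - name: Incorrect entity
    color: "#8e44ad"
  - name: Other
    color: "#95a5a6"
prompt_template: |
  Annotate factual errors in the following text, using the categories listed above.
# With the categories defined once in the config, the web dialog would no longer
# need a manually duplicated list that must match the prompt.
```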