@FBzzh @yuanpcr you can list all potential metrics for the `validate` task in this issue. For more details about the `validate` task, you can refer to issue #13.
Here are some metrics related to answer groundedness.
- [ ] Knowledge F1. A lexical overlap metric used for knowledge-grounded dialogue, which computes the token-level F1 score between gold passages and model responses.
- [ ] Knowledge F1++. A variant of Knowledge F1 that discounts tokens from the user question or the conversation history when scoring the model response.
- [ ] Faithfulness (RAGAS). Uses an LLM to extract the statements in the model response, and then determines whether those statements can be inferred from the given contexts.
- [ ] FActScore. An LLM-based method that breaks the generated text down into a series of atomic facts, and then evaluates whether each fact is supported by the knowledge source.
- [ ] QUIP-Score. An n-gram overlap measure that quantifies the degree to which a generated passage consists of exact spans found in a text corpus.
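For the lexical metrics, a minimal sketch of the core computation may help frame the discussion. The snippet below implements token-level F1 in the spirit of Knowledge F1, assuming simple lowercased whitespace tokenization (real implementations usually also normalize punctuation and may strip stopwords); the function name and signature are just an illustration, not an API we have decided on.

```python
from collections import Counter

def knowledge_f1(response: str, gold_passage: str) -> float:
    """Token-level F1 between a model response and a gold passage.

    Sketch only: uses lowercased whitespace tokenization; a real
    implementation would normalize punctuation and possibly stopwords.
    """
    resp_tokens = response.lower().split()
    gold_tokens = gold_passage.lower().split()
    if not resp_tokens or not gold_tokens:
        return 0.0
    # Multiset intersection counts shared tokens with multiplicity.
    common = Counter(resp_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(resp_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

The Knowledge F1++ variant would differ only in first removing tokens that also occur in the user question or conversation history before computing the same F1.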