Open ogencoglu opened 5 months ago
Indeed, that's an insightful observation. To make the metric better at general reasoning, I have extended the evaluation script. You can find the update here: https://github.com/sundi133/rag-eval/blob/main/src/evaluation.py#L84. The change uses a large language model (LLM) for evaluation. Let me know your thoughts on this approach.
rouge is truly a flawed and limited metric, as it simply compares n-grams. It cannot capture the actual semantics. Any comments on that?
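
To make the n-gram point concrete, here is a minimal sketch of a simplified ROUGE-N (F1 over clipped n-gram overlap, whitespace tokenization, no stemming). It is not the code from this repo or from any ROUGE library; the function name `rouge_n` and the example sentences are made up for illustration.

```python
from collections import Counter


def rouge_n(reference: str, candidate: str, n: int = 1) -> float:
    """Simplified ROUGE-N: F1 over clipped n-gram overlap (illustrative only)."""

    def ngrams(text: str) -> Counter:
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    ref, cand = ngrams(reference), ngrams(candidate)
    # Counter intersection keeps the minimum count per n-gram (clipping).
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical examples: a paraphrase and a contradiction of the reference.
reference = "the medication lowers blood pressure"
paraphrase = "this drug reduces hypertension"
contradiction = "the medication raises blood pressure"

paraphrase_score = rouge_n(reference, paraphrase)
contradiction_score = rouge_n(reference, contradiction)

print(f"paraphrase:    {paraphrase_score:.2f}")     # 0.00
print(f"contradiction: {contradiction_score:.2f}")  # 0.80
```

The paraphrase shares no words with the reference and scores 0, while the sentence that says the opposite differs by one word and scores 0.80. That is the gap a semantic or LLM-based evaluator is meant to close.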