Hi.
First of all, thanks for this amazing contribution to RAG hallucination detection.
I tested the 8B version and found that a single model response can contain both PASS and FAIL scores for the same QUESTION and CONTEXT. Here is an example of such a response:
{"REASONING": ['The CONTEXT outlines the steps to resolve disputes within ABC Company.', 'It mentions following the grievance procedure and then pursuing legal action if the dispute cannot be resolved through the procedure.', 'The CONTEXT also emphasizes the importance of open and respectful communication and the role of HR in assisting with conflict resolution.', 'The ANSWER accurately reflects these points by stating the resolution involves following the grievance procedure, the option to pursue legal action, and the encouragement of open communication with HR support.', 'Therefore, the ANSWER is faithful to the CONTEXT given the QUESTION.'], "SCORE": PASS}{"REASONING": ['The CONTEXT does not explicitly state that HR will assist with conflict resolution when necessary.', 'The ANSWER includes a statement about HR assistance, which is not directly supported by the CONTEXT.', 'This additional information about HR assistance introduces a detail that is not present in the original CONTEXT.'], "SCORE": FAIL}{"REASONING": ['The CONTEXT mentions that employees should address conflicts directly with their colleagues and manager.', 'The ANSWER does not include any information about addressing conflicts directly with colleagues and manager.', 'This omission means the ANSWER does not fully capture all the relevant details provided in the CONTEXT.'], "SCORE": FAIL}
I'm curious what the sequence of these REASONING/SCORE pairs means. Does the first REASONING/SCORE pair best represent the model's overall judgement?
What do you recommend for handling this situation?
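For anyone trying to reproduce this, here is a minimal sketch of how the concatenated output could be split into its individual scores for inspection. It assumes the response looks like the example above (not strictly valid JSON, with SCORE unquoted), so it uses a regular expression rather than a JSON parser. The names `extract_scores` and `summarize` are hypothetical helpers, not part of the model's tooling, and reporting the first score alongside the counts is only an illustration, not a recommended way to decide the final verdict.

```python
import re
from collections import Counter

# Matches "SCORE": PASS / "SCORE": "FAIL" (quoted or unquoted value).
# Assumption: scores are always the literal strings PASS or FAIL.
SCORE_RE = re.compile(r'"SCORE"\s*:\s*"?(PASS|FAIL)"?')


def extract_scores(response: str) -> list[str]:
    """Return every SCORE value in the order it appears in the response."""
    return SCORE_RE.findall(response)


def summarize(response: str) -> dict:
    """Report the first score and how often each score occurs."""
    scores = extract_scores(response)
    return {
        "first": scores[0] if scores else None,
        "counts": dict(Counter(scores)),
    }


if __name__ == "__main__":
    # Shortened stand-in for the response shown above.
    sample = (
        '{"REASONING": [\'...\'], "SCORE": PASS}'
        '{"REASONING": [\'...\'], "SCORE": FAIL}'
        '{"REASONING": [\'...\'], "SCORE": FAIL}'
    )
    print(summarize(sample))
```

This only makes the disagreement visible; whether the first pair, a majority vote, or something else should be treated as the final judgement is exactly what I'm asking about.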