Needs an additional model: it takes more GPU memory.
No scoring criteria: so, what score value counts as a hallucination?
What do you think? I think it may be useful as a test metric or something similar.
But it cannot be used for a real-time API service (for example, if I want to check for hallucination, I cannot respond to the client until the check finishes). I think it needs additional hardware resources and adds response time.
The method is sampling-based, so it requires the LLM to generate several samples. Depending on what you do, I assume that for applications requiring real-time outputs it could be difficult to make this process fast enough.
One can optimise the additional module for performance, e.g., through distillation. In this work, we addressed a research problem: understanding how well this method performs. For production, one can optimise the additional modules further.
To use the score, you can think of the threshold as a hyperparameter (i.e., your operating point). You can set it on your development data or use a heuristic (e.g., a lower threshold gives higher recall).
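As a rough illustration of setting the threshold on development data, here is a minimal sketch. It assumes that a higher score means "more likely hallucinated" and that you have a small labelled dev set. The function name `pick_threshold`, the `min_recall` parameter, and the example numbers are all hypothetical and not part of the method itself.

```python
def pick_threshold(scores, labels, min_recall=0.9):
    """Return the highest threshold whose recall on dev data is >= min_recall.

    scores: hallucination scores (assumption: higher = more likely hallucinated)
    labels: 1 if the item is a hallucination, 0 otherwise (hypothetical dev labels)
    """
    positives = sum(labels)
    if positives == 0:
        raise ValueError("dev set needs at least one hallucinated example")

    best = None
    # Candidate thresholds are the observed scores, in ascending order.
    for t in sorted(set(scores)):
        # An item is flagged as a hallucination when its score is >= t.
        true_pos = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        recall = true_pos / positives
        if recall >= min_recall:
            # Recall never increases as t grows, so the last hit is the highest.
            best = t
    return best


# Made-up dev data, only to show the call.
dev_scores = [0.1, 0.2, 0.35, 0.6, 0.8, 0.9]
dev_labels = [0, 0, 1, 0, 1, 1]

strict = pick_threshold(dev_scores, dev_labels, min_recall=1.0)
relaxed = pick_threshold(dev_scores, dev_labels, min_recall=0.6)
print(strict, relaxed)
```

Demanding higher recall pushes the chosen threshold down, which matches the "low threshold, higher recall" heuristic; the cost is that more non-hallucinated items get flagged as well.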