potsawee / selfcheckgpt

SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models
MIT License

Is this method actually useful in the real world? #13

Closed: YooSungHyun closed this issue 1 year ago

YooSungHyun commented 1 year ago
  1. The LLM must be sampled multiple times, which takes a long time.
  2. An additional model is needed, which takes more GPU memory.
  3. There are no scoring criteria, so at what score should an output count as a hallucination?

What do you think? I think this could be useful as an offline test metric, but it doesn't seem usable for a real-time API service (e.g., if I want to check for hallucination before responding to the client), since it requires additional hardware resources and extra response time.

YooSungHyun commented 1 year ago

And two more questions:

  1. If the LLM hallucinates heavily but consistently across samples, the sentences will still agree with each other, so I think the QAG and BERTScore scores will be close to 0 (i.e., not flagged as hallucination).
  2. What if the LLM output is very short (e.g., just a one-sentence passage)? Does the method still work well?

What do you think?

potsawee commented 1 year ago

@YooSungHyun

  1. The method is sampling-based, so it requires the LLM to generate samples. Depending on your use case, I assume it could be difficult to make this process fast enough for applications that require real-time outputs (see the first sketch after this list).
  2. One can optimise the additional modules for performance, e.g., via distillation. In this work, we addressed the research question of how well this method performs; for production, the additional modules can be optimised further.
  3. To use the score, you can treat the threshold as a hyperparameter (i.e., your operating point) and set it on your development data, or use a heuristic (e.g., a lower threshold for higher recall); see the second sketch below.
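
A minimal sketch of the sampling step, following the usage shown in this repo's README; `generate` and `split_into_sentences` are hypothetical placeholders for your own LLM call and sentence splitter:

```python
from selfcheckgpt.modeling_selfcheck import SelfCheckBERTScore

prompt = "Tell me about the Eiffel Tower."

# The response to check, plus N stochastic samples of the same prompt
# (temperature > 0 so the samples vary). These N extra LLM calls are
# the main latency cost of the method.
response = generate(prompt, temperature=0.0)                     # hypothetical LLM call
samples = [generate(prompt, temperature=1.0) for _ in range(3)]  # N = 3 here

selfcheck = SelfCheckBERTScore(rescale_with_baseline=True)
scores = selfcheck.predict(
    sentences=split_into_sentences(response),  # hypothetical splitter; one score per sentence
    sampled_passages=samples,
)
# Higher score -> the sentence is less consistent with the samples ->
# more likely hallucinated.
```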
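
And a minimal sketch of picking the operating point on labelled development data; this is not from the paper, just one common way to choose a threshold for a target recall:

```python
import numpy as np

def pick_threshold(dev_scores, dev_labels, target_recall=0.9):
    """Return the highest threshold whose recall on the dev set meets the target.

    dev_scores: SelfCheckGPT sentence scores (higher = more suspicious)
    dev_labels: 1 if the sentence was annotated as hallucinated, else 0
    """
    scores = np.asarray(dev_scores, dtype=float)
    labels = np.asarray(dev_labels, dtype=int)
    n_pos = labels.sum()
    best = scores.min()  # fallback: flag every sentence
    for t in np.unique(scores):
        recall = labels[scores >= t].sum() / n_pos
        if recall >= target_recall:
            best = max(best, t)  # recall only shrinks as t grows
    return best

# Toy dev data: SelfCheckGPT scores and 0/1 hallucination labels.
dev_scores = [0.05, 0.20, 0.35, 0.60, 0.80, 0.90]
dev_labels = [0, 0, 1, 0, 1, 1]
threshold = pick_threshold(dev_scores, dev_labels, target_recall=0.9)
print(threshold)  # 0.35 -- flag sentences scoring >= threshold
```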
YooSungHyun commented 1 year ago

I think that answers the question, thank you.