lm-sys / FastChat

An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.
Apache License 2.0

Inquiry: Utilizing LLM-as-a-Judge for Dynamic Evaluation of Simple Dialogue Systems #2775

Open Mikeygoldman1 opened 9 months ago

Mikeygoldman1 commented 9 months ago

Hello,

I recently came across the "llm_judge" implementation and was intrigued by the concept of using large language models (LLMs) as judges to evaluate chat assistants. I'd like to understand whether it can be applied to evaluate a simple dialogue system dynamically; specifically, whether it is feasible to score each assistant response against the user's preceding input within a multi-turn conversation.
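To make the question concrete, here is a rough sketch of the kind of per-turn, context-aware grading I have in mind, loosely modeled on MT-bench-style single-answer grading. The prompt wording, the `judge_turn` helper, the GPT-4 judge, and the `Rating: [[N]]` extraction are all my own assumptions for illustration, not APIs from this repo:

```python
import re

from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Judge prompt loosely modeled on MT-bench single-answer grading;
# the exact wording here is my own, not FastChat's.
JUDGE_PROMPT = (
    "Please act as an impartial judge and evaluate the quality of the "
    "assistant's latest response, taking the preceding user turns into "
    "account. Be objective. After a brief explanation, output your "
    'rating strictly in the format "Rating: [[N]]", where N is 1-10.\n\n'
    "{dialogue}\n\n[Assistant's latest response]\n{response}"
)


def judge_turn(history: list[tuple[str, str]], response: str,
               judge_model: str = "gpt-4") -> int | None:
    """Rate one assistant turn given the prior (role, text) history."""
    dialogue = "\n".join(f"[{role}]\n{text}" for role, text in history)
    completion = client.chat.completions.create(
        model=judge_model,
        temperature=0,  # temperature 0 for (near-)deterministic judging
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(dialogue=dialogue,
                                           response=response),
        }],
    )
    verdict = completion.choices[0].message.content or ""
    match = re.search(r"Rating:\s*\[\[(\d+)\]\]", verdict)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    history = [("user", "What's a good way to start learning Rust?")]
    reply = ("Start with the official book, then build small CLI tools "
             "to practice ownership and borrowing.")
    print(judge_turn(history, reply))
```

The open question for me is whether llm_judge already provides something equivalent for arbitrary live conversations, or whether I would need to adapt its prompts along these lines.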

The ability to dynamically evaluate responses in a dialogue system seems crucial for developing more intuitive, user-aligned chatbots. If this repository and the methodologies it describes (such as MT-bench and Chatbot Arena) support such dynamic evaluation, I would appreciate more insight into how, and into any limitations.
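For reference, my current understanding from fastchat/llm_judge/README.md is that the standard MT-bench flow looks roughly like this (the model ID below is just a placeholder):

```bash
cd fastchat/llm_judge
python gen_model_answer.py --model-path lmsys/vicuna-7b-v1.5 --model-id vicuna-7b-v1.5
python gen_judgment.py --model-list vicuna-7b-v1.5 --parallel 2
python show_result.py
```

If I read it correctly, this grades answers to a fixed set of benchmark questions rather than arbitrary live dialogues, which is why I'm asking about the dynamic case.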

Alternatively, if this repository isn't suited to dynamically evaluating simple dialogue systems, any guidance or pointers to resources or alternative approaches would be greatly appreciated.

Thank you for your time and assistance.

OmriBenShoham commented 9 months ago

Following 👍