One of the major issues I've been running into is evaluating outputs from retrieval-augmented generation (RAG) over a document, especially when it's difficult to obtain baseline human responses.
Is this one of the intended use cases for this plugin?
One (potentially wild) thought:
Have a jury of three diverse LLMs evaluate the output given the context (assuming Condorcet's Jury Theorem still holds when LLMs make up the jury panel).
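A minimal sketch of what I have in mind, in Python. The model names and the `ask_judge` helper are placeholders for whatever provider client you'd actually use; the only real logic is a majority vote over an odd-sized panel:

```python
from collections import Counter

JUDGE_PROMPT = (
    "Given the retrieved context and the generated answer below, "
    "reply with exactly one word: PASS if the answer is faithful to "
    "the context, FAIL otherwise.\n\n"
    "Context:\n{context}\n\nAnswer:\n{answer}\n"
)

def ask_judge(model: str, prompt: str) -> str:
    """Hypothetical stand-in for a call to an LLM provider; returns 'PASS' or 'FAIL'."""
    raise NotImplementedError("Wire this up to your own model client.")

def jury_verdict(context: str, answer: str,
                 judges=("model-a", "model-b", "model-c")) -> str:
    """Return the majority verdict from an odd-sized panel of diverse judge models."""
    prompt = JUDGE_PROMPT.format(context=context, answer=answer)
    votes = [ask_judge(model, prompt).strip().upper() for model in judges]
    verdict, _count = Counter(votes).most_common(1)[0]
    return verdict
```

The idea is that if each judge is better than chance and they err somewhat independently (hence the diverse models), the majority vote should be more reliable than any single judge.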