@rben01 pointed out something very interesting: due to the stochasticity of LLM outputs, any given metric will in fact be a random variable, i.e. have some nonzero variance, which is undesirable because it implies low precision in evaluating your LLM. A lower-variance estimator can be implemented by brute force, analogous to bootstrapping: run the LLM N times, score each output with a metric far cheaper to compute than the LLM itself, and report the sample mean, whose variance shrinks as 1/N, together with the sample variance as a measure of the residual noise.
This applies whenever the LLM API in use does not guarantee reproducible outputs, or stochasticity is explicitly requested via `temperature` or a related inference parameter.
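
For concreteness, here is a minimal Python sketch of this estimator; `call_llm` and `compute_metric` are hypothetical stand-ins for your actual LLM call and evaluation metric, and `n` controls the cost/precision trade-off:

```python
import statistics


def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a stochastic LLM API call."""
    raise NotImplementedError


def compute_metric(output: str) -> float:
    """Hypothetical stand-in for a metric that is cheap to compute
    relative to the LLM call itself."""
    raise NotImplementedError


def estimate_metric(prompt: str, n: int = 30) -> tuple[float, float]:
    """Run the LLM n times and return (sample mean, standard error).

    The mean is a lower-variance estimator than any single run:
    averaging n i.i.d. runs divides the estimator's variance by n.
    """
    scores = [compute_metric(call_llm(prompt)) for _ in range(n)]
    mean = statistics.fmean(scores)
    # The sample standard deviation quantifies the per-run stochasticity;
    # dividing by sqrt(n) gives the standard error of the reported mean.
    std_err = statistics.stdev(scores) / n**0.5
    return mean, std_err
```

Since the standard error falls off as 1/√n, quadrupling the number of runs halves the uncertainty in the reported metric, which lets you trade API cost directly for evaluation precision.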