lastmile-ai / semantic-retrieval

[Evaluation] Consider "variance" over LLM output trials #204

Open jonathanlastmileai opened 11 months ago

jonathanlastmileai commented 11 months ago

@rben01 pointed out something very interesting: due to the stochasticity of LLM outputs, any given evaluation metric is in fact a random variable (RV), i.e. it has some nonzero variance. This is undesirable because it implies low precision in evaluating your LLM. A lower-variance estimator can be implemented by brute force, analogous to bootstrapping: run the LLM N times on the same input, compute the metric on each output, and report the sample mean along with the sample variance; computing these statistics is far cheaper than the LLM calls themselves.
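A minimal sketch of that brute-force estimator, assuming hypothetical `run_llm` and `metric` callables (neither is part of this repo's API):

```python
import statistics
from typing import Callable

def metric_over_trials(
    run_llm: Callable[[str], str],   # hypothetical: one stochastic LLM call
    metric: Callable[[str], float],  # hypothetical: scores a single output
    prompt: str,
    n_trials: int = 20,
) -> tuple[float, float]:
    """Run the LLM n_trials times on the same prompt and report the
    metric's sample mean and unbiased sample variance across trials."""
    scores = [metric(run_llm(prompt)) for _ in range(n_trials)]
    return statistics.fmean(scores), statistics.variance(scores)
```

Since the sample mean of N i.i.d. trials has variance σ²/N, the standard error of the reported score shrinks by a factor of √N, and the sample variance directly quantifies how stable the metric is for a given prompt.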

This applies whenever the LLM API in use does not guarantee reproducible outputs, or whenever stochasticity is explicitly requested via temperature or a related inference parameter.
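For illustration, a minimal sketch of requesting low-stochasticity output, assuming the OpenAI Python SDK (v1+); the model name, prompt, and seed are placeholders, and OpenAI documents seeded determinism as best-effort only, which is exactly why the repeated-trials estimate above is still worth reporting:

```python
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Summarize the retrieved documents."}],
    temperature=0,        # suppress sampling randomness
    seed=42,              # best-effort reproducibility across calls
)
print(response.choices[0].message.content)
```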