simonw / llm-evals-plugin

Run evals using LLM

Researching evaluations #9

Open irthomasthomas opened 5 months ago

irthomasthomas commented 5 months ago

Hi Simon,

In case it's useful, I am collecting research notes on LLM evaluation frameworks for my own needs. I have collected over 11k prompts in my llm CLI database, and I would love to mine those for evaluation data. My own evaluation needs are complex: I am trying to evaluate multi-agent interactions, so I need to score each individual response as well as the overall conversation and its progress towards a goal. My first attempt at automated evals using gemini-1.5-pro-latest just produced random scores; running the same test multiple times gives different scores, even with temperature 0. Opus is a lot better, so I created a grading rubric and had Opus grade 10 examples and justify each one. My hope is that I can take advantage of the Gemini context length and feed it the graded samples from Opus to improve its grading. Opus is too expensive to use for the whole project.
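
Roughly, what I have in mind looks like the sketch below, using the llm Python API. It is only a sketch: the rubric text, the file names, and the assumption that the Gemini plugin accepts a temperature option are my placeholders, not a working implementation.

```python
import llm

# Placeholder rubric - the real one scores each response and overall progress.
RUBRIC = """Score the response from 1-5 for relevance, accuracy and progress
towards the stated goal. Justify each score in one sentence, then finish with
"TOTAL: <n>"."""

# Few-shot guidance: Opus-graded examples with justifications (hypothetical file).
OPUS_EXAMPLES = open("opus_graded_examples.txt").read()

def grade(transcript: str, model_id: str = "gemini-1.5-pro-latest") -> str:
    """Grade one conversation transcript with a cheaper model, primed by Opus examples."""
    model = llm.get_model(model_id)
    response = model.prompt(
        OPUS_EXAMPLES + "\n\nNow grade this conversation:\n\n" + transcript,
        system=RUBRIC,
        temperature=0,  # assumption: the plugin exposes a temperature option
    )
    return response.text()

print(grade(open("conversation_001.txt").read()))
```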

Before I go too far with that, I wanted to collect some research notes. I keep my bookmarks and notes in GitHub issues, and I embed them using your llm CLI and a Jina embeddings model.
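
The embedding side is just the standard llm collections workflow; in Python it looks something like this (the collection name, database path and example records are made up, and it assumes the llm-embed-jina plugin is installed):

```python
import llm
import sqlite_utils

# Store embeddings in a collection backed by SQLite.
db = sqlite_utils.Database("notes.db")
collection = llm.Collection("gh-issues", db, model_id="jina-embeddings-v2-small-en")

# (id, text) pairs - in practice these come from the GitHub issues.
notes = [
    ("823", "GPTScore: Evaluate as You Desire - reading notes"),
    ("824", "Another LLM evaluation paper"),
]
collection.embed_multi(notes, store=True)

# Retrieve the notes most similar to a query.
for entry in collection.similar("LLM evaluation frameworks", number=3):
    print(entry.id, entry.score)
```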

To help me read papers, I put together a quick app: https://github.com/irthomasthomas/clipnotes. It monitors the clipboard while I read and collects the copied items to a file for further processing.
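
The core of it is just a polling loop. A trimmed-down sketch of the idea (pyperclip and the poll interval are choices I have made here for illustration, not necessarily what clipnotes ships):

```python
import time
import pyperclip  # any clipboard library would do

def collect_clips(outfile: str = "clips.txt", interval: float = 1.0) -> None:
    """Poll the clipboard and append anything new to a file."""
    last = pyperclip.paste()
    while True:
        current = pyperclip.paste()
        if current != last and current.strip():
            with open(outfile, "a", encoding="utf-8") as f:
                f.write(current.rstrip() + "\n---\n")
            last = current
        time.sleep(interval)

if __name__ == "__main__":
    collect_clips()
```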

The first paper is "GPTScore: Evaluate as You Desire". It includes tests with some older model architectures, like encoder-decoder models, which aren't much used today, so I ignored those and focused on the results for the GPT-3 models.
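
For reference, the core idea as I read the paper: GPTScore rates a generated text $h$ by the weighted log-probability the evaluator model assigns to it, conditioned on an evaluation prompt $T$ built from the task description $d$, the aspect $a$ being evaluated, and the context $S$:

$$\text{GPTScore}(h \mid d, a, S) = \sum_{t=1}^{m} w_t \log p\big(h_t \mid h_{<t}, T(d, a, S), \theta\big)$$

with the token weights $w_t$ typically uniform, so a higher likelihood under the instruction means a better score for that aspect.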

https://github.com/irthomasthomas/undecidability/issues/823

You might find some other useful links under the llm-evaluation label.

Ta, Thomas