latitude-dev / latitude-llm

Latitude is the open-source prompt engineering platform to build, evaluate, and refine your prompts with AI
https://latitude.so
GNU Lesser General Public License v3.0

Evaluations – Remove conflicts in evaluation objective between configuration and prompt #420

Open · samulatitude opened this issue 3 weeks ago

samulatitude commented 3 weeks ago

What?

Right now, there is a config when creating an evaluation that sets the expected result (a number between 1 and 5), but we don't pass this to the prompt, so the user can set a range between 9 and 20 in the prompt itself, and that is the range that actually gets taken into account.

In summary, there are 2 sources of truth.
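For illustration, a minimal sketch of the mismatch (none of these names or shapes come from the actual schema; they are made up just to show the conflict):

```typescript
// Hypothetical illustration of the two sources of truth.
const evaluationConfiguration = {
  resultType: 'number',
  range: { from: 1, to: 5 }, // what the configuration says the result should be
}

// The configuration is never injected into the prompt, so whatever range the
// user writes here is the one the judge model actually follows.
const evaluationPrompt = `
  Evaluate the assistant response and return a score between 9 and 20.
`

console.log(evaluationConfiguration, evaluationPrompt)
```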

https://www.figma.com/design/ODioXiqX8aeDMonsh0HBui/Latitude-Cloud?node-id=2738-34189&t=C31y3Hbykh3pzF2x-4

csansoon commented 1 week ago

The plan

To do this, the schema must change.

Now, each evaluation will have 2 polymorphic relations: metadataType and resultType

For now, there will be 2 EvaluationMetadataTypes:

And 3 EvaluationResultConfigurations, which will depend on a ResultableType:

The evaluation will expect results depending on the resultType, and will have different behaviour depending on its type.

This allows for many more types of evaluations in the future, whether llmAsJudge or any other type (like Human in the Loop), while maintaining the resultable types we have now.

EvaluationResults will stay the same, as the current table still fits the use case.
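A rough sketch of how the two polymorphic relations could look on the evaluation record. These are plain TypeScript types just to illustrate the shape; the column and variant names are assumptions, and since the three resultable types are not listed in this thread, number/text/boolean below are assumptions too:

```typescript
// Sketch only: the two polymorphic relations described above.
type EvaluationMetadataType = 'llm_as_judge_advanced' | 'llm_as_judge'
type EvaluationResultableType = 'number' | 'text' | 'boolean'

interface Evaluation {
  id: number
  name: string
  // Polymorphic relation #1: which metadata table/row describes how the evaluation runs.
  metadataType: EvaluationMetadataType
  metadataId: number
  // Polymorphic relation #2: which configuration table/row describes the expected result.
  resultType: EvaluationResultableType
  resultConfigurationId: number
}

// One configuration table per resultable type, e.g. for numeric results:
interface EvaluationConfigurationNumerical {
  id: number
  minValue: number
  maxValue: number
}
```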

Development breakdown

Part 1 — EvaluationMetadataLlmAsJudgeAdvanced

In this first part, I'll focus on modifying and migrating to the new EvaluationMetadataLlmAsJudgeAdvanced schema.

This type does not require a resultConfiguration yet, since that is already defined inside the configuration JSON. I'll just move this JSON to the EvaluationMetadataLlmAsJudgeAdvanced table for advanced usage.

The migration is deployed at a separate time from the code, so we cannot assume the code will only ever run before the migration or only after it; it has to work in both states. To address this, this part is divided into 4 PRs:
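As an illustration of what a backward-compatible step in that sequence might look like, here is a sketch of a read path that works on both sides of the migration. The row shapes, column names and helper are assumptions, not the actual repository code:

```typescript
// Sketch: resolve the LLM-as-judge configuration whether or not the migration
// has run yet. `configuration` is the legacy JSON column; `metadata` is the
// new EvaluationMetadataLlmAsJudgeAdvanced row.
type LlmAsJudgeConfiguration = Record<string, unknown>

interface LegacyEvaluationRow {
  id: number
  configuration: LlmAsJudgeConfiguration | null // legacy column, dropped later
}

interface EvaluationMetadataLlmAsJudgeAdvancedRow {
  id: number
  configuration: LlmAsJudgeConfiguration
}

function resolveConfiguration(
  evaluation: LegacyEvaluationRow,
  metadata: EvaluationMetadataLlmAsJudgeAdvancedRow | null,
): LlmAsJudgeConfiguration {
  // After the migration the metadata row exists and wins; before it, fall back
  // to the legacy JSON column so the same code works during both deploy phases.
  return metadata?.configuration ?? evaluation.configuration ?? {}
}
```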

Part 2 — EvaluationMetadataLlmAsJudge and EvaluationConfiguration tables

Here, I'll create the EvaluationMetadataLlmAsJudge table and one table for each EvaluationConfiguration result type. I'll also modify the EvaluationDto type and the EvaluationRepository to return the new type.
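One way the resulting EvaluationDto could be shaped is as a discriminated union over the metadata type. This is a sketch, not the actual type in the repo; the variant and field names are assumptions based on the plan above:

```typescript
// Sketch: EvaluationDto as a discriminated union over metadataType.
interface EvaluationBaseDto {
  id: number
  name: string
  resultType: 'number' | 'text' | 'boolean' // assumed resultable types
  resultConfiguration: Record<string, unknown>
}

type EvaluationDto =
  | (EvaluationBaseDto & {
      metadataType: 'llm_as_judge_advanced'
      metadata: { prompt: string } // full judge prompt written by the user
    })
  | (EvaluationBaseDto & {
      metadataType: 'llm_as_judge'
      metadata: { objective: string } // simple objective; the prompt is generated
    })
```

The repository can then join the right metadata and configuration table depending on these discriminants.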

Deployment is split into five steps:

Part 3 — New UI

Here I'll create the services and UI to create the new types of evaluations, although they won't be used in production yet.
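For instance, the creation flow could go through a service that receives the metadata and result configuration together, so there is a single source of truth for the expected result. The service name, parameters and stub below are assumptions for illustration only:

```typescript
// Hypothetical shape of the create-evaluation service used by the new UI.
interface CreateEvaluationInput {
  name: string
  metadataType: 'llm_as_judge' | 'llm_as_judge_advanced'
  metadata: { objective: string } | { prompt: string }
  resultConfiguration: { type: 'number'; minValue: number; maxValue: number }
}

async function createEvaluation(input: CreateEvaluationInput): Promise<{ id: number }> {
  // Stub: the real service would insert the metadata row, the result
  // configuration row, and the evaluation row pointing at both.
  console.log('creating evaluation', input)
  return { id: 1 }
}

// Example call from the new creation flow: the expected result range now lives
// only in resultConfiguration, never in the prompt.
createEvaluation({
  name: 'Answer quality',
  metadataType: 'llm_as_judge',
  metadata: { objective: 'Rate how well the answer addresses the question' },
  resultConfiguration: { type: 'number', minValue: 1, maxValue: 5 },
}).then(({ id }) => console.log('created evaluation', id))
```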

Part 4 — Migration

Finally, swap the options for creating evaluations over to the new simple types.