NivekT opened 11 months ago
@NivekT I think if we add this with #31, it might be faster and we can build on top of LlamaIndex's evals.
What do you think?
So we want to accept an array of models and evaluate against all of them, right?
@rachittshah We can consider adding LlamaIndex's eval if it integrates well with the pattern we have here. Feel free to propose something and we can have a look.
@Divij97 That can be part of it, but each of the eval functions linked above currently only supports OpenAI or Anthropic.
I will update the main issue to break the request into pieces that are easier for first-time contributors to work on.
I have updated the ask to be bite-sized. Feel free to comment if anything is unclear!
I think I understand what our goal is. Can you please assign this to me?
@Divij97 Sure! Let us know if you plan to work on all 4 subtasks or a specific one. Feel free to pick whichever you think you can contribute to. Thanks!
I'd love to help add support for new models using https://github.com/BerriAI/litellm. Let me know if I can help out on this too.
@ishaan-jaff Awesome! Could you create an issue for it and I can assign that to you? I think the best approach would be to create a LitellmExperiment; you can follow some examples we've done for other APIs:
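As a rough illustration of the litellm idea (a hedged sketch, not the prompttools examples referenced above): litellm exposes a single completion call that routes to OpenAI, Anthropic, and other providers based on the model string, so a LitellmExperiment could wrap it along these lines. The function names here are assumptions for illustration.

```python
def build_messages(prompt: str) -> list:
    # litellm accepts the same chat-style messages format as OpenAI.
    return [{"role": "user", "content": prompt}]

def run_prompt(model: str, prompt: str) -> str:
    # Sketch under assumptions: requires `pip install litellm` plus the
    # provider API key for whichever model string you pass in.
    from litellm import completion
    resp = completion(model=model, messages=build_messages(prompt))
    return resp["choices"][0]["message"]["content"]
```

Swapping the model string between, say, gpt-4 and claude-2 would then exercise different providers without changing the experiment code.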
Hi @steventkrawczyk, I have made the required changes but I am not able to push. I signed the CLA, but the push still returns a 403 for me.
@Divij97 you will need to push to a fork and raise a PR from the fork to the original repo
Never mind, it was a keychain issue with my Mac. Can you please take a look at this PR: https://github.com/hegelai/prompttools/pull/59? It's not complete, but I wanted to check whether I am headed in the right direction.
Wanted to give an update on my progress: I am done with all the changes, but I need some help testing them against Anthropic. How do I generate an Anthropic key to test?
🚀 The feature
This is a good task for a new contributor
We have a few utility functions to perform AutoEval:
https://github.com/hegelai/prompttools/blob/main/prompttools/utils/autoeval.py
https://github.com/hegelai/prompttools/blob/main/prompttools/utils/autoeval_scoring.py
https://github.com/hegelai/prompttools/blob/main/prompttools/utils/expected.py
Currently, they each tend to support only one model. Someone can refactor each of them to support multiple models. I would recommend making sure they all support the best-known models, such as GPT-4 and Claude 2. We can even consider LLaMA, but that is less urgent.
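The refactor could look something like the following sketch: a single model argument that dispatches to the right per-provider completion function. The stub names and dispatch table below are illustrative assumptions, not prompttools' actual internals.

```python
# Hypothetical sketch of a model-agnostic binary autoeval, assuming a
# judge prompt that asks the model to answer RIGHT or WRONG.

def _openai_eval(prompt: str, response: str) -> str:
    # Stub: the real refactor would call OpenAI's chat completion API here.
    raise NotImplementedError("wire up the OpenAI client")

def _anthropic_eval(prompt: str, response: str) -> str:
    # Stub: the real refactor would call Anthropic's completion API here.
    raise NotImplementedError("wire up the Anthropic client")

# One dispatch table maps each supported judge model to its provider call.
EVAL_FNS = {
    "gpt-4": _openai_eval,
    "claude-2": _anthropic_eval,
}

def autoeval_binary_scoring(prompt: str, response: str, model: str = "gpt-4") -> float:
    """Return 1.0 if the judge model answers RIGHT, else 0.0."""
    if model not in EVAL_FNS:
        raise ValueError(f"Unsupported eval model: {model}")
    verdict = EVAL_FNS[model](prompt, response)
    return 1.0 if "RIGHT" in verdict.upper() else 0.0
```

Adding a new judge model then only requires registering one more entry in the dispatch table, which keeps the public signature stable across providers.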
Tasks

- autoeval_binary_scoring can take in model as an argument. Let's make sure gpt-4 and claude-2 are both accepted and invoke the right completion function.
- The same for autoeval_scoring. OpenAI needs to be added here.
- compute_similarity_against_model should support multiple models (gpt-4 and claude-2) at the same time.

Motivation, pitch
Allowing people to auto-evaluate with any of the best models would be ideal.
Alternatives
No response
Additional context
No response