Open steventkrawczyk opened 1 year ago
Can I work on this?
@LuvvAggarwal Sure thing. The scope of this one is a bit large because we currently don't have any common benchmarks. I think a simple case would be the following:
- Add a `benchmarks` directory to `prompttools`
- Add the benchmark datasets under `prompttools/data`
- Some benchmarks to start with would be HellaSwag and TruthfulQA, or perhaps simpler metrics like ROUGE and BLEU

Feel free to deviate from this plan; it's just a suggestion for how to get started.
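A minimal sketch of what one such benchmark runner could look like, assuming a hypothetical `prompttools/benchmarks/hellaswag.py` module and the field names used by the Hugging Face `hellaswag` dataset (`ctx`, `endings`, `label`); none of this is existing prompttools code.

```python
# Hypothetical layout (illustrative only, not the repo's actual structure):
#   prompttools/benchmarks/   - benchmark runners (e.g. HellaSwag, TruthfulQA)
#   prompttools/data/         - cached benchmark datasets
#
# prompttools/benchmarks/hellaswag.py (sketch)
from typing import Callable, List


def run_hellaswag(complete: Callable[[str], str], examples: List[dict]) -> float:
    """Score a completion function on HellaSwag-style multiple-choice examples.

    Each example is assumed to have 'ctx', 'endings', and 'label' fields,
    mirroring the Hugging Face `hellaswag` dataset schema.
    """
    if not examples:
        return 0.0
    correct = 0
    for ex in examples:
        prompt = ex["ctx"] + "\nChoose the best ending (answer with its number):\n" + "\n".join(
            f"{i}. {ending}" for i, ending in enumerate(ex["endings"])
        )
        answer = complete(prompt).strip()
        # Naive parsing: take the first digit the model returns.
        predicted = next((ch for ch in answer if ch.isdigit()), None)
        if predicted is not None and int(predicted) == int(ex["label"]):
            correct += 1
    return correct / len(examples)
```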
Thanks @steventkrawczyk for the guidance. Based on my initial research, I found a package, "Evaluate", that provides methods for evaluating models. Link to the package: https://huggingface.co/docs/evaluate/index. I was thinking of using it.
Please feel free to suggest better approaches, as I am new to ML but would love to contribute.
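For reference, a minimal sketch of computing ROUGE with the `evaluate` package mentioned above; the example strings are made up, and the `rouge_score` backend needs to be installed separately.

```python
# Requires: pip install evaluate rouge_score
import evaluate

rouge = evaluate.load("rouge")
predictions = ["the cat sat on the mat"]
references = ["the cat is on the mat"]

# Returns a dict with rouge1 / rouge2 / rougeL / rougeLsum scores.
print(rouge.compute(predictions=predictions, references=references))
```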
@steventkrawczyk, can we use the "Datasets" library for loading the benchmark datasets instead of creating a separate directory? Link to the library: https://github.com/huggingface/datasets
It can also be used for quick tests on prebuilt datasets.
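A quick sketch of loading a benchmark split with the `datasets` library; the split and column names follow the Hugging Face `hellaswag` dataset, and some library versions may additionally require `trust_remote_code=True`.

```python
# Requires: pip install datasets
from datasets import load_dataset

# HellaSwag validation split; each row has 'ctx', 'endings', and 'label'.
hellaswag = load_dataset("hellaswag", split="validation")

# A small slice is enough for a quick smoke test.
sample = hellaswag.select(range(10))
print(sample[0]["ctx"])
print(sample[0]["endings"], sample[0]["label"])
```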
@LuvvAggarwal Using `datasets` sounds like a good start. As for `evaluate`, we want to write our own eval methods that support more than just Hugging Face models (e.g. OpenAI, Anthropic).
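One way to keep eval methods provider-agnostic, sketched below, is to pass each model in as a plain `complete(prompt) -> str` callable; the OpenAI call shown assumes the `openai>=1.0` SDK and an API key in the environment, and is only illustrative.

```python
# Sketch: wrap different providers behind the same completion interface,
# so the same benchmark code can score OpenAI, Anthropic, or local models.
from typing import Callable


def openai_completer(model: str = "gpt-3.5-turbo") -> Callable[[str], str]:
    # Assumes the openai>=1.0 SDK and OPENAI_API_KEY set in the environment.
    from openai import OpenAI

    client = OpenAI()

    def complete(prompt: str) -> str:
        resp = client.chat.completions.create(
            model=model, messages=[{"role": "user", "content": prompt}]
        )
        return resp.choices[0].message.content or ""

    return complete


# Any eval method then just takes a completion function, e.g.:
# score = run_hellaswag(openai_completer(), examples)   # see the earlier sketch
```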
@steventkrawczyk Sure, but I have no experience with eval methods. It would be great if you could share some references so I can start coding. Thanks.
For example, if you are using the HellaSwag dataset, we need to compute the accuracy of the predictions, e.g. https://github.com/openai/evals/blob/main/evals/metrics.py#L12
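In the same spirit as the linked helper (not a copy of it), a simple exact-match accuracy metric could look like this:

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the reference answer."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must be the same length")
    if not predictions:
        return 0.0
    matches = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return matches / len(predictions)
```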
@LuvvAggarwal I kick-started the code for benchmarks here if you would like to branch off it: https://github.com/hegelai/prompttools/pull/72
Thanks @HashemAlsaket, I will branch off it.
🚀 The feature
We need to add benchmark test sets so folks can run them against models / embeddings / systems.
A few essentials:
Motivation, pitch
Users have told us that they want to run academic benchmarks as "smoke tests" on new models.
Alternatives
No response
Additional context
No response