vmesel opened this issue 6 months ago
https://github.com/confident-ai/deepeval?tab=readme-ov-file#writing-your-first-test-case
this week @lgabs shared this project with me; apparently we can use it to test our prompts
Yes, I saw the README on ragtalks and forgot to attach the link; that's the library I was looking for.
Yeah, I haven't been able to study LLM evals in depth, but from what I've seen the community typically evaluates LLM applications with several standard metrics, where another LLM judges the application's outputs against the expected results (this evaluator could even be a free local LLM, since its task is much simpler).
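For reference, a minimal sketch of what such a test could look like with deepeval, roughly following the "writing your first test case" section of its README (the input/output strings here are made up, and parameter names may differ between deepeval versions):

```python
# test_prompt.py -- run with pytest (or `deepeval test run test_prompt.py`)
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase


def test_refund_prompt():
    # Another LLM (the evaluator) scores how relevant the answer is to the input.
    metric = AnswerRelevancyMetric(threshold=0.7)
    test_case = LLMTestCase(
        input="What if these shoes don't fit?",                     # hypothetical user question
        actual_output="We offer a 30-day full refund at no cost.",  # output produced by the app
    )
    assert_test(test_case, [metric])
```

The evaluator model behind the metric is configurable in deepeval, so in principle it could be pointed at a cheaper or local model.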
I saved this Andrew Ng short course to watch soon; maybe it'll help.
Also, I think it would be a good idea for us to use the same common dataset for local dev and for tests that depend on a dataset (generating embeddings or even these LLM evals). One idea is to download one from Hugging Face, like this wiki_qa. What do you think? This would be a new issue, of course.
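For example, pulling that dataset locally is a one-liner with the `datasets` library (split and column names as published on the Hugging Face hub):

```python
from datasets import load_dataset

# Downloads and caches the Microsoft WikiQA dataset from the Hugging Face hub.
wiki_qa = load_dataset("wiki_qa")

# Splits: train / validation / test; each row pairs a question with a candidate
# answer and a label indicating whether the answer is correct.
print(wiki_qa["train"][0])
```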
These LLM evals change a lot depending on the domain, so I think we could just write good documentation about how to add these tests instead of trying to write generic cases for all datasets.
So I was thinking of making it possible for the developer to create new test cases, rather than having our software write generic test cases.
Think of having just a single test case: instead of having to implement it yourself inside dialog, you could just write a new toml file that enables this feature (see the sketch below this reply).
On using other LLMs, I'm not aware of what should be implemented in those cases; I'm going to watch the video to research this further.
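A rough sketch of the toml idea, assuming a hypothetical test-case file format and loader (none of these field names exist in dialog today; `tomllib` requires Python 3.11+):

```python
import tomllib

# Hypothetical contents of a developer-provided file such as tests/prompt_cases.toml.
EXAMPLE_TOML = """
[[test_case]]
name = "refund-policy"
input = "What if these shoes don't fit?"
expected_output = "We offer a 30-day full refund at no extra cost."
similarity_threshold = 0.8
"""


def load_test_cases(raw_toml: str) -> list[dict]:
    """Parse the TOML document and return the declared test cases."""
    return tomllib.loads(raw_toml).get("test_case", [])


for case in load_test_cases(EXAMPLE_TOML):
    print(case["name"], case["similarity_threshold"])
```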
We need to prevent our users from introducing regressions in their prompts without clearly noticing them. To achieve this, we must implement a way for users to run a test suite over their own test cases with a user-defined similarity score. This must be simple to set up and must be extensible beyond toml files.
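A minimal sketch of what that could look like, with the similarity scorer kept pluggable so test cases can come from sources other than toml files and smarter scorers (embedding-based, or an LLM judge) can be swapped in later; the difflib-based scorer below is only a stand-in, not a proposal for the final metric:

```python
from difflib import SequenceMatcher
from typing import Callable

# Any callable mapping (expected, actual) to a score in [0, 1] can act as the scorer,
# e.g. an embedding cosine similarity or an LLM-based judge.
Scorer = Callable[[str, str], float]


def naive_scorer(expected: str, actual: str) -> float:
    """Character-level similarity; a placeholder for a semantic metric."""
    return SequenceMatcher(None, expected, actual).ratio()


def run_case(case: dict, answer: str, scorer: Scorer = naive_scorer) -> bool:
    """Return True when the produced answer meets the user-defined threshold."""
    return scorer(case["expected_output"], answer) >= case["similarity_threshold"]


case = {
    "expected_output": "We offer a 30-day full refund at no extra cost.",
    "similarity_threshold": 0.8,
}
print(run_case(case, "We offer a 30-day full refund at no extra costs."))
```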