hegelai / prompttools

Open-source tools for prompt testing and experimentation, with support for both LLMs (e.g. OpenAI, LLaMA) and vector databases (e.g. Chroma, Weaviate, LanceDB).
http://prompttools.readthedocs.io
Apache License 2.0
2.65k stars · 230 forks

Draft: Test response quality #14

Closed HashemAlsaket closed 1 year ago

HashemAlsaket commented 1 year ago

Starting as a draft to get an idea of how #10 should mature. I started with unit test-style testing.

  1. The user adjusts the config to their liking: choosing which models they want, temperatures, max_lengths, scores, etc.
  2. If the unit test does not pass, the combination is excluded from further development.
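The filtering step above could be sketched roughly as follows. This is a hypothetical illustration, not code from this PR: `run_model`, the exact-match `score` function, and the pass threshold are all assumptions made for the sketch.

```python
import itertools

def run_model(repo_id: str, temperature: float, max_length: int) -> str:
    # Hypothetical stand-in for an actual model call; returns a response string.
    return "george washington"  # placeholder response

def score(response: str, expected: str) -> float:
    # Placeholder scorer: exact match after normalization.
    return 1.0 if response.strip().lower() == expected.lower() else 0.0

def passing_combinations(repo_ids, temperatures, max_lengths,
                         expected, threshold=0.5):
    """Keep only parameter combinations whose response passes the check."""
    kept = []
    for repo_id, temp, max_len in itertools.product(
            repo_ids, temperatures, max_lengths):
        response = run_model(repo_id, temp, max_len)
        if score(response, expected) >= threshold:
            kept.append((repo_id, temp, max_len))
    return kept
```

Combinations that fail the check are simply dropped, so later experiments only run over configurations that already produce acceptable answers.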

@steventkrawczyk @NivekT

HashemAlsaket commented 1 year ago

Added starter code for getting the best response from a set of LLMs and params. Example:

```python
from prompttools.experiment.llms.compare_responses import MaxLLM

hf_model_repo_ids = [
    "google/flan-t5-xxl",
    "databricks/dolly-v2-3b",
    "bigscience/bloom",
]

temperatures = [0.01, 1.0]
max_lengths = [17, 32]

LLMs = MaxLLM(
    hf_repo_ids=hf_model_repo_ids,
    temperatures=temperatures,
    max_lengths=max_lengths,
    question="Who was the first president of the USA?",
    expected="George Washington",
)

LLMs.run()
```

Output:

```python
LLMs.best_response()
```

```
Response(repo_id='google/flan-t5-xxl', temperature=0.01, max_length=17, score=1.0000001192092896, response='george washington')
```

```python
LLMs.top_n_responses(n=9)
```

```
[Response(repo_id='google/flan-t5-xxl', temperature=0.01, max_length=17, score=1.0000001192092896, response='george washington'),
 Response(repo_id='google/flan-t5-xxl', temperature=0.01, max_length=32, score=1.0000001192092896, response='george washington'),
 Response(repo_id='google/flan-t5-xxl', temperature=1.0, max_length=17, score=1.0000001192092896, response='george washington'),
 Response(repo_id='google/flan-t5-xxl', temperature=1.0, max_length=32, score=1.0000001192092896, response='george washington'),
 Response(repo_id='databricks/dolly-v2-3b', temperature=0.01, max_length=32, score=0.6773853898048401, response='\nPresident George Washington\n\nPresident Washington was the first president of the United States'),
 Response(repo_id='databricks/dolly-v2-3b', temperature=1.0, max_length=32, score=0.6773853898048401, response='\nPresident George Washington\n\nPresident Washington was the first president of the United States'),
 Response(repo_id='bigscience/bloom', temperature=0.01, max_length=17, score=0.6547670364379883, response=' George Washington\n        Question: What is the capital of the USA?\n        Answer:  Washington, DC\n       '),
 Response(repo_id='bigscience/bloom', temperature=0.01, max_length=32, score=0.2622990906238556, response=' George Washington\n        """\n        self.assertEqual(self.parser.parse(question), [(\''),
 Response(repo_id='bigscience/bloom', temperature=1.0, max_length=17, score=0.2622990906238556, response=' George Washington\n        """\n        self.assertEqual(self.parser.parse(question), [(\'')]
```
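The ranking behavior shown above can be sketched independently of `MaxLLM`. This is not the PR's implementation, just an illustration of the interface the output implies: a `Response` record (field names taken from the output) ranked by its `score`, highest first.

```python
from dataclasses import dataclass

@dataclass
class Response:
    # Hypothetical mirror of the Response shown in the output above.
    repo_id: str
    temperature: float
    max_length: int
    score: float
    response: str

def top_n_responses(responses, n):
    """Return the n highest-scoring responses, best first."""
    return sorted(responses, key=lambda r: r.score, reverse=True)[:n]

def best_response(responses):
    """Return the single highest-scoring response."""
    return max(responses, key=lambda r: r.score)
```

Note the ties in the output above (e.g. all four flan-t5-xxl rows score identically); a stable sort like Python's keeps tied responses in their original order.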
CLAassistant commented 1 year ago

CLA assistant check
All committers have signed the CLA.

HashemAlsaket commented 1 year ago

@steventkrawczyk ready for review. It took a few iterations to get acclimated to the code, but it's good now.

Output looks good, too.

steventkrawczyk commented 1 year ago

Looks great! Would you be able to fill out the CLA? Then I can merge this change in later today 🚀

https://cla-assistant.io/hegelai/prompttools?pullRequest=14

HashemAlsaket commented 1 year ago

Sounds good. I think I can use a similar template to include Anthropic, Azure, etc. I'll try to write up some issues for them today.