
💡 ¡promptimize! 💡


Promptimize is a prompt engineering evaluation and testing toolkit.

It accelerates and provides structure around prompt engineering at scale with confidence, bringing some of the ideas behind test-driven development (TDD) to engineering prompts.

With promptimize, you can:

- define your prompt cases as code and associate them with eval functions
- generate prompt variations dynamically
- execute suites of prompts against the engine, temperature, and other settings of your choice
- get reports on your prompts' performance as you iterate

In essence, promptimize provides a programmatic way to execute and fine-tune your prompts and evaluation functions in Python, allowing you to iterate quickly and with confidence.

Hello world - the simplest prompt examples

More examples are available in the examples/ folder on GitHub.

# Bringing some "prompt generator" classes - note that you can derive and extend those
from promptimize.prompts import PromptCase

# Bringing in some useful eval functions that help evaluate and score responses
# eval functions have a handle on the prompt object and are expected
# to return a score between 0 and 1
from promptimize import evals

# Promptimize will scan the target folder and find all Prompt objects
# and derivatives that are in the python modules
simple_prompts = [

    # Prompting "hello there" and making sure there's "hi" or "hello"
    # somewhere in the answer
    PromptCase("hello there!", lambda x: evals.any_word(x, ["hi", "hello"])),
    PromptCase(
        "name the top 50 guitar players!", lambda x: evals.all_words(x, ["frank zappa"])
    ),
]
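
Eval functions are plain Python callables, so you aren't limited to the helpers in promptimize.evals. Below is a minimal sketch of a hand-rolled eval, assuming (as in the lambdas above) that the callable receives the response text and returns a score between 0 and 1; contains_ratio is a hypothetical helper written for illustration:

from promptimize.prompts import PromptCase

# A custom eval giving partial credit based on how many of the
# expected words show up in the response
def contains_ratio(response, expected_words):
    hits = sum(1 for word in expected_words if word in response.lower())
    return hits / len(expected_words)

custom_prompts = [
    PromptCase(
        "name the three primary colors!",
        lambda x: contains_ratio(x, ["red", "blue", "yellow"]),
    ),
]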

The CLI's run command

$ promptimize run --help
Usage: promptimize run [OPTIONS] PATH

  run some prompts

Options:
  -v, --verbose             Trigger more verbose output
  -f, --force               Force run, do not skip
  -h, --human               Human review, allowing a human to review and force
                            pass/fail each prompt case
  -r, --repair              Only re-run previously failed
  -x, --dry-run             DRY run, don't call the API
  --shuffle                 Shuffle the prompts in a random order
  -s, --style [json|yaml]   json or yaml formatting
  -m, --max-tokens INTEGER  max_tokens passed to the model
  -l, --limit INTEGER       limit how many prompt cases to run in a single
                            batch
  -t, --temperature FLOAT   temperature passed to the model
  -e, --engine TEXT         model as accepted by the openai API
  -k, --key TEXT            The keys to run
  -o, --output PATH
  -s, --silent

Let's run those examples and produce a report at ./report.yaml

$ promptimize run examples/ --output ./report.yaml
💡 ¡promptimize! 💡
# ----------------------------------------
# (1/2) [RUN] prompt: prompt-115868ef
# ----------------------------------------
key: prompt-115868ef
user_input: hello there!
prompt_hash: 115868ef
response: Hi there! How are you doing today?
execution:
  api_call_duration_ms: 883.8047981262207
  run_at: '2023-04-25T02:21:40.443077'
  score: 1.0

# ----------------------------------------
# (2/2) [RUN] prompt: prompt-5c085656
# ----------------------------------------
key: prompt-5c085656
user_input: name the top 10 guitar players!
prompt_hash: 5c085656
response: |-
  1. Jimi Hendrix
  2. Eric Clapton
  {{ ... }}
  11. Carlos Santana
weight: 2
execution:
  api_call_duration_ms: 2558.135747909546
  run_at: '2023-04-25T02:21:43.007529'
  score: 0.0

# ----------------------------------------
# Suite summary
# ----------------------------------------
suite_score: 0.3333333333333333
git_info:
  sha: 2cf28498ba0f
  branch: main
  dirty: true
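
Note how suite_score relates to the two runs above: the first case scored 1.0 with an implied weight of 1, the second scored 0.0 with a weight of 2. The numbers are consistent with a weight-averaged score (an assumption based on this report, not a statement about the exact implementation):

# scores and weights from the two prompt cases in the report above
scores = [1.0, 0.0]
weights = [1, 2]

# weighted average: (1.0 * 1 + 0.0 * 2) / (1 + 2)
suite_score = sum(s * w for s, w in zip(scores, weights)) / sum(weights)
print(suite_score)  # 0.3333333333333333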

Problem + POV

Thousands of product builders are currently trying to figure out how to bring the power of AI into the products and experiences they are building. The probabilistic (often semi-random, sometimes hectic) nature of LLMs makes this a challenge.

Prompt engineering is a huge piece of the puzzle in terms of how to do this right, especially given the complexity, risks, and drawbacks around model tuning.

We believe product builders need to tame AI through proper, rigorous prompt engineering. This makes the probabilistic nature of AI more deterministic, or at least somewhat predictable, and lets builders apply a hyperparameter-tuning mindset and approach to prompt engineering.

Any prompt-generator logic that's going to be let loose in the wild inside a product should be thoroughly tested and evaluated with "prompt cases" that cover the breadth of what people may do in a product.

In short, Promptimize allows you to test prompts at industrial scale, so that you can confidently use them in the products you are building.

Information Architecture

- Prompt case: a single prompt paired with one or more eval functions - think "test case", but for prompts (see PromptCase above)
- Eval: a function that receives the model's response and returns a score between 0 and 1
- Suite: the collection of prompt cases that promptimize discovers in the target folder and runs as a batch
- Report: the machine-readable output of a suite run (like report.yaml above), with per-prompt scores, an overall suite_score, and git info

Principles

- Configuration as code: prompt cases, suites, and evals are plain Python, so they can be generated dynamically and versioned alongside your product (note the git_info captured in reports)

Interesting features / facts

Listing out a few features you should know about that you can start using as your suites of prompts become larger / more complex:

- Prompt hashing: each prompt case gets a stable hash (see prompt_hash in the reports above), which lets promptimize skip cases it has already run; pass --force to re-run everything
- --repair re-runs only the cases that previously failed
- Each prompt case can carry a weight that scales its contribution to suite_score (see weight: 2 in the report above, and the sketch after this list)
- --human pauses for a human to review and force pass/fail each prompt case
- Reports capture git info (sha, branch, dirty) so scores can be compared across commits
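
For instance, weighting one case more heavily than another might look like the sketch below; it assumes weight is accepted as a keyword argument, which matches the weight: 2 field in the report above but isn't shown in the hello-world example:

from promptimize.prompts import PromptCase
from promptimize import evals

weighted_prompts = [
    # default weight
    PromptCase("hello there!", lambda x: evals.any_word(x, ["hi", "hello"])),
    # this case counts twice as much toward suite_score
    PromptCase(
        "name the top 10 guitar players!",
        lambda x: evals.all_words(x, ["frank zappa"]),
        weight=2,
    ),
]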

Getting started

To install the Promptimize package, use the following command:

pip install promptimize

First you'll need an OpenAI API key; let's set it as an env var:

export OPENAI_API_KEY=sk-{{ REDACTED }}

The examples executed below live in the examples/ folder of the repo.

# Clone the repo
git clone git@github.com:preset-io/promptimize.git
cd promptimize

# NOTE: the CLI is `promptimize`, but `p9e` is a shorter alias; the two can be used interchangeably
# First let's run some of the examples
p9e run ./examples

# Now the same but with verbose output
p9e run ./examples --verbose --output ./report.yaml

Langchain

How does promptimize relate to langchain?

We think langchain is amazing. Promptimize uses langchain under the hood to interact with openai, and integrates with it directly (see LangchainPromptCase, and the upcoming LangchainChainPromptCase and LangchainAgentPromptCase). That said, you don't have to use langchain: promptimize can sit on top of any Python prompt-generation logic, whether it's another library or something home-grown.

Context

Where is promptimize coming from!? I'm Maxime Beauchemin, a startup founder at Preset, working on bringing AI to BI (data exploration and visualization). At Preset, we use promptimize to generate complex SQL based on natural language and to suggest charts to users. In our own prompt engineering repo, we derive the SimpleQuery class to fit our specific use cases. This isn't my first open source project either: I'm also the creator of Apache Superset and Apache Airflow.
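
To illustrate the kind of derivation mentioned above, here's a hypothetical sketch (not Preset's actual SimpleQuery) of extending PromptCase for a text-to-SQL use case; it relies only on the PromptCase(prompt, evaluator) signature shown in the hello-world example:

from promptimize.prompts import PromptCase
from promptimize import evals

class SqlPromptCase(PromptCase):
    """Hypothetical derived case that wraps a natural language
    question into a text-to-SQL prompt before it's sent to the model."""

    def __init__(self, question, evaluator):
        prompt = f"Return only a SQL query answering this question: {question}"
        super().__init__(prompt, evaluator)

sql_prompts = [
    SqlPromptCase(
        "how many users signed up last week?",
        lambda x: evals.all_words(x.lower(), ["select", "from"]),
    ),
]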

Contribute

This project is in its super early stages as of 0.2.0, and contributions, contributors, and maintainers are highly encouraged. While it's a great time to onboard and influence the direction of the project, things are still evolving quickly. To get involved, open a GitHub issue or submit a pull request!
