Open okhat opened 8 months ago
This would be a wonderful feature. I have 3 things in mind (not sure how well these can be implemented, as I don't yet have a complete source-code-level understanding):

- A dry run at a module level: `predict(input, dry_run=True)`. We need a convention for separating the args from these settings, but it would be good if we didn't have to change settings at a global level but rather at a module level: run `predict` in `dry_run` mode, then run it without `dry_run` if we are happy. This is great in a Notebook/REPL environment.
- A stepper at the optimizer level: `BootstrapFewShot(metric=...).compile(..., step=True)`, which shows progress after each loop and asks if we want to continue. If we don't intend to continue, we can still work with the optimisations obtained up to the current cycle.
- An `explain` function that shows an estimate of the number of calls / tokens: `BootstrapFewShot(...).explain()`. Again, I am thinking from the perspective of running these in Notebooks, where we develop the programs incrementally.

I've started to work on some of these: https://github.com/stanfordnlp/dspy/pull/408
Not going in the same order, but here's what I have so far:
- A stepper/debugger at a teleprompter level: `BootstrapFewShot(metric=...).compile(..., step=True)`, which shows progress after each loop and asks if we want to continue. If we don't intend to continue, we can still work with the optimisations obtained up to the current cycle.
dspy/teleprompt/bootstrap.py

```diff
-    def compile(self, student, *, teacher=None, trainset, valset=None):
+    def compile(self, student, *, teacher=None, trainset, valset=None, step=False):
         self.trainset = trainset
         self.valset = valset

-    def _bootstrap(self, *, max_bootstraps=None):
+    def _bootstrap(self, *, max_bootstraps=None, step=False):
         max_bootstraps = max_bootstraps or self.max_bootstrapped_demos

         bootstrapped = {}
         self.name2traces = {name: [] for name in self.name2predictor}

         for round_idx in range(self.max_rounds):
             for example_idx, example in enumerate(tqdm.tqdm(self.trainset)):
                 if len(bootstrapped) >= max_bootstraps:
                     break

                 if example_idx not in bootstrapped:
                     success = self._bootstrap_one_example(example, round_idx)

                     if success:
                         bootstrapped[example_idx] = True
+                        if step:
+                            user_input = input("Continue bootstrapping? (Y/n): ")
+                            if user_input.lower() == 'n':
+                                print("Bootstrapping interrupted by user.")
+                                return  # Exit the loop and method

         print(f'Bootstrapped {len(bootstrapped)} full traces after {example_idx+1} examples in round {round_idx}.')
```
This seems pretty straightforward: just adding `step` to `.compile` and `._bootstrap` does the job. I'm nervous about the interaction between Notebooks and running this via the command line, though; requesting user input in a notebook environment has been tricky for me in the past.
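One way to soften the notebook concern might be to make the prompt injectable and to fail open when stdin is closed. A small sketch (the `confirm_continue` helper and its `reader` parameter are hypothetical, not part of the PR):

```python
def confirm_continue(prompt="Continue bootstrapping? (Y/n): ", reader=input):
    """Return True unless the user answers 'n' (case-insensitive).

    The `reader` argument is injectable so the helper can be tested, or
    swapped for a widget-based prompt in notebooks. If stdin is closed
    (EOFError), we fail open and keep going rather than crash a long run.
    """
    try:
        answer = reader(prompt)
    except EOFError:
        return True
    return answer.strip().lower() != "n"
```

A widget-based `reader` could then be passed in for notebook environments where plain `input()` misbehaves.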
- A dry run at a module level that can show us the prompts that it is planning to submit to the LM: `predict(input, dry_run=True)`. We need a convention for separating the args from these settings, but it would be good if we didn't have to change settings at a global level but rather at a module level: run `predict` in `dry_run` mode, then run it without `dry_run` if we are happy. This is great in a Notebook/REPL environment.
By changing `Predict.forward` to look for `dry_run`, we can check whether the user wants to perform a dry run:
dspy/predict/predict.py

```diff
 class Predict(Parameter):
     ...
     def forward(self, **kwargs):
         # Extract the three privileged keyword arguments.
         new_signature = kwargs.pop("new_signature", None)
         signature = kwargs.pop("signature", self.signature)
         demos = kwargs.pop("demos", self.demos)
+        dry_run = kwargs.pop("dry_run", False)
         ...
+        if dry_run:
+            # Prepare a structured output for the dry run
+            dry_run_info = {
+                'prompt': x,  # The prepared prompt
+                'config': config,  # The configuration used for generation
+                'signature': str(signature),  # The signature being used
+                'stage': self.stage,  # The current stage
+            }
+
+            # If an encoder is available, include encoded tokens in the output
+            encoder = dsp.settings.config.get('encoder', None)
+            if encoder is not None:
+                encoded_tokens = encoder.encode(x)
+                dry_run_info['encoded_tokens'] = encoded_tokens
+                dry_run_info['token_count'] = len(encoded_tokens)
+
+            # Option 1: Return the dry run information for further inspection
+            return dry_run_info
```
If they do want to perform a dry run, we use `tiktoken` to find the number of tokens and return that with the prompt:
dsp/utils/utils.py

```diff
 import tqdm
 import datetime
 import itertools
+import tiktoken

 from collections import defaultdict
 ...
+def load_encoder_for_lm(lm):
+    """
+    Load and cache the tiktoken encoder based on the LM configuration.
+
+    Args:
+        lm: The language model configuration.
+    """
+    # Load the encoder. This is a placeholder; adjust based on how you actually load the encoder.
+    encoder = tiktoken.encoding_for_model(lm)
+
+    # Cache the encoder
+    return encoder
```
Do note that this only works for OpenAI models, since we're using `tiktoken`. To make it more robust, we ought to consider the suite of models that are valid and then expand `load_encoder_for_lm` to handle each case. I think that should be done before (3).
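For illustration, a more robust loader might fall back to a crude heuristic when `tiktoken` doesn't know the model. This is only a sketch; `load_token_counter` and the 4-chars-per-token fallback are assumptions of mine, not anything in the PR:

```python
def load_token_counter(model_name):
    """Return a callable mapping text -> approximate token count.

    Tries tiktoken for OpenAI-style model names; for unknown models or
    when tiktoken is unavailable, falls back to a rough ~4 chars/token
    heuristic so the dry run still produces an estimate.
    """
    try:
        import tiktoken
        enc = tiktoken.encoding_for_model(model_name)
        return lambda text: len(enc.encode(text))
    except Exception:
        # Crude but provider-agnostic fallback; clearly only an estimate.
        return lambda text: max(1, len(text) // 4)
```

Per-provider tokenizers (e.g. a Hugging Face tokenizer for open-source models) could be added as further branches.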
Let me know your thoughts. Happy to continue working on this.
One thought on token counting: would it make sense to build this directly into the LM abstraction?
I imagine that, while there may be some overlap in tokenization from model to model, it may be cleaner to pair the tokenization directly with the provider.
I think this ties into the refactoring work being discussed in #390, and I agree that it would be good to think of a generic solution which will work when integrating other open-source models.
Is there a way to do this without adding more special features to the `Predict` class? It already has a lot of "magical" keyword arguments.
Could it maybe be done using something like:

```python
with dspy.dryrun():
    optimizer.compile(...)
```

Where `dryrun()` is implemented by replacing the `dsp.lm` with a wrapped language model that does what `dry_run` does here?
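To make that suggestion concrete, here is a minimal sketch of what such a context manager could look like. `DryRunLM` and the `settings` parameter are hypothetical stand-ins, and a real wrapper would also need to return plausible completions so the surrounding program keeps running:

```python
from contextlib import contextmanager


class DryRunLM:
    """Records prompts instead of calling the underlying LM."""

    def __init__(self, lm):
        self.lm = lm
        self.prompts = []

    def __call__(self, prompt, **kwargs):
        self.prompts.append(prompt)
        return [""]  # placeholder completion; a real version needs more care


@contextmanager
def dryrun(settings):
    """Temporarily swap settings.lm for a recording wrapper."""
    original = settings.lm
    wrapper = DryRunLM(original)
    settings.lm = wrapper
    try:
        yield wrapper
    finally:
        settings.lm = original
```

After the `with` block, `wrapper.prompts` holds every prompt the program would have sent, which could then be tokenized for a cost estimate.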
I love the explorations here. Will look more closely tomorrow most likely BUT:
the main challenge on this issue is possibly unaddressed, which is that a lot of the optimizer logic is complex and data-dependent
For example, BootstrapFewShot will stop after it labels enough training examples — it won’t try to trace them all unnecessarily, depending on the metric
Unclear how to dry-run that behavior…
If the tokens used are non-deterministic based on the optimization, could it provide value if we simply collected a broad sample of the number of optimization calls and used it to estimate a likely range, as opposed to a single estimate?
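As a rough illustration of the range idea, one could summarize call counts observed across sample runs with something like this (the function name and the mean/stdev choice are my own; percentiles might be more robust in practice):

```python
import statistics


def estimate_call_range(observed_calls, spread=1.0):
    """Given call counts observed across past runs of an optimizer,
    return (low, mean, high) using mean +/- spread * population stdev.

    A crude sketch: the point is to report a likely range rather than a
    single number when the optimizer's behavior is data-dependent.
    """
    mean = statistics.mean(observed_calls)
    sd = statistics.pstdev(observed_calls)
    return (max(0, mean - spread * sd), mean, mean + spread * sd)
```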
Great idea @KCaverly — yeah, this also brings up having a “budget”. I imagine saying: “please don’t make more than 10,000 requests and don’t cost me more than $4 on this run”
To be clear, budgets are outside the scope of dry runs, but they’re related
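A budget could plausibly be enforced with a thin LM wrapper. A sketch under assumed names (`BudgetedLM`, `BudgetExceeded`) and a flat per-request cost, which is a simplification since real cost depends on token counts and provider pricing:

```python
class BudgetExceeded(RuntimeError):
    pass


class BudgetedLM:
    """Refuses further calls once a request or cost budget is hit."""

    def __init__(self, lm, max_requests=10_000, max_cost_usd=4.0,
                 cost_per_request=0.0004):
        self.lm = lm
        self.max_requests = max_requests
        self.max_cost_usd = max_cost_usd
        self.cost_per_request = cost_per_request  # flat cost: an assumption
        self.requests = 0

    def __call__(self, prompt, **kwargs):
        projected_cost = (self.requests + 1) * self.cost_per_request
        if self.requests + 1 > self.max_requests or projected_cost > self.max_cost_usd:
            raise BudgetExceeded(f"budget reached after {self.requests} requests")
        self.requests += 1
        return self.lm(prompt, **kwargs)
```

An optimizer could catch `BudgetExceeded` and return whatever it has compiled so far, much like the `step=True` early exit above.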
This may need to be done on a per-optimizer basis, but it may be good to think of a dev UX that involves showing an upfront estimate of the number of calls / tokens in an optimization run, and possibly asking for confirmation if `dspy.settings.confirm_first` (this doesn't exist yet) is True, or something to that effect.
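A sketch of what that confirmation step might look like (the `maybe_confirm` helper is hypothetical, as is the `settings.confirm_first` flag the comment proposes):

```python
def maybe_confirm(estimated_calls, estimated_tokens, settings, reader=input):
    """Show an upfront estimate; ask for confirmation only when
    settings.confirm_first is set. Returns True if the run may proceed."""
    print(f"Estimated: ~{estimated_calls} LM calls, ~{estimated_tokens} tokens.")
    if getattr(settings, "confirm_first", False):
        return reader("Proceed? (Y/n): ").strip().lower() != "n"
    return True
```

Each optimizer would supply its own estimates, since (as noted above) the call counts are data-dependent and may only be a likely range.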