Open okhat opened 8 months ago
This would be a wonderful feature. I have 3 things in mind (not sure how well these can be implemented, as I don't yet have a complete source-code-level understanding):

- A dry run at a module level: `predict(input, dry_run=True)`. We need a convention for separating the args from these settings, but it would be good if we didn't have to change settings at a global level but rather at a module level: run `predict` in `dry_run` mode, then run it without `dry_run` if we are happy. This is great in a Notebook/REPL environment.
- A stepper at the optimizer level: `BootstrapFewShot(metric=...).compile(..., step=True)`, which shows progress after each loop and asks if we want to continue. If we don't intend to continue, we can still work with the optimisations obtained up to the current cycle.
- An `explain` function that shows an estimate of the number of calls / tokens: `BootstrapFewShot(...).explain()`. Again, I am thinking from the perspective of running these in Notebooks, where we develop the programs incrementally.

I've started to work on some of these: https://github.com/stanfordnlp/dspy/pull/408
Not going in the same order, but here's what I have so far:
- A stepper/debugger at a teleprompter level: `BootstrapFewShot(metric=...).compile(..., step=True)`, which shows progress after each loop and asks if we want to continue. If we don't intend to continue, we can still work with the optimisations obtained up to the current cycle.
dspy/teleprompt/bootstrap.py

```diff
-    def compile(self, student, *, teacher=None, trainset, valset=None):
+    def compile(self, student, *, teacher=None, trainset, valset=None, step=False):
         self.trainset = trainset
         self.valset = valset

-    def _bootstrap(self, *, max_bootstraps=None):
+    def _bootstrap(self, *, max_bootstraps=None, step=False):
         max_bootstraps = max_bootstraps or self.max_bootstrapped_demos

         bootstrapped = {}
         self.name2traces = {name: [] for name in self.name2predictor}

         for round_idx in range(self.max_rounds):
             for example_idx, example in enumerate(tqdm.tqdm(self.trainset)):
                 if len(bootstrapped) >= max_bootstraps:
                     break

                 if example_idx not in bootstrapped:
                     success = self._bootstrap_one_example(example, round_idx)

                     if success:
                         bootstrapped[example_idx] = True
+                        if step:
+                            user_input = input("Continue bootstrapping? (Y/n): ")
+                            if user_input.lower() == 'n':
+                                print("Bootstrapping interrupted by user.")
+                                return  # Exit the loop and method

         print(f'Bootstrapped {len(bootstrapped)} full traces after {example_idx+1} examples in round {round_idx}.')
```
This seems pretty straightforward: just adding `step` to `.compile` and `._bootstrap` does the job. I'm nervous about the interaction between Notebooks and running this via the command line, though; requesting user input in a notebook environment has been tricky for me in the past.
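One way to soften the notebook concern might be to make the prompt injectable and to fail open when stdin is closed. A small sketch (the `confirm_continue` helper and its `reader` parameter are hypothetical, not part of the PR):

```python
def confirm_continue(prompt="Continue bootstrapping? (Y/n): ", reader=input):
    """Return True unless the user answers 'n' (case-insensitive).

    The `reader` argument is injectable so the helper can be tested, or
    swapped for a widget-based prompt in notebooks. If stdin is closed
    (EOFError), we fail open and keep going rather than crash a long run.
    """
    try:
        answer = reader(prompt)
    except EOFError:
        return True
    return answer.strip().lower() != "n"
```

A widget-based `reader` could then be passed in for notebook environments where plain `input()` misbehaves.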
- A dry run at a module level that can show us the prompts that it is planning to submit to the LM: `predict(input, dry_run=True)`. We need a convention for separating the args from these settings, but it would be good if we didn't have to change settings at a global level but rather at a module level: run `predict` in `dry_run` mode, then run it without `dry_run` if we are happy. This is great in a Notebook/REPL environment.
By changing `Predict.forward` to look for `dry_run`, we can check whether the user wants to perform a dry run:
dspy/predict/predict.py

```diff
 class Predict(Parameter):
     ...
     def forward(self, **kwargs):
         # Extract the three privileged keyword arguments.
         new_signature = kwargs.pop("new_signature", None)
         signature = kwargs.pop("signature", self.signature)
         demos = kwargs.pop("demos", self.demos)
+        dry_run = kwargs.pop("dry_run", False)
         ...
+        if dry_run:
+            # Prepare a structured output for the dry run
+            dry_run_info = {
+                'prompt': x,  # The prepared prompt
+                'config': config,  # The configuration used for generation
+                'signature': str(signature),  # The signature being used
+                'stage': self.stage,  # The current stage
+            }
+
+            # If an encoder is available, include encoded tokens in the output
+            encoder = dsp.settings.config.get('encoder', None)
+            if encoder is not None:
+                encoded_tokens = encoder.encode(x)
+                dry_run_info['encoded_tokens'] = encoded_tokens
+                dry_run_info['token_count'] = len(encoded_tokens)
+
+            # Option 1: Return the dry run information for further inspection
+            return dry_run_info
```
If they do want to perform a dry run, we use `tiktoken` to find the number of tokens and return that with the prompt:
dsp/utils/utils.py

```diff
 import tqdm
 import datetime
 import itertools
+import tiktoken

 from collections import defaultdict
 ...
+def load_encoder_for_lm(lm):
+    """
+    Load and cache the tiktoken encoder based on the LM configuration.
+
+    Args:
+        lm: The language model configuration.
+    """
+    # Load the encoder. This is a placeholder; adjust based on how you actually load the encoder.
+    encoder = tiktoken.encoding_for_model(lm)
+
+    # Cache the encoder
+    return encoder
```
Do note that this only works for OpenAI models, since we're using `tiktoken`. To make it more robust, we ought to consider the suite of models that are valid and then expand `load_encoder_for_lm` to handle each case. I think that should be done before (3).
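For illustration, a more robust loader might fall back to a crude heuristic when `tiktoken` doesn't know the model. This is only a sketch; `load_token_counter` and the 4-chars-per-token fallback are assumptions of mine, not anything in the PR:

```python
def load_token_counter(model_name):
    """Return a callable mapping text -> approximate token count.

    Tries tiktoken for OpenAI-style model names; for unknown models or
    when tiktoken is unavailable, falls back to a rough ~4 chars/token
    heuristic so the dry run still produces an estimate.
    """
    try:
        import tiktoken
        enc = tiktoken.encoding_for_model(model_name)
        return lambda text: len(enc.encode(text))
    except Exception:
        # Crude but provider-agnostic fallback; clearly only an estimate.
        return lambda text: max(1, len(text) // 4)
```

Per-provider tokenizers (e.g. a Hugging Face tokenizer for open-source models) could be added as further branches.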
Let me know your thoughts. Happy to continue working on this.
One thought on token counting: would it make sense to build this directly into the LM abstraction?
I imagine that, while there may be some overlap in tokenization from model to model, it may be cleaner to pair the tokenization directly with the provider.
I think this ties into the refactoring work being discussed in #390, and I agree that it would be good to think of a generic solution which will work when integrating other open-source models.
Is there a way to do this without adding more special features to the `Predict` class? It already has a lot of "magical" keyword arguments.
Could it maybe be done using something like:

```python
with dspy.dryrun():
    optimizer.compile(...)
```

Where `dryrun()` is implemented by replacing the `dsp.lm` with a wrapped language model that does what `dry_run` does here?
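To make that suggestion concrete, here is a minimal sketch of what such a context manager could look like. `DryRunLM` and the `settings` parameter are hypothetical stand-ins, and a real wrapper would also need to return plausible completions so the surrounding program keeps running:

```python
from contextlib import contextmanager


class DryRunLM:
    """Records prompts instead of calling the underlying LM."""

    def __init__(self, lm):
        self.lm = lm
        self.prompts = []

    def __call__(self, prompt, **kwargs):
        self.prompts.append(prompt)
        return [""]  # placeholder completion; a real version needs more care


@contextmanager
def dryrun(settings):
    """Temporarily swap settings.lm for a recording wrapper."""
    original = settings.lm
    wrapper = DryRunLM(original)
    settings.lm = wrapper
    try:
        yield wrapper
    finally:
        settings.lm = original
```

After the `with` block, `wrapper.prompts` holds every prompt the program would have sent, which could then be tokenized for a cost estimate.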
I love the explorations here. Will look more closely tomorrow most likely BUT:
the main challenge on this issue is possibly unaddressed, which is that a lot of the optimizer logic is complex and data-dependent
For example, BootstrapFewShot will stop after it labels enough training examples — it won’t try to trace them all unnecessarily, depending on the metric
Unclear how to dry-run that behavior…
If the tokens used are non-deterministic based on the optimization, could it provide value if we simply collected a broad sample of the number of optimization calls and used it to estimate a likely range, as opposed to a single estimate?
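As a rough illustration of the range idea, one could summarize call counts observed across sample runs with something like this (the function name and the mean/stdev choice are my own; percentiles might be more robust in practice):

```python
import statistics


def estimate_call_range(observed_calls, spread=1.0):
    """Given call counts observed across past runs of an optimizer,
    return (low, mean, high) using mean +/- spread * population stdev.

    A crude sketch: the point is to report a likely range rather than a
    single number when the optimizer's behavior is data-dependent.
    """
    mean = statistics.mean(observed_calls)
    sd = statistics.pstdev(observed_calls)
    return (max(0, mean - spread * sd), mean, mean + spread * sd)
```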
Great idea @KCaverly — yeah, this also brings up having a “budget”. I imagine saying: “please don’t make more than 10,000 requests and don’t cost me more than $4 on this run”
To be clear, budgets are outside the scope of dry runs, but they’re related
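A budget could plausibly be enforced with a thin LM wrapper. A sketch under assumed names (`BudgetedLM`, `BudgetExceeded`) and a flat per-request cost, which is a simplification since real cost depends on token counts and provider pricing:

```python
class BudgetExceeded(RuntimeError):
    pass


class BudgetedLM:
    """Refuses further calls once a request or cost budget is hit."""

    def __init__(self, lm, max_requests=10_000, max_cost_usd=4.0,
                 cost_per_request=0.0004):
        self.lm = lm
        self.max_requests = max_requests
        self.max_cost_usd = max_cost_usd
        self.cost_per_request = cost_per_request  # flat cost: an assumption
        self.requests = 0

    def __call__(self, prompt, **kwargs):
        projected_cost = (self.requests + 1) * self.cost_per_request
        if self.requests + 1 > self.max_requests or projected_cost > self.max_cost_usd:
            raise BudgetExceeded(f"budget reached after {self.requests} requests")
        self.requests += 1
        return self.lm(prompt, **kwargs)
```

An optimizer could catch `BudgetExceeded` and return whatever it has compiled so far, much like the `step=True` early exit above.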
This may need to be done on a per-optimizer basis, but it may be good to think of a dev UX that involves showing an upfront estimate of the number of calls / tokens in an optimization run, and possibly asking for confirmation if `dspy.settings.confirm_first` (this doesn't exist yet) is True, or something to that effect.
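A sketch of what that confirmation step might look like (the `maybe_confirm` helper is hypothetical, as is the `settings.confirm_first` flag the comment proposes):

```python
def maybe_confirm(estimated_calls, estimated_tokens, settings, reader=input):
    """Show an upfront estimate; ask for confirmation only when
    settings.confirm_first is set. Returns True if the run may proceed."""
    print(f"Estimated: ~{estimated_calls} LM calls, ~{estimated_tokens} tokens.")
    if getattr(settings, "confirm_first", False):
        return reader("Proceed? (Y/n): ").strip().lower() != "n"
    return True
```

Each optimizer would supply its own estimates, since (as noted above) the call counts are data-dependent and may only be a likely range.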