Closed dbczumar closed 1 month ago
Thanks @dbczumar ! This is actually an interesting problem.
The way to debug this is to do dspy.inspect_history(n=5)
at the end to view the last five prompts.
When you do that, you see:
System message:
Your input fields are:
1. `text` (str): Text to tokenize
Your output fields are:
1. `tokens` (list[str]): A list of tokens extracted from the text
All interactions will be structured in the following way, with the appropriate values filled in.
[[ ## text ## ]]
{text}
[[ ## tokens ## ]]
{tokens}
[[ ## completed ## ]]
In adhering to this structure, your objective is:
Given the fields `text`, produce the fields `tokens`.
User message:
[[ ## text ## ]]
Germany 's representative to the European Union 's veterinary committee Werner Zwingmann said on Wednesday consumers should buy sheepmeat from countries other than Britain until the scientific advice was clearer .
Respond with the corresponding output fields, starting with the field `tokens`, and then ending with the marker for `completed`.
Assistant message:
[[ ## tokens ## ]]
[1] «Germany»
[2] «'s»
...
[30] «clearer»
[31] «.»
[[ ## completed ## ]]
User message:
[[ ## text ## ]]
The European Commission said on Thursday it disagreed with German advice to consumers to shun British lamb until scientists determine whether mad cow disease can be transmitted to sheep .
Respond with the corresponding output fields, starting with the field `tokens`, and then ending with the marker for `completed`.
Response:
[[ ## tokens ## ]]
[1] «The»
[2] «European»
...
[30] «.»
[[ ## completed ## ]]
In zero-shot mode, this model understands the format of the task enough to output a list.
When you're about to bootstrap, it sees that you do have pre-labeled text/tokens pairs, so it formats them before bootstrapping as "labeled" demos, with the goal of producing (good) "bootstrapped" demos.
Now, in 2.5, lists are formatted by default into this numbered list. This is handy for input fields. I guess for output fields, we may want to raise an exception if you feed us any non-str, or maybe we call str(.) for you. But I'm not sure that's better either.
Maybe the right thing to do is to look at whether the user supplied a type annotation for the output field. If they did, we should format their labels in a way that would lead to correct parsing. That's the right thing here, yup.
To illustrate the scope of the problem, this is a temporary fix that does work:
tokenizer_train_set = [
dspy.Example(
text=get_input_text(data_row),
tokens=str(data_row["tokens"]) # note the cast here
).with_inputs("text")
for data_row in train_data
]
def validate_tokens(expected_tokens, predicted_tokens, trace=None):
import ast
return ast.literal_eval(expected_tokens.tokens) == predicted_tokens.tokens
An even shorter, independent fix (that doesn't require the stuff above) is to pass max_labeled_demos=0
to the optimizer.
But of course let's figure out how to make this a general fix.
Aside: Using BootstrapFewShot with Predict makes for a great bug report. But in general using BootstrapFS with Predict (i.e., without any intermediate steps like a chain of thought) and with an exact match metric (==
) is not very powerful, since it can at most just re-build pre-existing examples where the model does the expected thing.
So ultimately there are three tasks here.
In the very short term, we should be more careful with formatting output fields, especially (only?) if the user supplied a type annotation for them. We should format them in a way that would parse, if they are already objects of the right type. If they're strings, we should check that they would parse.
In the short term, we should think about the right abstractions for formatting fields. I think it will heavily depend on the type annotations (whether the user supplied them, and what the type is), on whether the field is an input or an output, and on whether the value has duck-typed methods like .format()
or accepts str(.)
. Users can bypass all of this by pre-formatting and giving us a string, but in practice they will not do that very often. They would expect this all to "just work" as long as their types are reasonable. I think philosophically we should be about as dynamic as Python (or, if we want to be stricter, Pydantic) given the library's roots, but no more.
Less critically, in the longer term, we should re-consider how BootstrapFS uses labeled demos under the hood. It's controlled by a flag (max_labeled_demos), but it could be more versatile, e.g. if it leads to errors it can automatically avoid the examples?
Script
Logs output