Using `ThreadPoolExecutor` to parallelize dspy calls breaks internal config management, as does threaded `dspy.Evaluate`.
Specifically, doing this:
```python
import dspy
from concurrent.futures import ThreadPoolExecutor
from tqdm import tqdm

class QuestionAnswer(dspy.Signature):
    question: str = dspy.InputField(description="The question")
    answer: int = dspy.OutputField(description="The answer to the question")

solver = dspy.ChainOfThought(QuestionAnswer)

with ThreadPoolExecutor(max_workers=10) as executor:
    results = list(tqdm(executor.map(lambda x: solver(**x.inputs()), trainset),
                        total=len(trainset)))
```
breaks
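One plausible mechanism (an assumption about the internals, not confirmed from dspy's source here) is that the configuration lives in thread-local storage, so settings written on the main thread are simply invisible to pool workers. A minimal stdlib sketch, with `configure`/`current_lm` as hypothetical stand-ins for the library's config API, illustrating that behavior:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Simulate a library whose configure() writes to thread-local storage.
_settings = threading.local()

def configure(lm):
    _settings.lm = lm

def current_lm():
    # Worker threads never ran configure(), so they see only the default.
    return getattr(_settings, "lm", "default")

configure("gpt4o")       # visible only on the main thread
print(current_lm())      # -> gpt4o

with ThreadPoolExecutor(max_workers=4) as ex:
    seen = set(ex.map(lambda _: current_lm(), range(8)))
print(seen)              # -> {'default'}: the override never reached the workers
```

The same pattern would explain why a user-created `ThreadPoolExecutor` and a threaded `dspy.Evaluate` both ignore later `dspy.configure` calls.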
```python
# This breaks: both models have the same score. The threads all run on gpt4o_mini,
# which was the last configured model before the ThreadPoolExecutor was created.
evaluator = dspy.Evaluate(devset=devset, metric=is_correct, num_threads=10, display_progress=True)

dspy.configure(lm=gpt4o)
evaluator(solver)

dspy.configure(lm=gpt4o_mini)
evaluator(solver)

# >>>>> gpt4o
# Average Metric: 47 / 50 (94.0): 100%|██████████| 50/50 [00:00<00:00, 1608.90it/s]
# 2024/11/06 11:49:23 INFO dspy.evaluate.evaluate: Average Metric: 47 / 50 (94.0%)
# >>>>> gpt4o_mini
# Average Metric: 47 / 50 (94.0): 100%|██████████| 50/50 [00:00<00:00, 1795.91it/s]
# 2024/11/06 11:49:23 INFO dspy.evaluate.evaluate: Average Metric: 47 / 50 (94.0%)
```
That is, all subsequent calls to `dspy.configure` are ignored by the threaded evaluator.
Turning off threading on `dspy.Evaluate` works properly:
```python
# Turning off threading works as expected: both models have different scores again.
evaluator = dspy.Evaluate(devset=devset, metric=is_correct, num_threads=1, display_progress=True)

dspy.configure(lm=gpt4o)
evaluator(solver)

dspy.configure(lm=gpt4o_mini)
evaluator(solver)

# ###### Output ######
# >>>>> gpt4o
# Average Metric: 48 / 50 (96.0): 100%|██████████| 50/50 [00:00<00:00, 892.83it/s]
# 2024/11/06 11:49:24 INFO dspy.evaluate.evaluate: Average Metric: 48 / 50 (96.0%)
# >>>>> gpt4o_mini
# Average Metric: 47 / 50 (94.0): 100%|██████████| 50/50 [00:00<00:00, 976.47it/s]
# 2024/11/06 11:49:24 INFO dspy.evaluate.evaluate: Average Metric: 47 / 50 (94.0%)
```
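Beyond disabling threading, a generic way to push per-run config into pool workers is `ThreadPoolExecutor`'s `initializer` hook, which runs once in each worker thread before it takes any jobs. A stdlib sketch of the pattern (again with hypothetical `configure`/`current_lm` helpers standing in for the library's config API, not dspy calls):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

_settings = threading.local()  # stand-in for a library's thread-local config

def configure(lm):
    _settings.lm = lm

def current_lm():
    return getattr(_settings, "lm", "default")

def run_with(lm, jobs):
    # initializer/initargs run configure(lm) once per worker thread,
    # so every worker's thread-local copy matches the requested model.
    with ThreadPoolExecutor(max_workers=4,
                            initializer=configure, initargs=(lm,)) as ex:
        return list(ex.map(lambda _: current_lm(), jobs))

print(set(run_with("gpt4o", range(8))))       # -> {'gpt4o'}
print(set(run_with("gpt4o_mini", range(8))))  # -> {'gpt4o_mini'}
```

Whether this can be wired into dspy depends on where its config actually lives; the sketch only shows the initializer mechanism itself.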
This is unfortunately known, and the best way to address it is:

1. As of the time of writing: run everything inside an `Evaluate` call first with a dummy metric (`lambda x, y, z=None, w=None: 1`) and return your outputs using the kwarg in `Evaluate`.
2. Once #1690 merges, use that for parallel execution.
We should catch/warn in this situation @okhat @krypticmouse
See this notebook for full repro.