stanfordnlp / dspy

DSPy: The framework for programming—not prompting—language models
https://dspy.ai
MIT License

ThreadPoolExecutor breaks dspy.Evaluate config in parallel execution #1766

Open glesperance opened 2 weeks ago

glesperance commented 2 weeks ago

Using ThreadPoolExecutor to parallelize dspy calls breaks internal config management for a threaded dspy.Evaluate.

Specifically, running this:

import dspy
from concurrent.futures import ThreadPoolExecutor
from tqdm import tqdm

class QuestionAnswer(dspy.Signature):
    question: str = dspy.InputField(description="The question")
    answer: int = dspy.OutputField(description="The answer to the question")

solver = dspy.ChainOfThought(QuestionAnswer)

# Parallelize the solver over trainset with a user-managed thread pool.
with ThreadPoolExecutor(max_workers=10) as executor:
    results = list(tqdm(executor.map(lambda x: solver(**x.inputs()), trainset), total=len(trainset)))

breaks the subsequent threaded evaluation:

# This breaks: both models get the same score. The worker threads all run on gpt4o_mini,
# which was the last model configured before the ThreadPoolExecutor was created.
evaluator = dspy.Evaluate(devset=devset, metric=is_correct, num_threads=10, display_progress=True)

dspy.configure(lm=gpt4o)
evaluator(solver)

dspy.configure(lm=gpt4o_mini)
evaluator(solver)

# ###### Output ######
# >>>>> gpt4o
# Average Metric: 47 / 50  (94.0): 100%|██████████| 50/50 [00:00<00:00, 1608.90it/s]
# 2024/11/06 11:49:23 INFO dspy.evaluate.evaluate: Average Metric: 47 / 50 (94.0%)
# >>>>> gpt4o_mini
# Average Metric: 47 / 50  (94.0): 100%|██████████| 50/50 [00:00<00:00, 1795.91it/s]
# 2024/11/06 11:49:23 INFO dspy.evaluate.evaluate: Average Metric: 47 / 50 (94.0%)

That is, all subsequent calls to dspy.configure are ignored by the threaded evaluator.
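One way to sidestep this, sketched below, is to scope the LM inside each worker rather than relying on the global configuration. This is a minimal sketch, not part of the original report, and it assumes dspy.context accepts the same lm keyword as dspy.configure and that solver, trainset, and a gpt4o dspy.LM instance are defined as above.

# Hedged sketch (not from the original report): scope the LM per call with dspy.context
# inside each worker thread instead of relying on dspy.configure from the main thread.
from concurrent.futures import ThreadPoolExecutor
from tqdm import tqdm

import dspy

def run_one(example):
    # dspy.context applies the LM only for the duration of this call, in this thread.
    with dspy.context(lm=gpt4o):
        return solver(**example.inputs())

with ThreadPoolExecutor(max_workers=10) as executor:
    results = list(tqdm(executor.map(run_one, trainset), total=len(trainset)))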

Turning off threading in dspy.Evaluate works as expected:

# Turning off threading works as expected: both models have different scores again.

evaluator = dspy.Evaluate(devset=devset, metric=is_correct, num_threads=1, display_progress=True)

dspy.configure(lm=gpt4o)
evaluator(solver)

dspy.configure(lm=gpt4o_mini)
evaluator(solver)

# ###### Output ######
# >>>>> gpt4o
# Average Metric: 48 / 50  (96.0): 100%|██████████| 50/50 [00:00<00:00, 892.83it/s] 
# 2024/11/06 11:49:24 INFO dspy.evaluate.evaluate: Average Metric: 48 / 50 (96.0%)

# >>>>> gpt4o_mini
# Average Metric: 47 / 50  (94.0): 100%|██████████| 50/50 [00:00<00:00, 976.47it/s] 
# 2024/11/06 11:49:24 INFO dspy.evaluate.evaluate: Average Metric: 47 / 50 (94.0%)

See this notebook for full repro.
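For reference, a minimal setup that the snippets above assume might look like the following; the model identifiers and the is_correct metric are illustrative guesses, not taken from the notebook.

import dspy

# Illustrative setup only; the actual notebook may differ.
gpt4o = dspy.LM("openai/gpt-4o")
gpt4o_mini = dspy.LM("openai/gpt-4o-mini")

def is_correct(example, prediction, trace=None):
    # Standard dspy metric signature: compare the predicted answer to the gold label.
    return prediction.answer == example.answer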

isaacbmiller commented 2 weeks ago

This is unfortunately a known issue, and the best ways to address it are:

  1. As of the time of writing: run everything inside an Evaluate call first, with a dummy metric (lambda x, y, z=None, w=None: 1), and return your outputs using the kwarg in Evaluate (see the sketch after this list).
  2. Once #1690 merges, use that for parallel execution.
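A rough sketch of option 1, assuming return_outputs is the Evaluate kwarg referred to above:

# Hedged sketch of option 1: let Evaluate be the parallel runner, score with a dummy metric,
# and collect the raw outputs. Assumes return_outputs is the kwarg meant above.
evaluator = dspy.Evaluate(
    devset=trainset,
    metric=lambda x, y, z=None, w=None: 1,  # dummy metric: every example counts as correct
    num_threads=10,
    display_progress=True,
    return_outputs=True,
)

dspy.configure(lm=gpt4o)
score, outputs = evaluator(solver)  # assumed return shape: score plus (example, prediction, score) triples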

We should catch/warn in this situation @okhat @krypticmouse