stanfordnlp / dspy

DSPy: The framework for programming—not prompting—language models
https://dspy.ai
MIT License

ThreadPoolExecutor breaks dspy.Evaluate config in parallel execution #1766

Open glesperance opened 2 weeks ago

glesperance commented 2 weeks ago

Using ThreadPoolExecutor to parallelize dspy calls breaks internal config management for a threaded dspy.Evaluate.

Specifically, running this:

import dspy
from concurrent.futures import ThreadPoolExecutor
from tqdm import tqdm

class QuestionAnswer(dspy.Signature):
    question: str = dspy.InputField(description="The question")
    answer: int = dspy.OutputField(description="The answer to the question")

solver = dspy.ChainOfThought(QuestionAnswer)

# Parallelize the solver over trainset with a user-managed thread pool.
with ThreadPoolExecutor(max_workers=10) as executor:
    results = list(tqdm(executor.map(lambda x: solver(**x.inputs()), trainset), total=len(trainset)))

breaks the subsequent threaded evaluation:

# This breaks: both models get the same score. The worker threads all run on gpt4o_mini,
# which was the last model configured before the ThreadPoolExecutor was created.
evaluator = dspy.Evaluate(devset=devset, metric=is_correct, num_threads=10, display_progress=True)

dspy.configure(lm=gpt4o)
evaluator(solver)

dspy.configure(lm=gpt4o_mini)
evaluator(solver)

# ###### Output ######
# >>>>> gpt4o
# Average Metric: 47 / 50  (94.0): 100%|██████████| 50/50 [00:00<00:00, 1608.90it/s]
# 2024/11/06 11:49:23 INFO dspy.evaluate.evaluate: Average Metric: 47 / 50 (94.0%)
# >>>>> gpt4o_mini
# Average Metric: 47 / 50  (94.0): 100%|██████████| 50/50 [00:00<00:00, 1795.91it/s]
# 2024/11/06 11:49:23 INFO dspy.evaluate.evaluate: Average Metric: 47 / 50 (94.0%)

That is, all subsequent calls to dspy.configure are ignored by the threaded evaluator.
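One way to sidestep this, sketched below, is to scope the LM inside each worker rather than relying on the global configuration. This is a minimal sketch, not part of the original report, and it assumes dspy.context accepts the same lm keyword as dspy.configure and that solver, trainset, and a gpt4o dspy.LM instance are defined as above.

# Hedged sketch (not from the original report): scope the LM per call with dspy.context
# inside each worker thread instead of relying on dspy.configure from the main thread.
from concurrent.futures import ThreadPoolExecutor
from tqdm import tqdm

import dspy

def run_one(example):
    # dspy.context applies the LM only for the duration of this call, in this thread.
    with dspy.context(lm=gpt4o):
        return solver(**example.inputs())

with ThreadPoolExecutor(max_workers=10) as executor:
    results = list(tqdm(executor.map(run_one, trainset), total=len(trainset)))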

Turning off threading in dspy.Evaluate works as expected:

# Turning off threading works as expected: both models have different scores again.

evaluator = dspy.Evaluate(devset=devset, metric=is_correct, num_threads=1, display_progress=True)

dspy.configure(lm=gpt4o)
evaluator(solver)

dspy.configure(lm=gpt4o_mini)
evaluator(solver)

# ###### Output ######
# >>>>> gpt4o
# Average Metric: 48 / 50  (96.0): 100%|██████████| 50/50 [00:00<00:00, 892.83it/s] 
# 2024/11/06 11:49:24 INFO dspy.evaluate.evaluate: Average Metric: 48 / 50 (96.0%)

# >>>>> gpt4o_mini
# Average Metric: 47 / 50  (94.0): 100%|██████████| 50/50 [00:00<00:00, 976.47it/s] 
# 2024/11/06 11:49:24 INFO dspy.evaluate.evaluate: Average Metric: 47 / 50 (94.0%)

See this notebook for full repro.
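For reference, a minimal setup that the snippets above assume might look like the following; the model identifiers and the is_correct metric are illustrative guesses, not taken from the notebook.

import dspy

# Illustrative setup only; the actual notebook may differ.
gpt4o = dspy.LM("openai/gpt-4o")
gpt4o_mini = dspy.LM("openai/gpt-4o-mini")

def is_correct(example, prediction, trace=None):
    # Standard dspy metric signature: compare the predicted answer to the gold label.
    return prediction.answer == example.answer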

isaacbmiller commented 2 weeks ago

This is unfortunately a known issue, and the best ways to address it are:

  1. As of the time of writing: run everything inside an Evaluate call first, with a dummy metric (lambda x, y, z=None, w=None: 1), and return your outputs using the kwarg in Evaluate (see the sketch after this list).
  2. Once #1690 merges, use that for parallel execution.
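A rough sketch of option 1, assuming return_outputs is the Evaluate kwarg referred to above:

# Hedged sketch of option 1: let Evaluate be the parallel runner, score with a dummy metric,
# and collect the raw outputs. Assumes return_outputs is the kwarg meant above.
evaluator = dspy.Evaluate(
    devset=trainset,
    metric=lambda x, y, z=None, w=None: 1,  # dummy metric: every example counts as correct
    num_threads=10,
    display_progress=True,
    return_outputs=True,
)

dspy.configure(lm=gpt4o)
score, outputs = evaluator(solver)  # assumed return shape: score plus (example, prediction, score) triples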

We should catch/warn in this situation @okhat @krypticmouse