stanfordnlp / dspy

DSPy: The framework for programming—not prompting—language models
https://dspy.ai
MIT License

Bootstrap KNN with Random Search #1825

Open CyrusOfEden opened 6 days ago

CyrusOfEden commented 6 days ago

Current KNNFewShot

Compile Time: Vectorize trainset examples

Test Time:

  1. When encountering a new example (which is the typical case)
  2. Run BootstrapFewShot to test out different static few-shot demos for each predictor
  3. Find the best program of static few-shot examples

New BootstrapKNN

Compile Time: Run BootstrapFewShot to collect traces and generate end-to-end demo sets

Test Time: When a predictor is called, the k nearest neighbors of the input are used as few-shot demos, drawn from the augmented demos for that predictor
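
A minimal sketch of the test-time lookup described above, not the proposed implementation. It assumes each predictor keeps a pool of augmented demos with precomputed embedding vectors. `Demo`, `cosine`, and `knn_demos` are hypothetical names, and the toy 2-d vectors stand in for whatever vectorizer is actually used.

```python
import math
from dataclasses import dataclass


@dataclass
class Demo:
    """Hypothetical container: one augmented demo plus its embedding."""
    text: str
    vector: list


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def knn_demos(query, pool, k):
    """Return the k demos in `pool` most similar to `query`."""
    ranked = sorted(pool, key=lambda d: cosine(query, d.vector), reverse=True)
    return ranked[:k]


# Toy pool of augmented demos for a single predictor (made-up data).
pool = [
    Demo("demo about cats", [1.0, 0.0]),
    Demo("demo about dogs", [0.9, 0.1]),
    Demo("demo about taxes", [0.0, 1.0]),
]
nearest = knn_demos([1.0, 0.05], pool, k=2)
```

Here `nearest` holds the two demos closest to the query vector, which would then be passed to the predictor as its few-shot demos.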

New BootstrapKNNWithRandomSearch

Compile Time:

  1. Try zero-shot
  2. Try LabeledFewShot
  3. Try BootstrapKNN
  4. Try BootstrapKNN with a random number of static demos, fixed from the augmented demos for that predictor; that predictor's demos are then shuffled(static demos + knn demos)
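
The four compile-time steps above could be enumerated roughly as follows. This is a sketch under assumptions: `candidate_plans` and the strategy labels are hypothetical, and the range used for the random number of static demos (1 to `max_labeled_demos`) is a guess, since the issue does not state it.

```python
import random


def candidate_plans(num_candidates, max_labeled_demos, seed=0):
    """Yield (strategy, num_static_demos) pairs in the order listed above."""
    rng = random.Random(seed)
    yield ("zero_shot", 0)
    yield ("labeled_few_shot", 0)
    yield ("bootstrap_knn", 0)  # KNN demos only, no static demos
    for _ in range(num_candidates):
        # Assumed range: at least one static demo, at most the whole budget.
        yield ("bootstrap_knn_static", rng.randint(1, max_labeled_demos))


plans = list(candidate_plans(num_candidates=16, max_labeled_demos=16))
```

Three fixed plans plus 16 random ones gives 19 plans in total. Each plan would then be compiled and scored, keeping the best-scoring program.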

Test Time: When a predictor is called, the k nearest neighbors of the input are used as few-shot demos, drawn from the augmented demos for that predictor. If num_static_demos ≠ 0, that predictor's demos are shuffled(static demos + knn demos), with len(static demos + knn demos) == max_labeled_demos
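
A rough sketch of the test-time demo assembly described above. `build_demos` is a hypothetical helper, not part of DSPy. Skipping neighbors that are already static demos is an assumption about how duplicates might be handled.

```python
import random


def build_demos(static_demos, knn_ranked, max_labeled_demos, rng):
    """Combine fixed static demos with nearest-neighbor demos, then shuffle.

    `knn_ranked` is assumed to be sorted from most to least similar.
    """
    budget = max(max_labeled_demos - len(static_demos), 0)
    # Assumption: skip neighbors that are already present as static demos.
    knn_part = [d for d in knn_ranked if d not in static_demos][:budget]
    demos = list(static_demos) + knn_part
    if static_demos:  # num_static_demos != 0: shuffle the combined list
        rng.shuffle(demos)
    return demos


static = [f"static-{i}" for i in range(8)]
ranked = [f"knn-{i}" for i in range(20)]
demos = build_demos(static, ranked, max_labeled_demos=16, rng=random.Random(0))
```

With 8 static demos and `max_labeled_demos=16`, the remaining 8 slots are filled by the nearest neighbors, so the combined list has exactly 16 entries.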

CyrusOfEden commented 3 days ago

Some preliminary results on the hover_retrieve_discrete optimizer test, with the following config:

config = {
    "max_bootstrapped_demos": 64,
    "max_labeled_demos": 16,
    "max_errors": 10,
    "num_candidate_programs": 16,
}

BootstrapKNNWithRandomSearch: 47% (test)

The best program had 8 static few-shot demos and 8 KNN-selected few-shot demos

Average train score: 37.316

Scores so far: [30.0, 30.0, 37.5, 37.0, 39.5, 38.0, 38.5, 37.0, 39.0, 39.0, 39.5, 37.0, 38.0, 41.0, 38.0, 38.0, 36.5, 38.0, 37.5]
Best score so far: 41.0
19 candidate programs found.
Optimized train score...
Average Metric: 82.00 / 200 (41.0%): 100%|██████████| 200/200 [00:23<00:00,  8.38it/s]
2024/11/22 20:48:42 INFO dspy.evaluate.evaluate: Average Metric: 82 / 200 (41.0%)

Optimized dev score...
Average Metric: 46.00 / 100 (46.0%): 100%|██████████| 100/100 [02:05<00:00,  1.26s/it]
2024/11/22 20:50:48 INFO dspy.evaluate.evaluate: Average Metric: 46 / 100 (46.0%)

Optimized test score...
Average Metric: 94.00 / 200 (47.0%): 100%|██████████| 200/200 [03:59<00:00,  1.20s/it]
2024/11/22 20:54:47 INFO dspy.evaluate.evaluate: Average Metric: 94 / 200 (47.0%)

BootstrapFewShotWithRandomSearch: 43% (test)

Average train score: 36.078947368421055

Scores so far: [30.0, 30.0, 34.5, 41.0, 33.0, 38.5, 37.5, 36.0, 38.0, 35.0, 36.0, 35.5, 38.0, 36.0, 36.5, 41.0, 35.5, 39.0, 34.5]
Best score so far: 41.0
19 candidate programs found.
Optimized train score...
Average Metric: 82.00 / 200 (41.0%): 100%|██████████| 200/200 [00:00<00:00, 677.59it/s]
2024/11/22 22:47:40 INFO dspy.evaluate.evaluate: Average Metric: 82 / 200 (41.0%)

Optimized dev score...
Average Metric: 45.00 / 100 (45.0%): 100%|██████████| 100/100 [02:14<00:00,  1.35s/it]
2024/11/22 22:49:55 INFO dspy.evaluate.evaluate: Average Metric: 45 / 100 (45.0%)

Optimized test score...
Average Metric: 86.00 / 200 (43.0%): 100%|██████████| 200/200 [04:29<00:00,  1.35s/it]
2024/11/22 22:54:24 INFO dspy.evaluate.evaluate: Average Metric: 86 / 200 (43.0%)
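
As a quick arithmetic check, the reported averages and the test-set gap can be recomputed from the numbers in the logs above. The variable names are illustrative only; the score lists are copied verbatim from the two "Scores so far" lines.

```python
from statistics import mean

# Candidate scores copied from the two "Scores so far" log lines.
knn_scores = [30.0, 30.0, 37.5, 37.0, 39.5, 38.0, 38.5, 37.0, 39.0, 39.0,
              39.5, 37.0, 38.0, 41.0, 38.0, 38.0, 36.5, 38.0, 37.5]
fewshot_scores = [30.0, 30.0, 34.5, 41.0, 33.0, 38.5, 37.5, 36.0, 38.0, 35.0,
                  36.0, 35.5, 38.0, 36.0, 36.5, 41.0, 35.5, 39.0, 34.5]

knn_avg = mean(knn_scores)          # average train score, KNN variant
fewshot_avg = mean(fewshot_scores)  # average train score, baseline

# Test set: 94/200 correct vs 86/200 correct, in percentage points.
test_gap_points = 100 * (94 - 86) / 200
```

Both runs reach the same best train score (41.0). The gap appears on the test set, where 8 more correct examples out of 200 amounts to 4 percentage points, a small sample for a single run.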

ryanh-ai commented 2 days ago

Excited to see this merge.