stanfordnlp / dspy

DSPy: The framework for programming—not prompting—foundation models
https://dspy-docs.vercel.app/
MIT License

Best practice for train and validation set separation #1181

Closed denisergashbaev closed 4 days ago

denisergashbaev commented 1 week ago

Hello! I am interested to know how we should approach the compilation step. I thought of the following, but I am not sure whether it is correct practice:

Now, some questions:

Thank you!

okhat commented 1 week ago

Hey @denisergashbaev , which version of DSPy are you on?

In general, when using DSPy there are four data splits to keep in mind: Train, Validation, Development, and Test. (Many optimizers only take 'train' and then internally re-split it to train and validation.)

In the most general case, optimizers are free to do anything with Train and Validation, because they're either 'training' (on Train) or 'hyperparameter tuning' (on Validation). Typically, optimizers should not be given any access to Development (which you can use to tweak your algorithm), except in very low-data regimes where Validation = Development. Test is test: it's held out for final evaluation.
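
For concreteness, here is a minimal sketch of such a four-way split. The function name and the 40/20/20/20 ratios are illustrative, not a recommendation; it assumes you start from a list of dspy.Example objects:

    import random
    from typing import List

    import dspy

    def four_way_split(examples: List[dspy.Example], seed: int = 0):
        """Shuffle once with a fixed seed, then split into
        Train / Validation / Development / Test."""
        examples = list(examples)
        random.Random(seed).shuffle(examples)
        n = len(examples)
        a, b, c = int(0.4 * n), int(0.6 * n), int(0.8 * n)
        return examples[:a], examples[a:b], examples[b:c], examples[c:]

    # train_set, val_set, dev_set, test_set = four_way_split(examples)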

In practice, optimizers will generally not use Validation for direct fitting, but only for blackbox optimization. This is not guaranteed, but it is the case throughout every current optimizer. The instance you saw of BootstrapFewShot using validation is just a bug resulting from using an undocumented path in the code. (Unlike other teleprompters, BootstrapFewShot is not an optimizer; it's just a meta-prompting approach, so it shouldn't even be given a validation set. The bugfix was to remove the ability to provide a valset to it at all. The name valset was being overridden for other uses.)

This was fixed a long time ago, though, so make sure you're on a recent version of DSPy.
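
For reference, correct usage on a recent version looks roughly like this (a sketch; metric, train_set, and my_program are placeholders for your own objects, and note the absence of any valset argument):

    from dspy.teleprompt import BootstrapFewShot

    # BootstrapFewShot only bootstraps demonstrations from the training set,
    # so compile() takes a trainset and no valset.
    teleprompter = BootstrapFewShot(metric=metric, max_bootstrapped_demos=4)
    compiled_program = teleprompter.compile(my_program, trainset=train_set)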

Is it OK if the training set examples come from a different distribution than the validation dataset? My training set examples are shorter.

Which distribution do you ultimately care about? Make sure you have that in your dev set and track progress until you're satisfied.

denisergashbaev commented 1 week ago

Hello @okhat thank you very much for your response!

Hey @denisergashbaev , which version of DSPy are you on?

I am using DSPy v2.4.9 and could reproduce the above error with it. Here is the code I used:

    from dspy.teleprompt import BootstrapFewShot

    bfs_optimizer = BootstrapFewShot(
        metric=metric,
        teacher_settings=teacher_settings,
        max_bootstrapped_demos=3,
        max_labeled_demos=len(train_set),
        max_rounds=1,
        max_errors=0,
    )
    page_data_extractor = bfs_optimizer.compile(
        page_data_extractor, trainset=train_set, valset=val_set
    )

If I inspect the JSON file for the compiled program, I can see that some examples from the validation set end up in there.

Also, the BootstrapFewShot documentation mentions valset explicitly.

train, validation, dev, test datasets

Let me rephrase your answer to make sure I understand it properly. Could you please correct me if I am wrong:

very low-data regimes where Validation = Development

What would be your estimate for a low-data regime that would necessitate using validation = development?

Thank you

Gsizm commented 5 days ago

Omar (@okhat), could you please take a look at @denisergashbaev's query above? I happen to have a similar one.

okhat commented 4 days ago

Hey @denisergashbaev and @Gsizm,

Thanks for the note on the docs page. I've removed the mention of valset from there: it wasn't a correct reference. BootstrapFewShot is not an optimizer (it's an auto-prompting technique, or a non-optimizing teleprompter), and as such it has no proper use for a validation set.

BootstrapFewShotWithRandomSearch, on the other hand, is an optimizer. You can and should give it separate trainset and valset. It will build examples from trainset and will score candidate programs on the valset.
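
Sketched usage, reusing the names from your snippet above (the num_candidate_programs value is just illustrative):

    from dspy.teleprompt import BootstrapFewShotWithRandomSearch

    # Demos are bootstrapped from train_set; val_set is only used to score
    # the candidate programs produced by the random search.
    optimizer = BootstrapFewShotWithRandomSearch(
        metric=metric,
        max_bootstrapped_demos=3,
        num_candidate_programs=8,
    )
    page_data_extractor = optimizer.compile(
        page_data_extractor, trainset=train_set, valset=val_set
    )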

If you have several hundred examples, I recommend using a devset != valset and not passing the devset to any optimizers. That way, you have a way to measure progress before you eventually evaluate on the held-out test set. That said, using valset == devset is often OK too, especially if your total data is under 200-400 examples.
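
To measure that progress, you can score the held-out dev set with dspy.evaluate.Evaluate; a minimal sketch, assuming dev_set and metric are defined as above:

    from dspy.evaluate import Evaluate

    # The dev set is never passed to an optimizer; it only tracks progress
    # between compilation runs.
    evaluator = Evaluate(devset=dev_set, metric=metric, num_threads=8,
                         display_progress=True)
    dev_score = evaluator(page_data_extractor)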

Only two rules are crucial: never give the optimizers access to your development set, and keep the test set held out until the final evaluation.

okhat commented 4 days ago

I assume this is resolved; feel free to re-open if necessary. I forgot to add that DSPy 2.4.10 should certainly not have a valset argument in BootstrapFewShot; we removed that field in April, iirc. Let me know if it's still there or if you have any other thoughts or questions.