stanfordnlp / dspy

DSPy: The framework for programming—not prompting—foundation models
https://dspy-docs.vercel.app/
MIT License

Best practice for train and validation set separation #1181

Closed denisergashbaev closed 4 days ago

denisergashbaev commented 1 week ago

Hello! I am interested to know how we should approach the compilation step. I thought of the following, but I am not sure whether it is correct practice:

Now, some questions:

Thank you!

okhat commented 1 week ago

Hey @denisergashbaev , which version of DSPy are you on?

In general, when using DSPy there are four data splits to keep in mind: Train, Validation, Development, and Test. (Many optimizers only take 'train' and then internally re-split it to train and validation.)

In the most general case, optimizers are free to do anything with Train and Validation, because they're either 'training' (on Train) or 'hyperparameter tuning' (on Validation). Typically, optimizers should not be given any access to Development (which you can use to tweak your algorithm), except in very low-data regimes where Validation = Development. Test is test: it's held out for final evaluation.
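
For concreteness, here is a minimal sketch of such a four-way split. The function name and the 40/20/20/20 ratios are illustrative, not a recommendation; it assumes you start from a list of dspy.Example objects:

    import random
    from typing import List

    import dspy

    def four_way_split(examples: List[dspy.Example], seed: int = 0):
        """Shuffle once with a fixed seed, then split into
        Train / Validation / Development / Test."""
        examples = list(examples)
        random.Random(seed).shuffle(examples)
        n = len(examples)
        a, b, c = int(0.4 * n), int(0.6 * n), int(0.8 * n)
        return examples[:a], examples[a:b], examples[b:c], examples[c:]

    # train_set, val_set, dev_set, test_set = four_way_split(examples)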

In practice, optimizers will generally not use Validation for direct fitting, but only for blackbox optimization. This is not guaranteed, but it is the case throughout every current optimizer. The instance you saw of BootstrapFewShot using validation is just a bug resulting from using an undocumented path in the code. (Unlike other teleprompters, BootstrapFewShot is not an optimizer; it's just a meta-prompting approach, so it shouldn't even be given a validation set. The bugfix was to remove the ability to provide a valset to it at all. The name valset was being overridden for other uses.)

This was fixed a long time ago, though, so make sure you're on a recent version of DSPy.
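
For reference, correct usage on a recent version looks roughly like this (a sketch; metric, train_set, and my_program are placeholders for your own objects, and note the absence of any valset argument):

    from dspy.teleprompt import BootstrapFewShot

    # BootstrapFewShot only bootstraps demonstrations from the training set,
    # so compile() takes a trainset and no valset.
    teleprompter = BootstrapFewShot(metric=metric, max_bootstrapped_demos=4)
    compiled_program = teleprompter.compile(my_program, trainset=train_set)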

Is it OK if the training set examples come from a different distribution than the validation dataset? My training set examples are shorter.

Which distribution do you ultimately care about? Make sure you have that in your dev set and track progress until you're satisfied.

denisergashbaev commented 1 week ago

Hello @okhat thank you very much for your response!

Hey @denisergashbaev , which version of DSPy are you on?

I am using DSPy v2.4.9 and could reproduce the above error with it. Here is the code I used:

    from dspy.teleprompt import BootstrapFewShot

    bfs_optimizer = BootstrapFewShot(
        metric=metric,
        teacher_settings=teacher_settings,
        max_bootstrapped_demos=3,
        max_labeled_demos=len(train_set),
        max_rounds=1,
        max_errors=0,
    )
    page_data_extractor = bfs_optimizer.compile(
        page_data_extractor, trainset=train_set, valset=val_set
    )

If I inspect the JSON file for the compiled program, I can see that some examples from the validation set end up in there.

Also, the BootstrapFewShot documentation mentions valset explicitly.

train, validation, dev, test datasets

Let me rephrase your answer to make sure I understand it properly. Could you please correct me if I am wrong:

very low-data regimes where Validation = Development

What would be your estimate for a low-data regime that would necessitate using validation = development?

Thank you

Gsizm commented 5 days ago

Omar (@okhat), could you please take a look at @denisergashbaev's query above? I happen to have a similar one.

okhat commented 4 days ago

Hey @denisergashbaev and @Gsizm,

Thanks for the note on the docs page. I've removed the mention of valset from there: it wasn't a correct reference. BootstrapFewShot is not an optimizer (it's an auto-prompting technique, or a non-optimizing teleprompter), and as such it has no proper use for a validation set.

BootstrapFewShotWithRandomSearch, on the other hand, is an optimizer. You can and should give it separate trainset and valset. It will build examples from trainset and will score candidate programs on the valset.
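
Sketched usage, reusing the names from your snippet above (the num_candidate_programs value is just illustrative):

    from dspy.teleprompt import BootstrapFewShotWithRandomSearch

    # Demos are bootstrapped from train_set; val_set is only used to score
    # the candidate programs produced by the random search.
    optimizer = BootstrapFewShotWithRandomSearch(
        metric=metric,
        max_bootstrapped_demos=3,
        num_candidate_programs=8,
    )
    page_data_extractor = optimizer.compile(
        page_data_extractor, trainset=train_set, valset=val_set
    )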

If you have several hundred examples, I recommend using a devset != valset and not passing the devset to any optimizers. That way, you have a way to measure progress before you eventually evaluate on the held-out test set. That said, using valset == devset is often OK too, especially if your total data is under 200-400 examples.
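
To measure that progress, you can score the held-out dev set with dspy.evaluate.Evaluate; a minimal sketch, assuming dev_set and metric are defined as above:

    from dspy.evaluate import Evaluate

    # The dev set is never passed to an optimizer; it only tracks progress
    # between compilation runs.
    evaluator = Evaluate(devset=dev_set, metric=metric, num_threads=8,
                         display_progress=True)
    dev_score = evaluator(page_data_extractor)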

Only two rules are crucial: never give the optimizers access to your development set, and keep the test set held out until the final evaluation.

okhat commented 4 days ago

I assume this is resolved; feel free to re-open if necessary. I forgot to add that DSPy 2.4.10 should certainly not have a valset argument in BootstrapFewShot; we removed that field in April, iirc. Let me know if it's still there or if you have any other thoughts or questions.