Open wyclike opened 2 weeks ago
In our approach, we do not have separate sets of model parameters for each few-shot scenario. Instead, we use a single set of model parameters that is pretrained on a mixture of data across six different few-shot settings (0, 4, 8, 16, 32, and 64 shots).
Here is how we construct the pretraining data. The construction is guided by the following strategies:
Avoiding Duplication: Across the different few-shot scenarios, we use disjoint sets of test samples. This prevents the model from merely memorizing specific samples.
Fixed Group of Context Samples: Within a single few-shot scenario, we use a fixed group of context samples for all test samples. This strategy helps the model focus on identifying differences between the test samples given the same context.
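The two strategies above can be sketched in code. This is a minimal illustration, not the paper's actual pipeline: the function and field names (`build_pretraining_mixture`, `samples_per_setting`, the entry dict keys) are hypothetical, and it assumes the pool is a flat list of labeled examples. It draws one fixed context group per shot setting and assigns each setting a disjoint slice of test samples.

```python
import random

# The six few-shot settings mentioned in the paper.
SHOT_SETTINGS = [0, 4, 8, 16, 32, 64]

def build_pretraining_mixture(pool, samples_per_setting, seed=0):
    """Build one mixed pretraining dataset spanning all shot settings.

    Hypothetical sketch: `pool` is a list of labeled examples.
    Each k-shot scenario gets one FIXED context group shared by all of
    its test samples, and the test samples are DISJOINT across scenarios.
    """
    rng = random.Random(seed)
    shuffled = pool[:]
    rng.shuffle(shuffled)

    mixture = []
    cursor = 0
    for k in SHOT_SETTINGS:
        # Fixed group of k context samples, reused for every test
        # sample in this scenario (strategy 2).
        context = shuffled[cursor:cursor + k]
        cursor += k
        # Disjoint slice of test samples for this scenario, so no
        # test sample appears in two scenarios (strategy 1).
        tests = shuffled[cursor:cursor + samples_per_setting]
        cursor += samples_per_setting
        for test in tests:
            mixture.append({"shots": k, "context": context, "test": test})

    # Interleave the scenarios so a single set of model parameters
    # sees all shot counts during pretraining.
    rng.shuffle(mixture)
    return mixture
```

A single model is then pretrained on `mixture` as one stream, rather than training six separate sets of parameters.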
I would like to ask how the few-shot approach is specifically incorporated into your pretraining process. For instance, the paper mentions six few-shot scenarios with 0, 4, 8, 16, 32, and 64 shots.

- Does each entry in your pretraining dataset include few-shot content? If so, would that imply there are six sets of model parameters?
- If few-shot instances are included, how is their selection determined — fixed or random?
- If they are not included, would it be correct to understand that you train a single set of zero-shot pretrained weights and only consider the various few-shot conditions at test time?