timoschick / pet

This repository contains the code for "Exploiting Cloze Questions for Few-Shot Text Classification and Natural Language Inference"
https://arxiv.org/abs/2001.07676
Apache License 2.0

How to choose unlabelled data #26

Closed Punchwes closed 3 years ago

Punchwes commented 3 years ago

Hi @timoschick, thanks very much for your work! I have a question about how you select the unlabelled data for each task.

In the paper you say

Similarly, we construct the set D of unlabeled examples by selecting 10000 examples per label and removing all labels

Taking AG's News as an example, I assume this means you take 40,000 examples in total from the training set (it has 4 classes), i.e. 10,000 examples for each class. However, in your code it seems that you do not follow the 10,000-examples-per-label rule: you just shuffle and pick the first 40,000 examples.

I am a little bit confused about this; any clarification would be helpful.

timoschick commented 3 years ago

Hi @Punchwes, there are two options for limiting the number of unlabeled examples:

1) You can specify --unlabeled_examples <k> for some natural number <k>, e.g. --unlabeled_examples 40000. If you do so, the entire set of unlabeled examples is shuffled and then the first 40,000 examples in the shuffled dataset are chosen. Of course, this does not guarantee that there is an equal number of examples for each label.

2) You can specify --unlabeled_examples <k> --split_examples_evenly for some natural number <k> as above. In this case, if your dataset has <n> labels, then for each label the first <k>/<n> examples with that label found in the (unshuffled) unlabeled dataset are chosen.

For our experiments on AG's News, we chose the second option (that is, --unlabeled_examples 40000 --split_examples_evenly). If you wanted to combine both options (shuffle the dataset and select the same number of examples for each label), you'd have to implement this yourself (see the sketch below), but it should not require more than one or two lines of code.
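
For illustration, here is a rough sketch of such a combination (a hypothetical helper, not part of PET, assuming a list of InputExample objects that still carry a label attribute):

    import random
    from collections import defaultdict

    def shuffle_and_limit_per_label(examples, num_examples, labels, seed=42):
        """Shuffle examples, then keep at most num_examples / len(labels) per label."""
        rng = random.Random(seed)
        shuffled = list(examples)
        rng.shuffle(shuffled)

        limit = num_examples // len(labels)  # e.g. 40000 // 4 = 10000 for AG's News
        per_label_count = defaultdict(int)
        selected = []
        for example in shuffled:
            if per_label_count[example.label] < limit:
                per_label_count[example.label] += 1
                selected.append(example)
        return selected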

I hope this answers your question!

Punchwes commented 3 years ago

Hi @timoschick , thanks for your quick reply.

I think the method you describe in the paper corresponds to the second option; what confuses me is that, in the code, --split_examples_evenly never seems to apply to unlabeled data.

As the assertion in tasks.py shows:

    assert (not set_type == UNLABELED_SET) or (num_examples is not None), \
        "For unlabeled data, 'num_examples_per_label' is not allowed"

and in the example loading part in cli.py:

    train_data = load_examples(
        args.task_name, args.data_dir, TRAIN_SET, num_examples=train_ex, num_examples_per_label=train_ex_per_label)
    eval_data = load_examples(
        args.task_name, args.data_dir, eval_set, num_examples=test_ex, num_examples_per_label=test_ex_per_label)
    unlabeled_data = load_examples(
        args.task_name, args.data_dir, UNLABELED_SET, num_examples=args.unlabeled_examples)

there is no num_examples_per_label parameter passed when loading unlabeled_data. This is why I am confused: it seems that the first option is always used for unlabeled data.

    if args.split_examples_evenly:
        train_ex_per_label = eq_div(args.train_examples, len(args.label_list)) if args.train_examples != -1 else -1
        test_ex_per_label = eq_div(args.test_examples, len(args.label_list)) if args.test_examples != -1 else -1
        train_ex, test_ex = None, None

and, as far as I can see, unlabeled data is not involved in the split_examples_evenly part.

Or perhaps I have missed something in the code where --split_examples_evenly can be applied to unlabeled data.

timoschick commented 3 years ago

Oh right, my mistake, you are absolutely correct! For our AG's News results, we used an older version of the code (the corresponding file can still be found here). Back then, examples were always split evenly across all labels, so option (1) from my previous comment was not possible and option (2) was the default. When I wrote the current version of PET, I explicitly removed the num_examples_per_label option for unlabeled data because, in a real-world setting, you of course do not have labels for unlabeled data, so at the time this felt like a sensible choice. But this also means that with the current version of PET, option (2) from my previous comment is no longer possible. So you'd have to either

1) modify the code by removing the assertion and applying the if args.split_examples_evenly: [...] code block to unlabeled examples as well, or

2) write a script that extracts the first 10,000 examples for each label and writes them to a separate file, and then use this separate file as input (see the sketch below).
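
For option 2), a minimal standalone sketch might look as follows (it assumes a CSV input whose first column is the label, as in the AG's News train.csv; the file names are placeholders):

    import csv
    from collections import defaultdict

    LIMIT_PER_LABEL = 10000

    # Keep the first LIMIT_PER_LABEL rows per label and write them to a new
    # file that can then be passed to PET as the unlabeled dataset.
    counts = defaultdict(int)
    with open("train.csv", encoding="utf-8") as f_in, \
            open("unlabeled.csv", "w", encoding="utf-8", newline="") as f_out:
        reader = csv.reader(f_in)
        writer = csv.writer(f_out)
        for row in reader:
            label = row[0]
            if counts[label] < LIMIT_PER_LABEL:
                counts[label] += 1
                writer.writerow(row)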

Sorry for the confusion!

Punchwes commented 3 years ago

Thanks very much for this clarification, that's very helpful, and it makes sense to remove the option for unlabelled data.

One last question I have is about the seed. You mentioned in the paper that:

each model is trained three times using different seeds and average results are reported

After checking the code, it seems that the seed parameter passed via the command line (args.seed) is not used to choose the data examples:

    train_data = load_examples(
        args.task_name, args.data_dir, TRAIN_SET, num_examples=train_ex, num_examples_per_label=train_ex_per_label)
    eval_data = load_examples(
        args.task_name, args.data_dir, eval_set, num_examples=test_ex, num_examples_per_label=test_ex_per_label)
    unlabeled_data = load_examples(
        args.task_name, args.data_dir, UNLABELED_SET, num_examples=args.unlabeled_examples)

and the seed in the load_examples function is fixed to 42:

def load_examples(task, data_dir: str, set_type: str, *_, num_examples: int = None,
                  num_examples_per_label: int = None, seed: int = 42) -> List[InputExample]:

So I wonder: when you run the model 3 times with different seeds, do you also change the seed in load_examples() manually?

timoschick commented 3 years ago

For our experiments in Exploiting Cloze Questions for Few Shot Text Classification and Natural Language Inference, we use the same set of examples for all three runs. The different seeds only affect the initialization of model parameters (for regular supervised training), dropout and the shuffling of training examples (i.e., the order in which they are presented to the model), which happens here.

If you're interested in how different sets of training examples affect performance, you might find Table 6 in this paper useful.
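
If you did want each run to use a different set of training examples, one small (hypothetical) change would be to forward args.seed to load_examples, which already accepts a seed keyword argument:

    # hypothetical modification in cli.py: use the command-line seed instead
    # of the fixed default of 42 when sampling the training examples
    train_data = load_examples(
        args.task_name, args.data_dir, TRAIN_SET, num_examples=train_ex,
        num_examples_per_label=train_ex_per_label, seed=args.seed)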

Punchwes commented 3 years ago

Thanks very much!