meaning of `DEV_FILE_NAME`

timoschick / pet

This repository contains the code for "Exploiting Cloze Questions for Few-Shot Text Classification and Natural Language Inference"

https://arxiv.org/abs/2001.07676

Apache License 2.0


meaning of `DEV_FILE_NAME` #7

Closed chris-aeviator closed 3 years ago

chris-aeviator commented 3 years ago

Thanks for sharing this repo. Looking at the /examples dir, you split your dataset (the labeled data?) into

DEV_FILE_NAME
TRAIN_FILE_NAME
TEST_FILE_NAME

and additionally

UNLABELED_FILE_NAME

Two questions arise: a) How do you split the labeled data (i.e. with what distribution; for example, are the 32 training examples from FewGLUE split equally across DEV / TRAIN / TEST)? b) Will UNLABELED be predicted automatically, and how is the result stored?

timoschick commented 3 years ago

Hi @chris-aeviator,

a) The labeled (training) data is not split at all. In the case of FewGLUE, this means that TRAIN_FILE_NAME should point to a file containing all 32 examples, whereas DEV_FILE_NAME and TEST_FILE_NAME should point to files containing the original dev/test examples. Note that the dev examples are not used at all during training or for hyperparameter optimization; just like the test examples, they are only used for evaluation. If you have no dev examples, you can simply make `get_dev_examples(self, data_dir: str)` return an empty list.
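A minimal sketch of that last suggestion, assuming a custom processor class. `MyTaskProcessor` and the file names are hypothetical stand-ins for illustration; in a real setup the class would derive from the repository's own data processor base class, which is not imported here so that the snippet stays self-contained:

```python
from typing import List


class MyTaskProcessor:
    """Hypothetical stand-in for a custom PET data processor."""

    # File names are illustrative assumptions, not values from the repository.
    TRAIN_FILE_NAME = "train.csv"
    DEV_FILE_NAME = "dev.csv"
    TEST_FILE_NAME = "test.csv"
    UNLABELED_FILE_NAME = "unlabeled.csv"

    def get_dev_examples(self, data_dir: str) -> List:
        # No dev set is available, so evaluation on dev is skipped
        # by returning an empty list instead of reading DEV_FILE_NAME.
        return []


processor = MyTaskProcessor()
dev_examples = processor.get_dev_examples("path/to/data")
```

Because the dev examples are only used for evaluation, returning an empty list here should not change what the model is trained on.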

b) Yes, but only for the individual models and not for the final distilled classifier. If you need predictions for the unlabeled data, you can simply set `TEST_FILE_NAME = UNLABELED_FILE_NAME`. The result is then stored in a file predictions.jsonl, where each line is of the form `{"idx": <IDX>, "label": "<LABEL>"}`; `<IDX>` is the index of the example in the test file and `<LABEL>` is the predicted label.
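For reading such a file back, a short sketch might look like the following. It assumes only the line format described above; the helper name `load_predictions` and the sample lines are made up for illustration and are not part of the repository:

```python
import json
from collections import Counter
from typing import Dict, Iterable


def load_predictions(lines: Iterable[str]) -> Dict:
    """Map each example index to its predicted label."""
    predictions = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, e.g. a trailing newline
        record = json.loads(line)
        predictions[record["idx"]] = record["label"]
    return predictions


# Illustrative lines in the described format (not real model output).
sample_lines = [
    '{"idx": 0, "label": "1"}',
    '{"idx": 1, "label": "0"}',
    '{"idx": 2, "label": "1"}',
]

predictions = load_predictions(sample_lines)
label_counts = Counter(predictions.values())
```

With a real file, the same function can be called as `load_predictions(open(path, encoding="utf-8"))`. Comparing `len(predictions)` with the number of rows in each input file is one way to check which file the predictions actually correspond to, and `label_counts` shows how the predicted labels are distributed.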

timoschick commented 3 years ago

I'm closing this issue for now. Feel free to reopen it if you have further questions.

aidahalitaj commented 6 months ago

Hi @timoschick,

I am running PET for a custom task with --model_type bert. In the --data_dir I have four files: train.csv, test.csv, dev.csv, and unlabeled.csv.

In the shell script, I have: --do_train \ --do_eval

In the output, I always get a predictions.jsonl file. UNLABELED_FILE_NAME = "unlabeled.csv", so it does not point to any of the other datasets. However, the predictions file appears to contain the model's predictions for dev.csv: I tested with different numbers of samples in each of the train/test/dev/unlabeled files, and the number of rows in predictions.jsonl always matched that of the dev set. Does the predictions file (located in the final folder) contain the predictions for dev.csv by default?

aidahalitaj commented 6 months ago

More info on what I said earlier...

@timoschick I ran two similar experiments on the same dataset (varying the unlabeled sample size).

Experiment A settings:

Balanced Dataset
train (50 samples per class)
test (150 samples per class)
dev (150 samples per class)
unlabeled (10 samples per class)

predictions.jsonl has 300 predicted labels in total. In Experiment A, all 300 predicted labels belong to only one class.

Experiment B settings:

Balanced Dataset
train (50 samples per class)
test (150 samples per class)
dev (150 samples per class)
unlabeled (100 samples per class)

predictions.jsonl has 300 predicted labels in total. In Experiment B, the predicted labels cover both classes.

My task is a classification problem with two labels, but I don't understand what role the unlabeled data plays in this case or why it affects the result.