timoschick / pet

This repository contains the code for "Exploiting Cloze Questions for Few-Shot Text Classification and Natural Language Inference"
https://arxiv.org/abs/2001.07676
Apache License 2.0

Annotating an unlabeled set #1

Closed Mahhos closed 3 years ago

Mahhos commented 4 years ago

Hi. Thanks for the great repo. I have a question regarding PET training and annotating an unlabeled set (the examples from D mentioned in the paper). I assume this would be done using the command in the PET Training and Evaluation section of the repo, but I am not sure where to put the unlabeled set or where to get the predicted labels. Could you please let me know how to get the predicted labels for the unlabeled set? Thank you.

timoschick commented 4 years ago

Hi, this is a bit difficult to do in the current version because PET expects the unlabeled examples to be in the same file as the labeled examples (this will be fixed in the next version, which will hopefully be released in ~2 weeks). What you can do until then is the following:

1) Replace line 154 in run_training.py (all_train_data = load_examples(args.task_name, args.data_dir, args.lm_train_examples_per_label, evaluate=False)) with some custom function to load your unlabeled set, something like all_train_data = load_unlabeled_examples().

2) When you run run_training.py, set the --save_train_logits flag. This will produce a file called logits.txt in the specified output directory that, for each unlabeled example in all_train_data, contains the logits for all labels.

For example, if your TaskProcessor's get_labels() function returns ["good", "bad"] and all_train_data = [ex0, ex1, ex2], then the model's logits for "bad" given ex2 correspond to the second number in the third line of logits.txt.
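To turn that file into predicted labels, a minimal sketch (the label order is hypothetical and must match whatever your TaskProcessor's get_labels() returns; logits.txt is assumed to contain one whitespace-separated row of logits per example, in the same order as all_train_data):

```python
# Hypothetical label order, matching what get_labels() would return.
labels = ["good", "bad"]

# Example contents of logits.txt: one whitespace-separated row per example.
lines = ["0.1 2.3", "1.7 -1.4", "0.0 0.0"]

predictions = []
for line in lines:
    logits = [float(x) for x in line.split()]
    best = max(range(len(logits)), key=lambda i: logits[i])  # argmax over labels
    predictions.append(labels[best])

print(predictions)  # ['bad', 'good', 'good']
```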

Mahhos commented 4 years ago

Thanks for the response. In the current version (not changing line 154 in run_training.py), may I put my unlabeled samples in the training file (at the end of the file for example) and set --save_train_logits to get the predicted labels? If not, should I put my unlabeled data in a separate csv file and define a new version of load_examples(), and get_dev_examples()/get_train_examples() to read my unlabeled data?

timoschick commented 4 years ago

should I put my unlabeled data in a separate csv file and define a new version of load_examples(), and get_dev_examples()/get_train_examples() to read my unlabeled data?

That would be the safest way, so I'd recommend doing it like that!

Mahhos commented 4 years ago

Thanks. I have another question, regarding the verbalizer. I am designing a custom PVP. How can I make sure that the language model will fill the <MASK> with exactly my tokens?

For example, for the Yelp task, how did you know that the language model would predict exactly ["terrible"], ["bad"], ["okay"], ["good"], ["great"] and not some synonyms of these words?

timoschick commented 4 years ago

If your verbalizer uses only the words terrible, bad, okay, good and great, then PET simply ignores the probabilities assigned to all other words. Let's assume the model's predictions are (in that order):

horrible # 0.30
awful    # 0.20
terrible # 0.20
bad      # 0.10
... 
okay     # 0.02
good     # 0.01
great    # 0.01

PET basically removes all words that are not used by the verbalizer, resulting in the following reduced list:

terrible # 0.20
bad      # 0.10
... 
okay     # 0.02
good     # 0.01
great    # 0.01

So PET would assign the label corresponding to terrible to this example, even if terrible is not the word that the language model would have predicted.
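In code, the restriction described above amounts to filtering the model's word probabilities down to the verbalizer vocabulary and taking the argmax. A minimal sketch (the probabilities are the illustrative numbers from above, not real model output):

```python
# Illustrative word probabilities from the masked language model.
predictions = {
    "horrible": 0.30, "awful": 0.20, "terrible": 0.20,
    "bad": 0.10, "okay": 0.02, "good": 0.01, "great": 0.01,
}
verbalizer = ["terrible", "bad", "okay", "good", "great"]

# Keep only verbalizer words, then pick the best-scoring one.
restricted = {w: p for w, p in predictions.items() if w in verbalizer}
label_word = max(restricted, key=restricted.get)
print(label_word)  # terrible
```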

chris-aeviator commented 3 years ago

@timoschick

If I have labels 0 = 'bad' & 1 = 'good', I'll get an unlabeled_logits.txt with the first row being -1, followed by one row for each row in my unlabeled.csv file.

Is it correct that I then apply a softmax to each row to get a prediction for the first label "bad" (the first "column" in the logits file) and "good" (the second "column")?

example logits

-1
0.21161096000000001 0.3217776633333334
1.6751958333333334  -1.45424471

EDIT:

Ended up writing a conversion script (since I'm using an Airflow pipeline for the job anyway) that writes a prediction file with probabilities derived from the logits:

import torch
import pandas as pd

logits_file = '/tmp/unlabeled_logits.txt'
results = []
with open(logits_file, 'r') as fh:
  lines = fh.read().splitlines()

# The first line of the logits file is a header row ("-1"), so skip it.
for line in lines[1:]:
  example_logits = [float(x) for x in line.split()]
  tensors = torch.tensor(example_logits)
  # Softmax over the label dimension turns the logits into probabilities.
  results.append(torch.softmax(tensors, dim=0).numpy())

df = pd.DataFrame(results)
df.to_csv('/out/predictions.csv')

The output is a probability for my label bad (first column) and good (second column):

0.9937028288841248,0.006297166459262371