Closed Mahhos closed 3 years ago
Hi, this is a bit difficult to do in the current version because PET expects the unlabeled examples to be in the same file as the labeled examples (this will be fixed in the next version, which will hopefully be released in ~2 weeks). What you can do until then is the following:
1) replace line 154 in run_training.py (`all_train_data = load_examples(args.task_name, args.data_dir, args.lm_train_examples_per_label, evaluate=False)`) with some custom function that loads your unlabeled set, something like `all_train_data = load_unlabeled_examples()`.
2) when you run run_training.py, set the `--save_train_logits` flag. This produces a file called `logits.txt` in the specified output directory that, for each unlabeled example in `all_train_data`, contains the logits for all labels.
For example, if your `TaskProcessor`'s `get_labels()` function returns `["good", "bad"]` and `all_train_data = [ex0, ex1, ex2]`, then the model's logit for `"bad"` given `ex2` corresponds to the second number in the third line of `logits.txt`.
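To make that mapping concrete, here is a small illustrative snippet (the label order and the simulated file contents follow the example above; the real `logits.txt` is produced by PET itself):

```python
# Labels in the order returned by get_labels(), as in the example above.
labels = ["good", "bad"]

# Simulated contents of logits.txt: one line per example in all_train_data,
# one number per label, in get_labels() order.
logits_lines = [
    "0.3 1.2",   # logits for ex0
    "2.1 -0.5",  # logits for ex1
    "0.0 0.9",   # logits for ex2
]

for i, line in enumerate(logits_lines):
    values = [float(x) for x in line.split()]
    # Pair each number on the line with its label.
    per_label = dict(zip(labels, values))
    print(f"ex{i}: {per_label}")

# The logit for "bad" given ex2 is the second number in the third line:
print(logits_lines[2].split()[1])  # -> 0.9
```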
Thanks for the response. In the current version (without changing line 154 in run_training.py), may I put my unlabeled samples in the training file (at the end of the file, for example) and set `--save_train_logits` to get the predicted labels?
If not, should I put my unlabeled data in a separate CSV file and define a new version of `load_examples()`, and of `get_dev_examples()`/`get_train_examples()`, to read my unlabeled data?
> should I put my unlabeled data in a separate csv file and define a new version of load_examples(), and get_dev_examples()/get_train_examples() to read my unlabeled data?
That would be the safest way, so I'd recommend doing it like that!
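For illustration, a loader for such a separate CSV file might look like the sketch below. Note that everything here is hypothetical: the class, field names, and file layout are assumptions, and PET's actual example class (`InputExample` in newer versions) may have different fields, so adapt this to whatever `load_examples()` returns in your version.

```python
import csv
from dataclasses import dataclass

# Hypothetical example class -- PET's actual example class may differ;
# adapt guid/text/label to what load_examples() returns in your version.
@dataclass
class UnlabeledExample:
    guid: str
    text: str
    label: str = None  # unlabeled, so no gold label

def load_unlabeled_examples(path="unlabeled.csv"):
    """Read one unlabeled text per CSV row (assumes a single-column file)."""
    examples = []
    with open(path, newline="", encoding="utf-8") as fh:
        for i, row in enumerate(csv.reader(fh)):
            examples.append(UnlabeledExample(guid=f"unlabeled-{i}", text=row[0]))
    return examples
```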
Thanks. I have another question regarding the verbalizer. I am designing a custom PVP. How can I make sure that the language model fills the `<MASK>` with exactly my tokens? For example, for the Yelp task, how did you know that the language model would predict exactly `["terrible"], ["bad"], ["okay"], ["good"], ["great"]` and not any synonyms of these words?
If your verbalizer uses only the words `terrible`, `bad`, `okay`, `good` and `great`, then PET simply ignores the probabilities assigned to all other words. Let's assume the model's predictions are (in that order):
```
horrible # 0.30
awful    # 0.20
terrible # 0.20
bad      # 0.10
...
okay     # 0.02
good     # 0.01
great    # 0.01
```
PET basically removes all words that are not used by the verbalizer, resulting in the following reduced list:
```
terrible # 0.20
bad      # 0.10
...
okay     # 0.02
good     # 0.01
great    # 0.01
```
So PET would assign the label corresponding to `terrible` to this example, even if `terrible` is not the word that the language model would have predicted.
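The filtering step described above can be sketched in a few lines (using the word probabilities from the example; this is an illustration of the idea, not PET's actual implementation):

```python
# Full distribution over the vocabulary (numbers from the example above).
predictions = {
    "horrible": 0.30,
    "awful": 0.20,
    "terrible": 0.20,
    "bad": 0.10,
    "okay": 0.02,
    "good": 0.01,
    "great": 0.01,
}

verbalizer = ["terrible", "bad", "okay", "good", "great"]

# Keep only the verbalizer words; everything else is ignored.
reduced = {w: predictions[w] for w in verbalizer}

# The predicted label is the verbalizer word with the highest remaining score,
# even though "horrible" scored higher over the full vocabulary.
predicted = max(reduced, key=reduced.get)
print(predicted)  # -> terrible
```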
@timoschick if I have labels `0 = 'bad'` & `1 = 'good'`, I'll get an `unlabeled_logits.txt` whose first row is `-1`, followed by one row for each row in my unlabeled.csv file. Is it correct that I then apply softmax to it to get a prediction, where "bad" corresponds to the first "column" in the logits file and "good" to the second "column"?
Example logits:

```
-1
0.21161096000000001 0.3217776633333334
1.6751958333333334 -1.45424471
```
EDIT: I ended up writing a conversion script (since I'm using an Airflow pipeline for the job anyway) that writes a prediction file with probabilities derived from the logits:
```python
import torch
import pandas as pd

logits_file = '/tmp/unlabeled_logits.txt'

results = []
with open(logits_file, 'r') as fh:
    lines = fh.read().splitlines()

# The first line of the logits file is a single "-1" marker, not logits -- skip it.
for line in lines[1:]:
    example_logits = torch.tensor([float(x) for x in line.split()])
    # Softmax over the label dimension turns the logits into probabilities.
    probs = torch.softmax(example_logits, dim=0)
    results.append(probs.numpy())

df = pd.DataFrame(results)
df.to_csv('/out/predictions.csv')
```
The output is a probability for my label bad (first column) and good (second column):

```
0.9937028288841248,0.006297166459262371
```
Hi. Thanks for the great repo. I have a question regarding PET training and annotating an unlabeled set (the examples from D mentioned in the paper). I assume this would be done using the command in the PET Training and Evaluation section of the repo. However, I am not sure where to put the unlabeled set and where to get the predicted labels. Would you please let me know how we should get the predicted labels for the unlabeled set? Thank you.