timoschick / pet

This repository contains the code for "Exploiting Cloze Questions for Few-Shot Text Classification and Natural Language Inference"
https://arxiv.org/abs/2001.07676
Apache License 2.0

PVP for other datasets #2

Closed loretoparisi closed 3 years ago

loretoparisi commented 3 years ago

Assume we have a spam/ham dataset, such as the simple SMS Spam dataset, with rows like the following:

[ham]
Ok lar... Joking wif u oni...

[spam]
Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entr...

How would you define a pattern Pi(a) for a given input text a, together with the resulting verbalizer v?

Another case is a dataset like the Hate Speech Dataset (from "A Measurement Study of Hate Speech in Social Media"):

6,"i hate asian people, especially the filipinos and pakis. ",1402091070021,"Confessions,japan,tourist destinations,Travel",Canada,Manitoba,asian people,ethnicity
9,I can't stand white people. They make me furious.. And I'm white,1402409977473,"Confessions,racism,society",United States,California,white people,race

where we have the following labels: _behavior, race, sexualorientation, other, ethnicity, physical, class, religion, defined, and the sentences have been selected by matching the following patterns:

[Image: Screenshot 2020-09-22 at 09.08.44]

So how should the patterns Pi be defined? My guess is that they could be something like

P1(a) = I...
P2(a) = I can't...
P3(a) = I don't like...
...

where

v(1) = hate
v(2) = stand
v(3) = like
...

Does this make sense? In that case, what would a dataset row look like? Thanks a lot!

timoschick commented 3 years ago

Finding good patterns and corresponding verbalizers is the trickiest part when using PET. Personally, I sometimes use the training set (even if it just contains 10 or so examples) to look for good patterns and verbalizers by evaluating their unsupervised performance. For spam detection, you could start with some trivial patterns like:

combined with a verbalizer v(spam) = Yes, v(ham) = No.
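
To make the idea concrete, here is a minimal, library-independent sketch of a pattern-verbalizer pair for spam detection, together with a check of its accuracy on a handful of labeled examples before any training. The pattern wording, the `[MASK]` placeholder string, and the `score_token` callable are all assumptions made for illustration; they are not the patterns from this comment and not part of the PET API. In practice `score_token` would be backed by a masked language model; the keyword-based stand-in below exists only so that the sketch runs.

```python
from typing import Callable, Dict, List, Tuple

MASK = "[MASK]"  # placeholder for the masked token (assumed notation)

# Verbalizer from the comment above: v(spam) = Yes, v(ham) = No.
VERBALIZER: Dict[str, str] = {"spam": "Yes", "ham": "No"}


def pattern(text: str) -> str:
    """Hypothetical pattern P(a): wraps the input text a in a cloze question."""
    return f"{text} Is this spam? {MASK}."


def predict(text: str, score_token: Callable[[str, str], float]) -> str:
    """Return the label whose verbalization scores highest at the mask."""
    cloze = pattern(text)
    return max(VERBALIZER, key=lambda label: score_token(cloze, VERBALIZER[label]))


def accuracy(examples: List[Tuple[str, str]],
             score_token: Callable[[str, str], float]) -> float:
    """Fraction of (text, label) pairs predicted correctly, without any training."""
    correct = sum(predict(text, score_token) == label for text, label in examples)
    return correct / len(examples)


def toy_score_token(cloze: str, token: str) -> float:
    """Stand-in scorer, NOT a language model: favors "Yes" for prize-like words."""
    looks_spammy = any(w in cloze.lower() for w in ("free", "win", "prize"))
    return 1.0 if (token == "Yes") == looks_spammy else 0.0


examples = [
    ("Ok lar... Joking wif u oni...", "ham"),
    ("Free entry in 2 a wkly comp to win FA Cup final tkts", "spam"),
]
print(accuracy(examples, toy_score_token))
```

Swapping in several candidate `pattern` functions and comparing their accuracy on the small training set is one way to carry out the selection procedure described above.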

I don't really know the Hate Speech Dataset you refer to, but if the task is to find out what is targeted by a given piece of hate speech (and the labels you mention sound like this is the task), you could maybe try something like this:

loretoparisi commented 3 years ago

@timoschick thanks! Assuming I try the simpler SMS Spam dataset, where should I start? I was looking at the DataProcessor class. Should I extend this base class with a SMSSpamDataProcessor(DataProcessor)?

timoschick commented 3 years ago

A good starting point would be to look at https://github.com/timoschick/pet/tree/master/examples, which contains
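
As a rough illustration of the shape such a processor could take for the SMS Spam data, here is a self-contained sketch. `InputExample` and `SmsSpamProcessor` below are stand-ins defined in this snippet, not the classes from the PET repository, and the tab-separated `label<TAB>text` file layout is an assumption about how the data is stored. The real base class and the methods it requires should be taken from the repository's examples.

```python
import csv
from dataclasses import dataclass
from typing import List


@dataclass
class InputExample:
    """Stand-in for an example container: unique id, text a, and label."""
    guid: str
    text_a: str
    label: str


class SmsSpamProcessor:
    """Hypothetical processor reading `label<TAB>text` lines (assumed layout)."""

    LABELS = ["ham", "spam"]

    def get_labels(self) -> List[str]:
        return list(self.LABELS)

    def read_examples(self, path: str, set_type: str) -> List[InputExample]:
        examples = []
        with open(path, encoding="utf-8", newline="") as f:
            # QUOTE_NONE: SMS text may contain stray quote characters.
            reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
            for idx, row in enumerate(reader):
                if len(row) < 2 or row[0] not in self.LABELS:
                    continue  # skip blank or malformed lines
                label, text = row[0], "\t".join(row[1:])
                examples.append(InputExample(f"{set_type}-{idx}", text, label))
        return examples
```

Under these assumptions, a dataset row such as `ham<TAB>Ok lar... Joking wif u oni...` becomes one example with `label="ham"` and the message as `text_a`.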

Mahhos commented 3 years ago

After preparing custom_task_processor.py and custom_task_pvp.py, how should we tell the program to read our customized files (instead of the main files) and run the newly registered task? It seems that just running the commands under "PET Training and Evaluation" does not do this.
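
One possible explanation, assuming the custom files register the new task by adding it to a module-level dictionary when they are imported (a common Python pattern; whether the PET command-line script imports these files on its own is not stated in this thread): the registration only takes effect once the module containing it has actually been executed. The sketch below uses hypothetical names (`PROCESSORS`, `register_task`, `load_processor`, `my-task`) to show the effect and is not the repository's code.

```python
from typing import Callable, Dict

# Hypothetical task registry: maps a task name to a processor factory.
PROCESSORS: Dict[str, Callable[[], object]] = {}


def register_task(name: str, factory: Callable[[], object]) -> None:
    """Add a task to the registry; normally run at import time of the custom file."""
    PROCESSORS[name] = factory


def load_processor(name: str) -> object:
    """Look up a task by name, as a command-line entry point might do."""
    if name not in PROCESSORS:
        raise KeyError(f"Task '{name}' is not registered; was its module imported?")
    return PROCESSORS[name]()


class MyTaskProcessor:
    """Placeholder for the processor defined in a custom file."""


# Before the registration line runs, the lookup fails.
try:
    load_processor("my-task")
    found_before = True
except KeyError:
    found_before = False

# This line stands in for what importing the custom module would execute.
register_task("my-task", MyTaskProcessor)
found_after = isinstance(load_processor("my-task"), MyTaskProcessor)
```

Under this assumption, the entry-point script (or a small wrapper around it) would have to import the custom modules before the task name is resolved.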