PVP for other datasets - Githubissues

loretoparisi commented 3 years ago

Suppose we have a spam/ham dataset, such as the simple SMS Spam dataset shown in the following:

[ham]
Ok lar... Joking wif u oni...

[spam]
Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entr...

How would you define a pattern Pi(a) for a given input text a, and the corresponding verbalizer v?

Another case is a dataset like the Hate Speech Dataset (from "A Measurement Study of Hate Speech in Social Media"):

6,"i hate asian people, especially the filipinos and pakis. ",1402091070021,"Confessions,japan,tourist destinations,Travel",Canada,Manitoba,asian people,ethnicity
9,I can't stand white people. They make me furious.. And I'm white,1402409977473,"Confessions,racism,society",United States,California,white people,race

where we have the following labels: _behavior, race, sexualorientation, other, ethnicity, physical, class, religion, defined, and the sentences have been selected looking at the following patterns:

So how should one define the patterns Pi? My guess is that it could be something like

P1(a) = I...
P2(a) = I can't...
P3(a) = I don't like...
...

where

v(1) = hate
v(2) = stand
v(3) = like
...

Does that make sense? In that case, what would a dataset row look like? Thanks a lot!

timoschick commented 3 years ago

Finding good patterns and corresponding verbalizers is the trickiest part when using PET. Personally, I sometimes use the training set (even if it just contains 10 or so examples) to look for good patterns and verbalizers by evaluating their unsupervised performance. For spam detection, you could start with some trivial patterns like:

P_1(a) = "a". Is this text spam? [MASK]
P_2(a) = Does the following text contain spam? [MASK] "a".

combined with a verbalizer v(spam) = Yes, v(ham) = No.
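
As a rough illustration, the two patterns and the verbalizer above can be written as plain Python string templates. This is only a sketch and not PET's own API: the names `MASK`, `p1`, `p2`, `VERBALIZER` and `fill` are made up for this example, and the actual mask token depends on the language model you use.

```python
MASK = "[MASK]"  # assumption: placeholder only, the real mask token is model-specific


def p1(a: str) -> str:
    """P_1(a): the text, followed by a question and the mask."""
    return f'"{a}". Is this text spam? {MASK}'


def p2(a: str) -> str:
    """P_2(a): a question, the mask, then the text."""
    return f'Does the following text contain spam? {MASK} "{a}".'


# Verbalizer: maps each label to the word the model should predict at the mask.
VERBALIZER = {"spam": "Yes", "ham": "No"}


def fill(cloze: str, label: str) -> str:
    """Show what the cloze text looks like when the mask is filled for a label."""
    return cloze.replace(MASK, VERBALIZER[label])


if __name__ == "__main__":
    text = "Ok lar... Joking wif u oni..."
    for pattern in (p1, p2):
        cloze = pattern(text)
        print(cloze)
        for label in VERBALIZER:
            print(f"  {label}: {fill(cloze, label)}")
```

The model is then asked which verbalizer word ("Yes" or "No") is more likely at the mask position, and that word is mapped back to its label.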

I don't really know the Hate Speech Dataset you refer to, but if the task is to find out what is targeted by a given piece of hate speech (and the labels you mention sound like this is the task), you could maybe try something like this:

P_1(a) = "a". This text targets people by their [MASK].
P_2(a) = "a". This text insults people based on their [MASK].

with a trivial verbalizer v(race) = race, v(religion) = religion and so on.
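
To make the idea of comparing patterns by their unsupervised performance on a handful of labelled examples more concrete, here is a minimal sketch. Everything in it is hypothetical scaffolding rather than PET code: `predict`, `accuracy` and the `Scorer` type are invented names, only a subset of the labels is used, and `toy_scorer` is a trivial keyword counter standing in for a real masked language model, which you would have to plug in yourself.

```python
from typing import Callable, List, Tuple

MASK = "[MASK]"  # assumption: placeholder, the real mask token is model-specific
LABELS = ["race", "religion", "ethnicity"]  # subset of the labels from the thread

# Trivial (identity) verbalizer: each label is verbalized as itself.
VERBALIZER = {label: label for label in LABELS}


def p1(a: str) -> str:
    return f'"{a}". This text targets people by their {MASK}.'


def p2(a: str) -> str:
    return f'"{a}". This text insults people based on their {MASK}.'


# A scorer says how plausible `token` is at the mask position of `cloze`.
Scorer = Callable[[str, str], float]


def predict(cloze: str, scorer: Scorer) -> str:
    """Pick the label whose verbalization gets the highest score."""
    return max(LABELS, key=lambda label: scorer(cloze, VERBALIZER[label]))


def accuracy(pattern: Callable[[str], str],
             examples: List[Tuple[str, str]],
             scorer: Scorer) -> float:
    """Fraction of examples the pattern labels correctly, without any training."""
    hits = sum(predict(pattern(text), scorer) == label for text, label in examples)
    return hits / len(examples)


def toy_scorer(cloze: str, token: str) -> float:
    """Stand-in only, NOT a language model: counts occurrences of the token."""
    return float(cloze.lower().count(token))


# Neutral placeholder texts; real examples would come from the training set.
EXAMPLES = [
    ("a remark about someone's religion", "religion"),
    ("a remark about someone's race", "race"),
]

if __name__ == "__main__":
    for name, pattern in (("P_1", p1), ("P_2", p2)):
        print(name, accuracy(pattern, EXAMPLES, toy_scorer))
```

With a real model behind `scorer`, the pattern and verbalizer pair with the higher accuracy on the small training set would be the more promising candidate.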

loretoparisi commented 3 years ago

@timoschick thanks! Assuming I try the simpler SMS Spam dataset first, where should I start? I was looking at the DataProcessor class. Should I extend this base class with an SMSSpamDataProcessor(DataProcessor)?

timoschick commented 3 years ago

A good starting point would be to look at https://github.com/timoschick/pet/tree/master/examples, which contains:

- an example for a custom data processor (like the SMSSpamDataProcessor you suggested) in custom_task_processor.py
- an example for a custom PVP (like the ones I proposed in my previous comment) in custom_task_pvp.py
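
Before adapting custom_task_processor.py, it may help to see what a dataset row could look like once parsed. The sketch below is standalone and does not import PET: `SpamExample` and `read_examples` are hypothetical names, and it assumes the data is stored as one tab-separated `label<TAB>text` pair per line, which may not match your copy of the dataset. The parsing logic would then be moved into your data processor, following the example file.

```python
import csv
import io
from dataclasses import dataclass
from typing import List, TextIO

LABELS = ["ham", "spam"]


@dataclass
class SpamExample:
    """Hypothetical container for one row: an id, the text, and its label."""
    guid: str
    text_a: str
    label: str


def read_examples(handle: TextIO, set_type: str) -> List[SpamExample]:
    """Parse `label<TAB>text` lines, skipping rows that do not fit the format."""
    examples = []
    # QUOTE_NONE keeps quotation marks inside messages as literal characters.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for idx, row in enumerate(reader):
        if len(row) != 2 or row[0] not in LABELS:
            continue  # malformed line or unknown label
        label, text = row
        examples.append(SpamExample(guid=f"{set_type}-{idx}", text_a=text, label=label))
    return examples


if __name__ == "__main__":
    # Two rows based on the messages quoted earlier in the thread (second one shortened).
    sample = io.StringIO(
        "ham\tOk lar... Joking wif u oni...\n"
        "spam\tFree entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005.\n"
    )
    for example in read_examples(sample, "train"):
        print(example)
```

Each parsed row pairs one input text (the `a` in the patterns) with one label, which the verbalizer then maps to a word such as Yes or No.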

Mahhos commented 3 years ago

After preparing custom_task_processor.py and custom_task_pvp.py, how should we tell the program to read our customized files (instead of the main files) and run the newly registered task? It seems that just running the commands under PET Training and Evaluation is not enough.

timoschick / pet

PVP for other datasets #2