loretoparisi closed this issue 3 years ago
Finding good patterns and corresponding verbalizers is the trickiest part when using PET. Personally, I sometimes use the training set (even if it just contains 10 or so examples) to look for good patterns and verbalizers by evaluating their unsupervised performance. For spam detection, you could start with some trivial patterns like:
P_1(a) = "a". Is this text spam? [MASK]
P_2(a) = Does the following text contain spam? [MASK] "a".
combined with a verbalizer v(spam) = Yes, v(ham) = No.
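As a rough illustration of comparing patterns by their unsupervised performance on a handful of labeled examples, here is a minimal sketch in plain Python. It does not use the PET code base: `toy_scorer` is a hypothetical stand-in for a masked language model that scores a candidate word at the mask position, and `predict` and `accuracy` are helper names made up for this example.

```python
from typing import Callable, List, Tuple

MASK = "[MASK]"  # placeholder; the actual mask token depends on the model


def p1(a: str) -> str:
    return f'"{a}". Is this text spam? {MASK}'


def p2(a: str) -> str:
    return f'Does the following text contain spam? {MASK} "{a}".'


# Verbalizer from the post above: label -> word expected at the mask position
VERBALIZER = {"spam": "Yes", "ham": "No"}

# A scorer takes the cloze text and a candidate word and returns a score.
Scorer = Callable[[str, str], float]


def predict(pattern: Callable[[str], str], text: str, score_word: Scorer) -> str:
    """Return the label whose verbalization gets the highest score."""
    cloze = pattern(text)
    return max(VERBALIZER, key=lambda label: score_word(cloze, VERBALIZER[label]))


def accuracy(pattern: Callable[[str], str],
             examples: List[Tuple[str, str]],
             score_word: Scorer) -> float:
    """Fraction of (text, label) pairs that the pattern gets right."""
    hits = sum(predict(pattern, text, score_word) == label
               for text, label in examples)
    return hits / len(examples)


def toy_scorer(cloze: str, word: str) -> float:
    # Hypothetical stand-in for a language model: a keyword rule, not a real model.
    looks_spammy = "prize" in cloze.lower()
    return 1.0 if (word == "Yes") == looks_spammy else 0.0


examples = [("WIN a free prize now", "spam"), ("See you at lunch", "ham")]
for name, pattern in [("P_1", p1), ("P_2", p2)]:
    print(name, accuracy(pattern, examples, toy_scorer))
```

With a real model in place of `toy_scorer`, the pattern with the higher accuracy on the small training set would be the more promising candidate.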
I don't really know the Hate Speech Dataset you refer to, but if the task is to find out what is targeted by a given piece of hate speech (and the labels you mention sound like this is the task), you could maybe try something like this:
P_1(a) = "a". This text targets people by their [MASK].
P_2(a) = "a". This text insults people based on their [MASK].
with a trivial verbalizer v(race) = race, v(religion) = religion and so on.
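The two patterns and the identity verbalizer can be sketched in plain Python as follows. This is only an illustration, not PET code: the label list is limited to the two labels named above, and `label_from_word_scores` is a hypothetical helper that maps per-word scores at the mask position back to a label.

```python
from typing import Dict

MASK = "[MASK]"  # placeholder; the actual mask token depends on the model

# Only the two labels named above; extend with the remaining dataset labels.
LABELS = ["race", "religion"]


def p1(a: str) -> str:
    return f'"{a}". This text targets people by their {MASK}.'


def p2(a: str) -> str:
    return f'"{a}". This text insults people based on their {MASK}.'


# Trivial verbalizer: every label is verbalized as itself.
VERBALIZER = {label: label for label in LABELS}
WORD_TO_LABEL = {word: label for label, word in VERBALIZER.items()}


def label_from_word_scores(word_scores: Dict[str, float]) -> str:
    """Pick the label whose verbalization scores highest.

    Words that are not a verbalization of any label are ignored.
    """
    candidates = {w: s for w, s in word_scores.items() if w in WORD_TO_LABEL}
    best_word = max(candidates, key=candidates.get)
    return WORD_TO_LABEL[best_word]
```

Restricting the comparison to verbalizer words means that a high-scoring unrelated word at the mask position does not affect the predicted label.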
@timoschick thanks! Assuming I try the simpler SMS Spam dataset first, where should I start?
I was looking at the DataProcessor class. Should I extend this base class with a SMSSpamDataProcessor(DataProcessor)?
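To make the idea concrete, here is a minimal standalone sketch of what such a processor might do. It deliberately does not import PET: `Example` is a hypothetical stand-in for PET's example type, the method names are assumptions, and the input is assumed to be tab-separated lines of the form `label<TAB>text`, as in the SMS Spam Collection file. In PET itself the class would presumably extend the DataProcessor base class mentioned above.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Example:
    """Hypothetical stand-in for the example type used by PET."""
    guid: str
    text_a: str
    label: str


class SMSSpamDataProcessor:
    """Standalone sketch; method names are assumptions, not PET's confirmed API."""

    LABELS = ["ham", "spam"]

    def get_labels(self) -> List[str]:
        return list(self.LABELS)

    def read_examples(self, lines: Iterable[str], set_type: str) -> List[Example]:
        """Parse 'label<TAB>text' lines into examples, skipping empty lines."""
        examples = []
        for idx, line in enumerate(lines):
            line = line.rstrip("\n")
            if not line:
                continue
            label, text = line.split("\t", 1)
            if label not in self.LABELS:
                raise ValueError(f"unexpected label {label!r} on line {idx}")
            examples.append(Example(guid=f"{set_type}-{idx}", text_a=text, label=label))
        return examples
```

A quick check with two in-memory lines: `SMSSpamDataProcessor().read_examples(["ham\tSee you at lunch\n", "spam\tWIN a prize\n"], "train")` yields one ham and one spam example.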
A good starting point would be to look at https://github.com/timoschick/pet/tree/master/examples, which contains an example data processor (like the SMSSpamDataProcessor you suggested) in custom_task_processor.py and an example pattern-verbalizer pair in custom_task_pvp.py.
After preparing custom_task_processor.py and custom_task_pvp.py, how should we tell the program to read our customized files (instead of the main files) and run the newly registered task? It seems that just running the commands under "PET Training and Evaluation" does not do this.
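One general point that may be relevant here: if tasks are looked up by name in a registry that is filled when a module is imported, the custom module has to be imported before the lookup happens. The sketch below shows that mechanism in plain Python. The `PROCESSORS` dictionary, `register`, `load_processor` and the task name `sms-spam` are all hypothetical names for this illustration; whether PET resolves tasks exactly this way is an assumption, not something confirmed in this thread.

```python
from typing import Callable, Dict, List, Type

# Hypothetical name-to-class registry, filled as a side effect of class definition.
PROCESSORS: Dict[str, Type] = {}


def register(task_name: str) -> Callable[[Type], Type]:
    """Class decorator that stores the class under the given task name."""
    def decorator(cls: Type) -> Type:
        PROCESSORS[task_name] = cls
        return cls
    return decorator


def load_processor(task_name: str):
    """Instantiate the processor registered under task_name."""
    if task_name not in PROCESSORS:
        raise KeyError(
            f"task {task_name!r} is not registered; "
            "was the module that defines it imported?"
        )
    return PROCESSORS[task_name]()


# In a real project this class would live in its own file, and the registration
# would only run once that file has been imported.
@register("sms-spam")
class SMSSpamDataProcessor:
    def get_labels(self) -> List[str]:
        return ["ham", "spam"]


print(load_processor("sms-spam").get_labels())
```

Under this assumption, a task name passed on the command line would only be found if the entry-point script imports the custom files first.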
Assume we have a spam/ham dataset (like the simple SMS Spam dataset), as in the following:
How would you define a pattern Pi(a) for a given input text a, and the resulting verbalizer v?
Another case is a dataset like the Hate Speech Dataset (A Measurement Study of Hate Speech in Social Media), where we have the following labels: _behavior, race, sexualorientation, other, ethnicity, physical, class, religion, defined, and the sentences have been selected by looking at the following patterns:
So how should one define the patterns Pi? My guess is that one could be something like where
Does it make sense? In that case, what would a dataset row look like? Thanks a lot!