timoschick / pet

This repository contains the code for "Exploiting Cloze Questions for Few-Shot Text Classification and Natural Language Inference"
https://arxiv.org/abs/2001.07676
Apache License 2.0

Multiple label token on PET makes error when training #24

Closed: darwinharianto closed this issue 3 years ago

darwinharianto commented 3 years ago

First, thanks for sharing.

I have been trying to build classification models using BERT as the base. Since I am working with a Japanese model, I used the cl-tohoku models from Hugging Face. Training a classifier with Hugging Face and a custom MLP model achieved around 70% accuracy in my case. While looking for ways to improve the performance, I came across this repository.

Since the class labels are verbalized with words that consist of multiple tokens, I used a command like this:

python3 cli.py --method pet \
--pattern_ids 0 1 2 3 4 5 \
--model_type bert \
--model_name_or_path cl-tohoku/bert-base-japanese-whole-word-masking \
--data_dir ../custom-data \
--task_name custom \
--output_dir custom-output \
--do_train \
--do_eval \
--train_examples 500 \
--unlabeled_examples 500 \
--split_examples_evenly \
--pet_per_gpu_train_batch_size 1 \
--pet_per_gpu_unlabeled_batch_size 1 \
--pet_gradient_accumulation_steps 1 \
--pet_max_steps 250 \
--lm_training \
--sc_per_gpu_train_batch_size 1 \
--sc_per_gpu_unlabeled_batch_size 1 \
--sc_gradient_accumulation_steps 1 \
--sc_max_steps 5000 \
--pet_per_gpu_eval_batch_size 1 \
--sc_max_seq_length 512 \
--pet_max_seq_length 512 \
--pet_repetitions 3

At first I used train_examples 10 and only got 0.04 accuracy; when I increased train_examples to 500, I got 0.3 accuracy. Is this expected?

If I read the paper correctly, these train_examples are used to fine-tune the pretrained language models with the patterns, right? The trained models are then used to annotate the unlabeled data, and the resulting soft-labeled data is used to train a final classifier. Am I getting this right?

For the train_examples, I believe the input looks like [CLS, ..., [MASK], SEP]. If this already has a length of 512 and a label consists of multiple tokens, an error is raised. Is there a way to limit the input size so that it does not exceed the maximum length when I have multi-token labels?

timoschick commented 3 years ago

Hi, the input length is automatically reduced to 512 tokens while leaving the tokens corresponding to labels intact. However, you need to define an appropriate PVP that always includes the maximum number of masks required for any label. Could you share your custom PVP?

Edit: As this was missing from the documentation, I've updated the README file accordingly: https://github.com/timoschick/pet#pet-with-multiple-masks

darwinharianto commented 3 years ago

Thanks for the clarification. This is my PVP class; get_parts returns only one self.mask.

class TestPVP(PVP):
    VERBALIZER = {
        "1": ["他"],
        "2": ["西大須"]
    }

    def get_parts(self, example: InputExample) -> FilledPattern:

        text_a = self.shortenable(example.text_a)
        text_b = self.shortenable(example.text_b)

        if self.pattern_id == 0:
            return [self.mask, ':', text_a, text_b], []
        elif self.pattern_id == 1:
            return [self.mask, '問題は:', text_a, text_b], []
        elif self.pattern_id == 2:
            return [text_a, '(', self.mask, ')', text_b], []
        elif self.pattern_id == 3:
            return [text_a, text_b, '(', self.mask, ')'], []
        elif self.pattern_id == 4:
            return ['[ クラス分類:', self.mask, ']', text_a, text_b], []
        elif self.pattern_id == 5:
            return [self.mask, '-', text_a, text_b], []
        else:
            raise ValueError("No pattern implemented for id {}".format(self.pattern_id))

    def verbalize(self, label) -> List[str]:
        return TestPVP.VERBALIZER[label]

Thanks for the documentation.

Regarding "respectively, your get_parts() method should always return a sequence that contains exactly 3 mask tokens": does this mean that I have to repeat self.mask as many times as the maximum number of mask tokens, like this?

class TestPVP(PVP):
    VERBALIZER = {
        "1": ["他"],
        "2": ["西大須"]
    }

    def get_parts(self, example: InputExample) -> FilledPattern:

        text_a = self.shortenable(example.text_a)
        text_b = self.shortenable(example.text_b)

        if self.pattern_id == 0:
            return [self.mask, self.mask, ':', text_a, text_b], []
        elif self.pattern_id == 1:
            return [self.mask, self.mask, '問題は:', text_a, text_b], []
        elif self.pattern_id == 2:
            return [text_a, '(', self.mask, self.mask, ')', text_b], []
        elif self.pattern_id == 3:
            return [text_a, text_b, '(', self.mask, self.mask, ')'], []
        elif self.pattern_id == 4:
            return ['[ クラス分類:', self.mask, self.mask, ']', text_a, text_b], []
        elif self.pattern_id == 5:
            return [self.mask, self.mask, '-', text_a, text_b], []
        else:
            raise ValueError("No pattern implemented for id {}".format(self.pattern_id))

    def verbalize(self, label) -> List[str]:
        return TestPVP.VERBALIZER[label]

timoschick commented 3 years ago

Does this mean that I have to repeat self.mask as many times as the maximum number of mask tokens?

Yes, exactly. I don't know how exactly "西大須" gets tokenized, but if it consists of two tokens, you'll have to insert two masks (as you did in your bottom example).
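One quick way to check this is to tokenize the verbalizations directly. The following is only a minimal sketch, not part of pet, assuming the Hugging Face AutoTokenizer is used (the cl-tohoku Japanese models may also require a tokenizer backend such as fugashi and ipadic to be installed):

from transformers import AutoTokenizer

# Sketch: inspect how each verbalization from the custom PVP is split into word pieces.
tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese-whole-word-masking")

for label, words in {"1": ["他"], "2": ["西大須"]}.items():
    for word in words:
        pieces = tokenizer.tokenize(word)
        # The number of pieces is the number of self.mask entries this label needs;
        # every pattern returned by get_parts() should include the maximum of these counts.
        print(label, word, pieces, len(pieces))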