thunlp / OpenPrompt

An Open-Source Framework for Prompt-Learning.
https://thunlp.github.io/OpenPrompt/
Apache License 2.0

{"mask"} automatically assigns <extra_id_0>, which conflicts with the task of masked filling #234


ChristLBUPT commented 1 year ago

I want to do prompt tuning for a masked-fill-based T5 model, whose inputs look like this:

test_dataset = [
    InputExample(text_a="The quick <extra_id_0> fox <extra_id_1> over the lazy dog", tgt_text="<extra_id_0> brown <extra_id_1> jumps <extra_id_2>"),
    InputExample(text_a="The Capital city of China is <extra_id_0>, which has a <extra_id_1> of 20 million", tgt_text="<extra_id_0> Beijing <extra_id_1> population <extra_id_2>")
]
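For context, T5's masked-filling format pairs each `<extra_id_N>` sentinel in the source with a span in the target. A plain-Python sketch of that pairing (the `merge_spans` helper is hypothetical, not part of OpenPrompt) that splices the target spans back into the source:

```python
import re

def merge_spans(source: str, target: str) -> str:
    """Reconstruct the full sentence by splicing the target's spans
    back into the source's <extra_id_N> sentinel positions."""
    # Map sentinel index -> span text, e.g. {"0": "brown", "1": "jumps"}
    spans = dict(re.findall(r"<extra_id_(\d+)>\s*([^<]*?)\s*(?=<extra_id_|$)", target))
    # Replace each sentinel in the source with its matching span
    return re.sub(r"<extra_id_(\d+)>", lambda m: spans.get(m.group(1), "").strip(), source)

src = "The quick <extra_id_0> fox <extra_id_1> over the lazy dog"
tgt = "<extra_id_0> brown <extra_id_1> jumps <extra_id_2>"
print(merge_spans(src, tgt))  # The quick brown fox jumps over the lazy dog
```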

If I use a template similar to the one given in 2.1_conditional_generation.py, that is:

template = ManualTemplate(t5tokenizer, '{"placeholder": "text_a"} {"special": "<eos>"} {"mask"}')

it automatically assigns an <extra_id_0> at the position of {"mask"} and splits the source sentence from the target sentence with the special token </s>, which results in duplicate <extra_id_0> tokens in the input sentence, as follows:

The Capital city of China is <extra_id_0>, which has a <extra_id_1> of 20 million </s> <extra_id_0>
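A plain-Python sketch of the clash, with no OpenPrompt involved (assuming the template renders {"mask"} as T5's first sentinel and inserts the </s> separator, as described above):

```python
# text_a already uses <extra_id_0> for masked filling.
text_a = "The Capital city of China is <extra_id_0>, which has a <extra_id_1> of 20 million"

# What the template effectively builds: text_a, the </s> separator,
# then the {"mask"} slot, rendered as the first T5 sentinel.
templated = f"{text_a} </s> <extra_id_0>"

# The sentinel <extra_id_0> now appears twice in the model input.
assert templated.count("<extra_id_0>") == 2
```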

I know it is possible to manually increment every extra_id in my dataset by 1, but is it possible to use ONLY the source sentence as input and avoid the automatically added extra ids?
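The renumbering workaround mentioned above can be sketched with a small regex helper (hypothetical, stdlib only, not an OpenPrompt API):

```python
import re

def shift_extra_ids(text: str, offset: int = 1) -> str:
    """Add `offset` to the index of every <extra_id_N> sentinel in `text`,
    so the template's automatically added <extra_id_0> stays unique."""
    return re.sub(
        r"<extra_id_(\d+)>",
        lambda m: f"<extra_id_{int(m.group(1)) + offset}>",
        text,
    )

src = "The quick <extra_id_0> fox <extra_id_1> over the lazy dog"
print(shift_extra_ids(src))
# The quick <extra_id_1> fox <extra_id_2> over the lazy dog
```

Applied to both text_a and tgt_text before building the InputExample, this keeps <extra_id_0> free for the {"mask"} slot.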

yulinchen99 commented 1 year ago

I am not sure I understand your question correctly. It seems that you are not really using {"mask"}, and therefore not using a verbalizer at all. In that case you should probably just use the transformers library directly, without wrapping it in openprompt's PromptModel.