timoschick / pet

This repository contains the code for "Exploiting Cloze Questions for Few-Shot Text Classification and Natural Language Inference"
https://arxiv.org/abs/2001.07676
Apache License 2.0

Roberta-large's BPE tokenizer generates multiple tokens #92

Closed: caidongqi closed this issue 1 year ago

caidongqi commented 2 years ago

Roberta-large uses byte-level Byte-Pair-Encoding, which breaks the standard PET training flow.

For example, Verbalization "Society" does not correspond to a single token, got ['Soc', 'iety']
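A minimal reproduction of the mismatch (illustrative snippet, not from the repo; assumes the Hugging Face transformers tokenizer that PET loads):

# Illustrative only; assumes transformers is installed.
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-large")
ids = tokenizer.encode("Society", add_special_tokens=False)
print(tokenizer.convert_ids_to_tokens(ids))  # per this report: ['Soc', 'iety']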

For now I have simply commented out the assert len(ids) == 1 check in utils.py to force the code to use the first token.

But I don't know whether this affects accuracy. Is there a better alternative, since PET uses Roberta-large by default?

Thanks~

caidongqi commented 2 years ago

Could anyone do me a favor plz...

huchinlp commented 1 year ago

You can try another API: tokenizer.convert_tokens_to_ids(YOUR_TOKEN). Since Roberta is case-sensitive, you may also try lowercase "society".
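For instance, a quick vocabulary check along those lines (hypothetical snippet; tokenizer loaded as in the reproduction above):

# convert_tokens_to_ids looks the string up as a single vocabulary entry
# and returns tokenizer.unk_token_id when no such entry exists.
token_id = tokenizer.convert_tokens_to_ids("society")
print(token_id, token_id == tokenizer.unk_token_id)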

caidongqi commented 1 year ago

Thanks for answering! But using lowercase doesn't work for me 😭 Bug still exists: Verbalization "society" does not correspond to a single token, got ['soc', 'iety']

As for your first suggestion, I don't yet see how to apply it. Here is the related code:

# add_prefix_space is only passed for GPT-2-style tokenizers
kwargs = {'add_prefix_space': True} if isinstance(tokenizer, GPT2Tokenizer) else {}
ids = tokenizer.encode(word, add_special_tokens=False, **kwargs)
if not force_single_token:
    return ids
# PET verbalizers must map to exactly one token id
assert (
    len(ids) == 1
), f'Verbalization "{word}" does not correspond to a single token, got {tokenizer.convert_ids_to_tokens(ids)}'

The Roberta tokenizer converts one word into two tokens (two ids), but vanilla PET can only process a single token, so the assertion finds two ids and raises an error.

Could you please explain more explicitly how to modify it, at your convenience?

Anyway, your suggestion has been a great help to me, thanks again. Best wishes.

huchinlp commented 1 year ago

Hi,

GPT-2 and Roberta tokenizers will recognize the space before a word and replace it with a "Ġ". Actually, "Society" is not a token in the vocab but "ĠSociety" is a valid one. You can call tokenizer.convert_tokens_to_ids("ĠSociety") and the result is 3930.

The only thing you need to do is replace "tokenizer.encode(xxxxx)" with the following lines:

# Fall back to the space-prefixed variant when the bare word is not in the vocab
if tokenizer.convert_tokens_to_ids(word) == tokenizer.unk_token_id:
    space_word = "Ġ" + word
    token_id = tokenizer.convert_tokens_to_ids(space_word)
else:
    token_id = tokenizer.convert_tokens_to_ids(word)
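
For context, here is a sketch of how this could slot into the helper from utils.py (assuming it is PET's get_verbalization_ids; a sketch only, not verified against the repo):

def get_verbalization_ids(word, tokenizer, force_single_token):
    # Sketch only: mirrors the original helper, but replaces tokenizer.encode
    # with vocabulary lookups plus a "Ġ"-prefixed fallback.
    if not force_single_token:
        return tokenizer.encode(word, add_special_tokens=False)
    token_id = tokenizer.convert_tokens_to_ids(word)
    if token_id == tokenizer.unk_token_id:
        # Byte-level BPE vocabularies store word-initial tokens with a leading "Ġ".
        token_id = tokenizer.convert_tokens_to_ids("Ġ" + word)
    assert token_id != tokenizer.unk_token_id, \
        f'Verbalization "{word}" is not a single token in the vocabulary'
    return token_id

Callers that pass force_single_token=True then get back a single id, which is exactly what the original assertion was checking for.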

Refer to this thread for more details: https://discuss.huggingface.co/t/bpe-tokenizers-and-spaces-before-words/475?u=joaogante

Best.

caidongqi commented 1 year ago

That works! Thanks for the solution and reference.🥳

nieallen commented 1 year ago

Hi, how do I train a PET model with xlm-roberta and its byte-level Byte-Pair-Encoding?
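A sketch of one possible adaptation (an assumption, not from this thread: xlm-roberta's tokenizer is SentencePiece-based and marks word-initial pieces with "▁" rather than "Ġ"):

# Sketch only: analogous vocabulary fallback for xlm-roberta.
# Assumption: the SentencePiece vocab stores word-initial pieces with a "▁" prefix.
token_id = tokenizer.convert_tokens_to_ids(word)
if token_id == tokenizer.unk_token_id:
    token_id = tokenizer.convert_tokens_to_ids("▁" + word)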