Closed caidongqi closed 1 year ago
Could anyone do me a favor, please?
You can try another API: tokenizer.convert_tokens_to_ids(YOUR_TOKEN). Since Roberta is case-sensitive, you may also try lowercase "society".
Thanks for answering!
But using lowercase doesn't work for me 😭
Bug still exists: Verbalization "society" does not correspond to a single token, got ['soc', 'iety']
For your first suggestion, I still don't know how it works yet. Here is the related code.
kwargs = {'add_prefix_space': True} if isinstance(tokenizer, GPT2Tokenizer) else {}
ids = tokenizer.encode(word, add_special_tokens=False, **kwargs)
if not force_single_token:
    return ids
assert len(ids) == 1, \
    f'Verbalization "{word}" does not correspond to a single token, got {tokenizer.convert_ids_to_tokens(ids)}'
The Roberta tokenizer converts the word into two tokens (each with its own id), but vanilla PET can only handle a single token, so the assertion finds two ids and raises the error.
Could you please explain more explicitly how to modify it, at your convenience?
Anyway, your suggestion is a great help to me. Thanks again, best wishes.
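For context, the failure above can be mimicked with a toy vocabulary (the dict and ids below are made-up stand-ins for the real Roberta BPE vocab; with the real tokenizer, "society" without a leading space really does split into the two pieces shown in the error message):

```python
# Made-up stand-in for a BPE vocab: "society" is absent as a whole token,
# so it falls back to two sub-word pieces (ids here are invented).
TOY_VOCAB = {"soc": 101, "iety": 102, "Ġsociety": 103}

def toy_encode(word):
    """Greedy longest-match, BPE-style split against TOY_VOCAB."""
    ids, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in TOY_VOCAB:
                ids.append(TOY_VOCAB[word[i:j]])
                i = j
                break
        else:
            raise ValueError(f"cannot tokenize {word!r}")
    return ids

print(toy_encode("society"))   # [101, 102] -> len(ids) == 1 fails
print(toy_encode("Ġsociety"))  # [103]      -> a single token, assertion passes
```

This is exactly why the assertion fires for the bare word but would pass for the space-prefixed form.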
Hi,
GPT-2 and Roberta tokenizers will recognize the space before a word and replace it with a "Ġ".
Actually, "Society" is not a token in the vocab but "ĠSociety" is a valid one.
You can call tokenizer.convert_tokens_to_ids("ĠSociety") and the result is 3930.
The only thing you need to do is replace "tokenizer.encode(xxxxx)" with the following lines:
if tokenizer.convert_tokens_to_ids(word) == tokenizer.unk_token_id:
    space_word = "Ġ" + word
    id = tokenizer.convert_tokens_to_ids(space_word)
else:
    id = tokenizer.convert_tokens_to_ids(word)
Refer to this thread for more details: https://discuss.huggingface.co/t/bpe-tokenizers-and-spaces-before-words/475?u=joaogante Best.
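The fallback above can be exercised against a toy vocabulary (the dict and the unk id below are made-up stand-ins; with the real tokenizer, convert_tokens_to_ids does this lookup, and 3930 is the id the thread reports for "ĠSociety"):

```python
# Made-up stand-in vocab: only the space-prefixed form of "Society"
# exists as a single token, mirroring the real Roberta vocab.
TOY_VOCAB = {"ĠSociety": 3930, "Soc": 101, "iety": 102}
UNK_TOKEN_ID = 0

def convert_tokens_to_ids(token):
    return TOY_VOCAB.get(token, UNK_TOKEN_ID)

def verbalizer_id(word):
    """Try the bare word first; fall back to the 'Ġ'-prefixed form."""
    if convert_tokens_to_ids(word) == UNK_TOKEN_ID:
        return convert_tokens_to_ids("Ġ" + word)
    return convert_tokens_to_ids(word)

print(verbalizer_id("Society"))  # 3930, found via the "ĠSociety" fallback
```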
That works! Thanks for the solution and reference.🥳
Hi, how can I train a PET model with xlm-roberta, which uses byte-level Byte-Pair Encoding?
Roberta-large also uses byte-level Byte-Pair Encoding, and this breaks the standard PET training.
For example,
Verbalization "Society" does not correspond to a single token, got ['Soc', 'iety']
For now I just comment out the assertion
assert len(ids) == 1
in utils.py to force the use of the first token. But I don't know whether this will affect the accuracy. So is there any alternative, since PET uses roberta-large by default?
Thanks~
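As a less drastic alternative to deleting the assertion entirely, one could keep only the first sub-token id when a word splits, instead of raising (a sketch with made-up ids; the function name and the accuracy impact of truncating the verbalizer are both assumptions, not part of PET):

```python
def get_verbalization_ids(word, encode, force_single_token=True):
    """Return the id(s) for a verbalizer word. If a single token is
    required but the word splits into several pieces, fall back to the
    first piece instead of asserting (accuracy impact untested)."""
    ids = encode(word)
    if not force_single_token:
        return ids
    if len(ids) != 1:
        # Truncate to the first sub-token rather than raise.
        return ids[:1]
    return ids

# Stand-in encoder: "Society" splits into two made-up ids, "good" does not.
toy = {"Society": [101, 102], "good": [7]}
print(get_verbalization_ids("Society", toy.__getitem__))  # [101]
print(get_verbalization_ids("good", toy.__getitem__))     # [7]
```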