Thank you for reaching out. \u0120 ("Ġ") is the character that stands for a space in the byte-level BPE tokenizer used by both GPT-2 and RoBERTa, the models we use in our work.
For example, the sentence "This is great" will be tokenized as ['This', 'Ġis', 'Ġgreat']. Indeed, we use the tokens with a leading space because this is how the words "terrible" and "great" typically appear in sentences.
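To make the origin of "Ġ" concrete, here is a minimal standard-library sketch of the byte-to-character table that GPT-2's byte-level BPE builds before any merges are applied. The function names `byte_to_unicode` and `pretokenize` are illustrative, and the splitting regex is a simplified stand-in for the real pre-tokenization pattern, so treat this as an approximation rather than the actual tokenizer. It only shows that the space byte (0x20) is rendered as U+0120 and that a space is attached to the front of the word that follows it.

```python
import re


def byte_to_unicode():
    """Map each of the 256 byte values to a printable character (GPT-2 style)."""
    # Bytes that are already printable keep their own character.
    keep = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = keep[:]
    code_points = keep[:]
    shift = 0
    for b in range(256):
        if b not in keep:
            # Non-printable bytes (including space, 0x20) are shifted past U+00FF.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return dict(zip(byte_values, (chr(c) for c in code_points)))


def pretokenize(text):
    """Split text so each word keeps its leading space, then apply the byte map.

    Assumption: this regex is a simplification for illustration only, and no
    BPE merges are applied afterwards.
    """
    table = byte_to_unicode()
    pieces = re.findall(r" ?\w+| ?[^\w\s]+|\s+", text)
    return ["".join(table[b] for b in piece.encode("utf-8")) for piece in pieces]


print(repr(byte_to_unicode()[0x20]))  # 'Ġ', i.e. "\u0120"
print(pretokenize("This is great"))   # ['This', 'Ġis', 'Ġgreat']
```

Because "great" in the middle of a sentence is preceded by a space, the piece the model sees is "Ġgreat", not "great", which is why the verbalizers carry the \u0120 prefix.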
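Other tokenizers also use special characters to mark word boundaries, e.g., WordPiece (BERT) uses "##" to prefix pieces that continue a word (rather than marking the space itself), and SentencePiece (T5) uses "▁" to represent a space (see this issue for discussion).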
I hope this answers your question, and feel free to follow up if you have more. I'm closing this issue now because it's a clarification question.
Thank you for your patient reply.
verbalizers = ['\u0120terrible', '\u0120great']
Why do we need to add \u0120?