mingkaid / rl-prompt

Accompanying repo for the RLPrompt paper
MIT License

Some doubts about a symbol #10

Closed oujieww closed 1 year ago

oujieww commented 1 year ago

verbalizers = ['\u0120terrible', '\u0120great']

Why do we need to add \u0120?

mingkaid commented 1 year ago

Thank you for reaching out. \u0120 ("Ġ") is the character that the byte-level BPE tokenizer uses to represent a space. This tokenizer scheme is used by both GPT-2 and RoBERTa, the models we use in our work.

For example, the sentence "This is great" will be tokenized as ['This', 'Ġis', 'Ġgreat']. We use the tokens with the leading space because this is how the words "terrible" and "great" typically appear in sentences, i.e., in the middle of a sentence after a space.
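To make the origin of "Ġ" concrete, here is a minimal sketch based on the byte-to-unicode table from the GPT-2 reference tokenizer. The `pretokenize` helper and its regex are simplifying assumptions for illustration: it only splits on words with an optional leading space and does not perform the actual BPE merges, so it is not the real tokenizer.

```python
import re


def bytes_to_unicode():
    """Map each of the 256 byte values to a printable unicode character."""
    # Bytes that are already printable keep their own character.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    # Remaining bytes (control characters, space, ...) are shifted to 256 + n.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


BYTE_TO_CHAR = bytes_to_unicode()


def pretokenize(text):
    # Hypothetical simplification: words with an optional leading space.
    # The real tokenizer uses a richer pattern and then applies BPE merges.
    pieces = re.findall(r" ?\w+", text)
    return ["".join(BYTE_TO_CHAR[b] for b in p.encode("utf-8")) for p in pieces]


# The space byte (32) is the 33rd non-printable byte, so it maps to 256 + 32 = 0x120.
print(hex(ord(BYTE_TO_CHAR[ord(" ")])))  # 0x120
print(pretokenize("This is great"))      # ['This', 'Ġis', 'Ġgreat']
```

If the Hugging Face `transformers` library is installed, `AutoTokenizer.from_pretrained("gpt2").tokenize("This is great")` should give the same three pieces, which is why the verbalizers are written as '\u0120terrible' and '\u0120great'.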

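Other tokenizers also use special characters to mark word boundaries, e.g., WordPiece (BERT) prefixes word-continuation pieces with "##" (so a piece without "##" starts a new word), and SentencePiece (T5) uses "▁" to represent a space (see this issue for discussion).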

I hope this answers your question, and feel free to follow up if you have more. I'm closing this issue now because it's a clarification question.

oujieww commented 1 year ago

Thank you for your patient reply.