Question on encoding the special token <|context|>

salesforce / simpletod

Official repository for "SimpleTOD: A Simple Language Model for Task-Oriented Dialogue"

https://arxiv.org/abs/2005.00796

BSD 3-Clause "New" or "Revised" License


Question on encoding the special token <|context|> #14

Open pleomax0730 opened 3 years ago

pleomax0730 commented 3 years ago

Hi,

I ran the DST training script in VS Code debug mode and found that the <|context|> marker in train.history_belief is encoded as a list of tokens rather than a single token: ['Ġ<', '|', 'context', '|', '>'], with corresponding ids [1279, 91, 22866, 91, 29].

I traced through the tokenizer setup, but as far as I can tell the token "<|context|>" is never explicitly added to the GPT-2 vocabulary.

Did I do something wrong, or is this the expected result?
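To make the observation concrete, here is a minimal toy sketch of why a marker that is not registered with the tokenizer gets split into pieces, while a registered one stays whole. It is not GPT-2's real byte-level BPE: `toy_tokenize` and the `PIECE` pattern are hypothetical stand-ins, simplified so that the unregistered case reproduces the pieces reported above.

```python
import re

# Simplified stand-in for GPT-2's pre-tokenizer (an assumption, NOT the real
# byte-level BPE): runs of letters stay whole, every other non-space
# character becomes its own piece, and a leading space sticks to the piece.
PIECE = re.compile(r" ?[A-Za-z]+| ?[^\sA-Za-z]|\s+")


def toy_tokenize(text, special_tokens=()):
    """Split text into pieces, keeping registered special tokens intact."""
    special_tokens = tuple(special_tokens)
    if special_tokens:
        # Try longer markers first so overlapping markers match greedily.
        ordered = sorted(special_tokens, key=len, reverse=True)
        pattern = "(" + "|".join(re.escape(tok) for tok in ordered) + ")"
        chunks = re.split(pattern, text)
    else:
        chunks = [text]

    pieces = []
    for chunk in chunks:
        if chunk in special_tokens:
            # Registered marker: emitted as a single piece.
            pieces.append(chunk)
        else:
            # Ordinary text: split normally; show spaces as "Ġ" like GPT-2.
            pieces.extend(p.replace(" ", "Ġ") for p in PIECE.findall(chunk))
    return pieces


# Unregistered marker: split into five pieces, as in the debugger.
print(toy_tokenize(" <|context|>"))
# ['Ġ<', '|', 'context', '|', '>']

# Registered marker: kept as one piece.
print(toy_tokenize(" <|context|>", special_tokens=["<|context|>"]))
```

With the Hugging Face tokenizer, the usual way to register such a marker is `tokenizer.add_special_tokens({"additional_special_tokens": ["<|context|>"]})`, followed by `model.resize_token_embeddings(len(tokenizer))` so the model has an embedding for the new id. Whether this repository's training script is meant to do that is exactly the open question here.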