Open colinski opened 1 year ago
Hi there, only the models that have gone through both stages of pretraining have these special tokens. In the tokenizer for our CodeT5+ 110M embedding model, you can find them (see here): `[ENC]`: 32100 (stands for `[Match]`), `[TDEC]`: 32101, and `[CDEC]`: 32102. Some special tokens take a different form in the paper for readability. Specifically, we reuse the `<unk>` token for `[CLM]`, `<s>` for `[CLS]`, and `<extra_id_*>` for `[MASK*]`. Hope this clears up your confusion.
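For anyone who wants to check these ids locally, here is a minimal sketch of the lookup. It is not taken from the CodeT5+ repo: the helper names `special_token_ids` and `load_tokenizer` are hypothetical, the checkpoint name `Salesforce/codet5p-110m-embedding` is an assumption about where the embedding model lives on the Hugging Face Hub, and `[MASK0]` → `<extra_id_0>` is just one instance of the `<extra_id_*>` pattern mentioned above.

```python
from typing import Dict

# Paper name -> string actually stored in the tokenizer, as described in the
# reply above. "[MASK0]" is one illustrative instance of the [MASK*] family.
PAPER_TO_TOKENIZER: Dict[str, str] = {
    "[Match]": "[ENC]",
    "[TDec]": "[TDEC]",
    "[CDec]": "[CDEC]",
    "[CLM]": "<unk>",
    "[CLS]": "<s>",
    "[MASK0]": "<extra_id_0>",
}


def special_token_ids(tokenizer) -> Dict[str, int]:
    """Map each paper-style token name to its id in the given tokenizer.

    `tokenizer` is any object exposing `convert_tokens_to_ids(str) -> int`,
    which Hugging Face tokenizers do.
    """
    return {
        paper_name: tokenizer.convert_tokens_to_ids(token)
        for paper_name, token in PAPER_TO_TOKENIZER.items()
    }


def load_tokenizer(checkpoint: str = "Salesforce/codet5p-110m-embedding"):
    """Load the tokenizer from the Hub (checkpoint name is an assumption)."""
    # Imported lazily so the lookup helper above works without transformers.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
```

Calling `special_token_ids(load_tokenizer())` should then report the ids for each paper-style name; if the reply above holds, `[Match]`, `[TDec]`, and `[CDec]` come back as 32100, 32101, and 32102.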
Hello, I'm working to apply the CodeT5+ model on my own dataset.
In the paper there is discussion of special task specific tokens: [CLM], [MASK*], [CLS], [Match], [CDec], [TDec], etc.
What are the values of these tokens? I can't seem to find them in the GitHub or HuggingFace repo.
Thanks