salesforce / CodeT5

Home of CodeT5: Open Code LLMs for Code Understanding and Generation
https://arxiv.org/abs/2305.07922
BSD 3-Clause "New" or "Revised" License
2.66k stars 391 forks

What are the values of the special tokens ([CLM], [Match], etc)? #124

Open colinski opened 1 year ago

colinski commented 1 year ago

Hello, I'm working to apply the CodeT5+ model on my own dataset.

In the paper there is discussion of special task specific tokens: [CLM], [MASK*], [CLS], [Match], [CDec], [TDec], etc.

What are the values of these tokens? I can't seem to find them in the GitHub or HuggingFace repos.

Thanks

yuewang-cuhk commented 12 months ago

Hi there, only the models that have gone through both stages of pretraining have these special tokens. In the tokenizer for our CodeT5+ 110M embedding model, you can find them (see here): [ENC]: 32100 (stands for [Match]), [TDEC]: 32101, [CDEC]: 32102. Some special tokens are written differently in the paper for readability. Specifically, we reuse the <unk> token for [CLM], <s> for [CLS], and <extra_id_*> for [MASK*]. Hope this clarifies your confusion.
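
For anyone who wants to check this against a tokenizer directly, here is a minimal sketch of the mapping described above. The helper names (`PAPER_TO_TOKENIZER`, `mask_token`, `resolve_ids`, `load_tokenizer`) are made up for illustration, and the checkpoint name `Salesforce/codet5p-110m-embedding` and the `trust_remote_code=True` flag are assumptions about how the 110M embedding tokenizer is published on the Hub, so adjust them to whatever checkpoint you actually use.

```python
# Paper token name -> string actually used in the tokenizer,
# as described in the reply above.
PAPER_TO_TOKENIZER = {
    "[CLM]": "<unk>",    # reused <unk> token
    "[CLS]": "<s>",      # reused <s> token
    "[Match]": "[ENC]",
    "[TDec]": "[TDEC]",
    "[CDec]": "[CDEC]",
}

# IDs quoted in the reply for the CodeT5+ 110M embedding tokenizer.
STATED_IDS = {"[ENC]": 32100, "[TDEC]": 32101, "[CDEC]": 32102}


def mask_token(i: int) -> str:
    """Tokenizer string for the paper's i-th [MASK*] token (a sentinel token)."""
    return f"<extra_id_{i}>"


def resolve_ids(tokenizer):
    """Look up each paper token in a tokenizer.

    `tokenizer` can be any object with a `convert_tokens_to_ids(token)` method,
    e.g. a Hugging Face tokenizer. Returns {paper_name: (tokenizer_token, id)}.
    """
    return {
        name: (tok, tokenizer.convert_tokens_to_ids(tok))
        for name, tok in PAPER_TO_TOKENIZER.items()
    }


def load_tokenizer(checkpoint: str = "Salesforce/codet5p-110m-embedding"):
    """Load the tokenizer from the Hub (assumed checkpoint name; needs network)."""
    from transformers import AutoTokenizer  # imported lazily so the rest runs without it

    return AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
```

Calling `resolve_ids(load_tokenizer())` and comparing the result with `STATED_IDS` should show whether the checkpoint you loaded carries the task tokens. For a model that has not gone through both pretraining stages, `[ENC]`, `[TDEC]` and `[CDEC]` would most likely resolve to the unknown-token ID instead.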