Which tokenizer to use for customized python summary data?

salesforce / CodeT5

Home of CodeT5: Open Code LLMs for Code Understanding and Generation

https://arxiv.org/abs/2305.07922

BSD 3-Clause "New" or "Revised" License


Which tokenizer to use for customized python summary data? #30

Closed blurLake closed 2 years ago

blurLake commented 2 years ago

Hi, I am fine-tuning the CodeT5 base model. I see in exp_with_args.sh that RobertaTokenizer is used for the Python summarization task. However, the data you shared here does not look like it was generated by RobertaTokenizer. RobertaTokenizer encodes a space as Ġ (see e.g. here), but in the data you uploaded there is no Ġ in code_token or in string_token.

Could you comment on this? Thank you very much!

yuewang-cuhk commented 2 years ago

Hi, the shared data is the raw data that will be tokenized by our code-specific RobertaTokenizer (not the original one). We use this function to read summarization data and this unified tokenizing function to convert it to features.
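To illustrate the distinction between raw and tokenized data, here is a minimal sketch of the byte-to-character table used by GPT-2/RoBERTa-style byte-level BPE. It assumes the code-specific tokenizer follows the same byte-level scheme. It does not reproduce the repository's reading or feature-conversion functions, and the sample token list and the helper name `byte_level_view` are hypothetical.

```python
def bytes_to_unicode():
    """Byte -> printable character table used by GPT-2/RoBERTa-style byte-level BPE."""
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    mapping = {b: chr(b) for b in printable}
    shift = 0
    for b in range(256):
        if b not in mapping:
            # Non-printable bytes (including space, byte 32) are shifted past U+0100.
            mapping[b] = chr(256 + shift)
            shift += 1
    return mapping


def byte_level_view(text):
    """Show how a string looks after the byte-level mapping (hypothetical helper)."""
    table = bytes_to_unicode()
    return "".join(table[b] for b in text.encode("utf-8"))


# Hypothetical raw sample: plain tokens, no Ġ anywhere.
raw_code_tokens = ["def", "add", "(", "a", ",", "b", ")", ":"]

# Assumption: raw tokens are joined into one string before tokenization.
source = " ".join(raw_code_tokens)

print(byte_level_view(source))  # defĠaddĠ(ĠaĠ,ĠbĠ)Ġ:
```

Under this scheme, Ġ only appears once the tokenizer has processed the text, so its absence from the raw dataset files is expected.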