Hi,
I am fine-tuning the CodeT5 base model. I see in exp_with_args.sh that RobertaTokenizer is used for the Python summarization task. However, the data you shared here does not look like it was generated by RobertaTokenizer: RobertaTokenizer encodes a space as Ġ (see, e.g., here), but in the data you uploaded there is no Ġ in code_token or in string_token.
Hi, the shared data is the raw data that will be tokenized by our code-specific RobertaTokenizer (not the original one). We use this function to read summarization data and this unified tokenizing function to convert it to features.
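To illustrate why the raw data contains no Ġ, here is a minimal sketch of the byte-to-character table used by GPT-2/RoBERTa-style byte-level BPE. It is not the CodeT5 code, it applies no BPE merges, and the sample tokens are hypothetical. Whether the code-specific tokenizer uses exactly this table is an assumption. The point is only that Ġ is produced when text is encoded, so it never appears in the stored raw tokens.

```python
def bytes_to_unicode():
    """Byte -> printable character table, as in GPT-2-style byte-level BPE."""
    # Bytes that are already printable keep their own character.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    # Every other byte (space, newline, control bytes) is shifted to 256 + n.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


BYTE_TO_CHAR = bytes_to_unicode()


def byte_level_view(text):
    """Show text the way a byte-level BPE vocabulary spells it (no merges)."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


# Hypothetical raw sample, shaped like a list of code tokens in the dataset.
raw_tokens = ["def", "add", "(", "a", ",", "b", ")", ":"]
raw_text = " ".join(raw_tokens)
encoded = byte_level_view(raw_text)

print(raw_text)  # raw data: plain spaces, no marker
print(encoded)   # after byte-level encoding: each space shows up as Ġ
```

The space byte (0x20) is the 33rd byte without a printable form of its own, so it maps to code point 256 + 32 = 288, which is Ġ.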
Could you comment on this? Thank you very much!