salesforce / CodeT5

Home of CodeT5: Open Code LLMs for Code Understanding and Generation
https://arxiv.org/abs/2305.07922
BSD 3-Clause "New" or "Revised" License

tokenizer suggestion #4

Closed pocca2048 closed 2 years ago

pocca2048 commented 2 years ago

Hi, thanks for sharing your great work!

Following the link from the Hugging Face transformers documentation, I think it would be better to save the tokenizer with tokenizer.save rather than tokenizer.save_model.

That is, at https://github.com/salesforce/CodeT5/blob/466b8607fd08bc4bd8847cc6590c801a9c21db23/tokenizer/train_tokenizer.py#L18, change the call to tokenizer.save("tokenizer.json")

Then, you can use transformers.PreTrainedTokenizerFast rather than tokenizers.Tokenizer at https://github.com/salesforce/CodeT5/blob/208acbd759fd8014374387b272647ef7ab4b85e3/tokenizer/apply_tokenizer.py#L3-L6, like this:

tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")
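To illustrate the suggested round trip, here is a minimal sketch that trains a tiny byte-level BPE tokenizer in memory (the toy corpus, vocab size, and special tokens are placeholder assumptions, not CodeT5's actual training setup), saves it with tokenizer.save, and reloads it through PreTrainedTokenizerFast:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers
from transformers import PreTrainedTokenizerFast

# Train a small illustrative BPE tokenizer (toy corpus, not CodeT5's real data).
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()
trainer = trainers.BpeTrainer(vocab_size=300,
                              special_tokens=["<unk>", "<s>", "</s>"])
tokenizer.train_from_iterator(["def add(a, b): return a + b"] * 10, trainer)

# tokenizer.save writes one JSON file containing the full pipeline
# (model, pre-tokenizer, special tokens), unlike save_model, which
# only dumps the vocab/merges files.
tokenizer.save("tokenizer.json")

# Reload directly as a transformers fast tokenizer.
fast = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")
ids = fast.encode("def add(a, b): return a + b")
```

The advantage is that the single JSON file carries the whole tokenization pipeline, so the loading side needs no extra configuration.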
yuewang-cuhk commented 2 years ago

Hi @pocca2048, thanks for the suggestions! We have uploaded CodeT5 to Hugging Face so that you can load our model and tokenizer using:

from transformers import RobertaTokenizer, T5ForConditionalGeneration

tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-small')
model = T5ForConditionalGeneration.from_pretrained('Salesforce/codet5-small')