salesforce / CodeT5

Home of CodeT5: Open Code LLMs for Code Understanding and Generation
https://arxiv.org/abs/2305.07922
BSD 3-Clause "New" or "Revised" License

Doubts regarding tokenizer. #28

BakingBrains closed 2 years ago

BakingBrains commented 2 years ago

Can you please tell me what paths = ['train_code.txt', 'train_doc.txt'] refers to in the train_tokenizer.py file?

Are those the training code and docstring data in txt format?

Also, how can I run inference with the trained model? The training script only generates model.bin; it does not produce a config, tokenizer, etc.

Thank you

yuewang-cuhk commented 2 years ago

Hi, this code is provided to illustrate how we train the code-specific tokenizer using raw text and code corpus. To use this tokenizer and the model config, you can simply use:

from transformers import AutoTokenizer, T5Config
tokenizer = AutoTokenizer.from_pretrained('Salesforce/codet5-base')
config = T5Config.from_pretrained('Salesforce/codet5-base')
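
For the inference part of the question, here is a minimal sketch rather than a confirmed recipe. It assumes that model.bin is a PyTorch state dict saved from a T5ForConditionalGeneration model fine-tuned from Salesforce/codet5-base; the thread does not confirm this. The helper names load_finetuned, generate, and strip_module_prefix are hypothetical. The prefix helper only matters if the checkpoint was saved from a DataParallel-wrapped model.

```python
def strip_module_prefix(state_dict):
    """Drop a leading 'module.' from keys (left by DataParallel wrappers, if any)."""
    prefix = "module."
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }


def load_finetuned(checkpoint_path, base="Salesforce/codet5-base"):
    # Heavy imports are kept local so the helper above can be used on its own.
    import torch
    from transformers import AutoTokenizer, T5Config, T5ForConditionalGeneration

    # Tokenizer and config come from the released checkpoint, as in the reply above.
    tokenizer = AutoTokenizer.from_pretrained(base)
    config = T5Config.from_pretrained(base)

    # Assumption: the architecture matches the base config, so the weights fit.
    model = T5ForConditionalGeneration(config)
    state_dict = torch.load(checkpoint_path, map_location="cpu")
    model.load_state_dict(strip_module_prefix(state_dict))
    model.eval()
    return tokenizer, model


def generate(tokenizer, model, source, max_length=64):
    # Encode the input, run generation, and decode the first returned sequence.
    input_ids = tokenizer(source, return_tensors="pt").input_ids
    output_ids = model.generate(input_ids, max_length=max_length)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Under these assumptions, calling load_finetuned("model.bin") and passing the returned tokenizer and model to generate would run a single prediction. If the model was trained with a different architecture or a custom vocabulary, the config and tokenizer must match that setup instead.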

BakingBrains commented 2 years ago

@yuewang-cuhk I think you misunderstood my question. Could you please read it again? I understand how to load the tokenizer and config from transformers. I was asking about custom training and a custom tokenizer.

Thanks
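
For the custom tokenizer part of the question, here is a minimal sketch rather than the repository's actual script. It assumes that train_code.txt and train_doc.txt are plain-text files holding raw code and raw docstrings respectively, and that a byte-level BPE tokenizer from the Hugging Face tokenizers library is wanted. The helper names write_corpus and train_tokenizer, the vocabulary size, the minimum frequency, and the special-token list are all illustrative assumptions.

```python
from pathlib import Path


def write_corpus(examples, path):
    """Write each example to a plain-text file, one after another."""
    path = Path(path)
    path.write_text("".join(example + "\n" for example in examples), encoding="utf-8")
    return path


def train_tokenizer(paths, out_dir, vocab_size=32000, min_frequency=2):
    # Imported locally so write_corpus works without the tokenizers package.
    from tokenizers import ByteLevelBPETokenizer

    tokenizer = ByteLevelBPETokenizer()
    tokenizer.train(
        files=[str(p) for p in paths],
        vocab_size=vocab_size,        # illustrative value, not confirmed by the thread
        min_frequency=min_frequency,  # illustrative value
        special_tokens=["<pad>", "<s>", "</s>", "<unk>", "<mask>"],
    )

    # save_model writes vocab.json and merges.txt into out_dir.
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    tokenizer.save_model(str(out_dir))
    return tokenizer
```

With the two corpus files in place, train_tokenizer(['train_code.txt', 'train_doc.txt'], 'my_tokenizer') would produce the vocabulary and merges files. A model trained with such a tokenizer needs a config whose vocab_size matches the new vocabulary.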