Closed: BakingBrains closed this issue 2 years ago
Hi, this code is provided to illustrate how we trained the code-specific tokenizer on a raw text and code corpus. To use this tokenizer and the model config, you can simply run:
from transformers import AutoTokenizer, T5Config
tokenizer = AutoTokenizer.from_pretrained('Salesforce/codet5-base')
config = T5Config.from_pretrained('Salesforce/codet5-base')
@yuewang-cuhk I don't think you're getting what I asked. Can you please read it again? I understood the part about loading it from transformers. I was asking about custom training and the tokenizer.
Thanks
Can you please tell me what this is:
paths = ['train_code.txt', 'train_doc.txt']
in the train_tokenizer.py file?
Are those the training code data and the docstring data in txt format?
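For readers with the same question, here is a minimal sketch of how two such files could be produced. It assumes that `train_code.txt` and `train_doc.txt` are plain UTF-8 text files holding raw code and raw docstrings respectively; the thread does not confirm this layout. The helper name `write_corpus` and the `(code, docstring)` pair format are hypothetical and not taken from the repository.

```python
from pathlib import Path


def write_corpus(samples, code_path="train_code.txt", doc_path="train_doc.txt"):
    """Write (code, docstring) pairs to two plain-text corpus files.

    ASSUMPTION: the tokenizer training script only needs raw text, so each
    sample is written as-is, with samples separated by a newline.
    """
    code_text = "\n".join(code for code, _ in samples) + "\n"
    doc_text = "\n".join(doc for _, doc in samples) + "\n"

    Path(code_path).write_text(code_text, encoding="utf-8")
    Path(doc_path).write_text(doc_text, encoding="utf-8")

    # Return the paths in the same order as the `paths` list in the question.
    return [str(code_path), str(doc_path)]
```

Calling `write_corpus([("def add(a, b):\n    return a + b", "Add two numbers.")])` would write both files to the current directory and return a list that could be used as `paths`.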
Also, how can I run inference using the trained model? The training script only generates model.bin; it does not produce a config, tokenizer, etc.
Thank you
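One possible way to run inference from a bare weights file is sketched below. It rests on assumptions the thread does not confirm: that `model.bin` is a PyTorch state dict saved with `torch.save(model.state_dict(), ...)`, that the weights belong to a `T5ForConditionalGeneration` with the same architecture as `Salesforce/codet5-base`, and that the stock tokenizer and config from that checkpoint are the ones used in training. The function names `load_finetuned` and `generate_text` are hypothetical.

```python
def load_finetuned(checkpoint_path, base_name="Salesforce/codet5-base"):
    """Rebuild a model from a weights-only checkpoint plus the base config."""
    # Imported here so the sketch can be imported without these packages installed.
    import torch
    from transformers import AutoTokenizer, T5Config, T5ForConditionalGeneration

    # ASSUMPTION: tokenizer and config are unchanged from the base checkpoint.
    tokenizer = AutoTokenizer.from_pretrained(base_name)
    config = T5Config.from_pretrained(base_name)

    # Build an uninitialized model with the base architecture, then load weights.
    model = T5ForConditionalGeneration(config)
    # ASSUMPTION: the file holds a plain state dict whose keys match this model.
    state_dict = torch.load(checkpoint_path, map_location="cpu")
    model.load_state_dict(state_dict)
    model.eval()
    return tokenizer, model


def generate_text(tokenizer, model, source, max_length=64):
    """Encode one input string, generate, and decode the first output."""
    inputs = tokenizer(source, return_tensors="pt")
    output_ids = model.generate(inputs.input_ids, max_length=max_length)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

With those assumptions, `tokenizer, model = load_finetuned("model.bin")` followed by `generate_text(tokenizer, model, "def add(a, b): return a + b")` would produce a decoded prediction. If the checkpoint was saved from a wrapper class, the state dict keys would need to be adapted before loading.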