how to make tfrecord with a released tokenizer

salesforce / jaxformer

Minimal library to train LLMs on TPU in JAX with pjit().

BSD 3-Clause "New" or "Revised" License

how to make tfrecord with a released tokenizer #17

Open HaebinShin opened 1 year ago

HaebinShin commented 1 year ago

Hi @enijkamp, I am trying to create TFRecords to fine-tune on my own dataset, but I am confused about which tokenizer to use. I want to build them with your released tokenizer, but 4_create_tf_records.py uses either the GPT-2 tokenizer or a custom tokenizer produced by 3_train_tokenizer.py.

If I want to use your official tokenizer, is it correct to change this line to 'Salesforce/codegen-xxB-xxx'?
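For context, here is a rough sketch of what I have in mind. It is not the actual code of 4_create_tf_records.py. It assumes the released tokenizer can be loaded with `AutoTokenizer.from_pretrained` from Hugging Face `transformers`, and that records hold fixed-length token-id sequences. The helper names (`load_encoder`, `pack_sequences`, `write_tfrecord`), the packing scheme, and the feature key `"text"` are my own guesses and may not match what jaxformer expects.

```python
from typing import Callable, Iterable, Iterator, List


def load_encoder(name: str) -> Callable[[str], List[int]]:
    """Return a text -> token-id function backed by a Hugging Face tokenizer."""
    # Imported lazily so the packing logic below runs without `transformers`.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(name)
    return lambda text: tokenizer.encode(text)


def pack_sequences(
    docs: Iterable[str],
    encode: Callable[[str], List[int]],
    seq_len: int,
    eos_id: int,
) -> Iterator[List[int]]:
    """Tokenize documents, separate them with eos_id, yield seq_len-sized chunks.

    Assumption: a trailing chunk shorter than seq_len is dropped.
    """
    buffer: List[int] = []
    for doc in docs:
        buffer.extend(encode(doc))
        buffer.append(eos_id)
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]
            buffer = buffer[seq_len:]


def write_tfrecord(path: str, sequences: Iterable[List[int]], key: str = "text") -> None:
    """Write each token-id sequence as one tf.train.Example.

    Assumption: the feature key "text" is hypothetical, not taken from jaxformer.
    """
    import tensorflow as tf  # lazy import, only needed when writing records

    with tf.io.TFRecordWriter(path) as writer:
        for ids in sequences:
            feature = tf.train.Feature(int64_list=tf.train.Int64List(value=ids))
            example = tf.train.Example(
                features=tf.train.Features(feature={key: feature})
            )
            writer.write(example.SerializeToString())
```

Usage would be something like `encode = load_encoder('Salesforce/codegen-xxB-xxx')`, with the placeholder replaced by a real checkpoint name, followed by `write_tfrecord("train.tfrecords", pack_sequences(docs, encode, seq_len, eos_id))`. The values of `seq_len` and `eos_id` would have to match the model being fine-tuned.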