milmor / GPT

Implementation of Generative Pretrained Transformer Model in Tensorflow / Keras
MIT License

LTR language #1

Closed toozande closed 1 year ago

toozande commented 1 year ago

I want to build my own GPT language model for Persian and Arabic. Can I use this? As you know, languages like Persian and Arabic differ in structure. What about the wiki_en_vocab file? Why is ## used before words in wiki_en_vocab? Regards

milmor commented 1 year ago

Hi,

Sure, but you will need to create your own vocabulary. Also, if you have limited computational resources, adjust line 90 in train.py to speed up the vocabulary-building process.

wiki_en_vocab is built on the English Wikipedia dataset; I don't know if it would work well for Persian and Arabic, but you can try.

"##" works to indicate prefixes or suffixes

toozande commented 1 year ago

Thanks, but how can I use my own dataset instead of the OpenWebText dataset? When I run python train.py --model_dir=model --ds_name='data.txt' on my own data.txt file I get the error: "ValueError: Parsing builder name string 'data.txt' failed. The builder name string must be of the following format: dataset_name[/config_name][:version][/kwargs]". What dataset format must I use? Regards

milmor commented 1 year ago

The model reads datasets from TensorFlow Datasets (https://www.tensorflow.org/datasets?hl=en), so --ds_name expects a TFDS builder name, not a file path.

To use your own txt file, you must define your own input pipeline.
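A minimal sketch of such a pipeline using tf.data rather than a TFDS builder; the function name, file path, and batch size are placeholders, not part of the repo:

```python
import tensorflow as tf


def text_pipeline(path, batch_size=8):
    """Read a plain-text file with one training example per line."""
    ds = tf.data.TextLineDataset(path)                        # one string tensor per line
    ds = ds.filter(lambda line: tf.strings.length(line) > 0)  # drop blank lines
    ds = ds.shuffle(10_000)                                   # shuffle within a buffer
    ds = ds.batch(batch_size)
    return ds.prefetch(tf.data.AUTOTUNE)
```

You would then swap this in for the tfds.load-based loading in train.py and run your vocabulary/tokenization steps on the resulting dataset of raw strings.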