Closed · toozande closed this 1 year ago
Hi,
Sure, but you will need to build your own vocabulary. Also, if you have limited computational resources, adjust line 90 in train.py to speed up the vocabulary-building step.
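If it helps, here is a rough sketch of how a WordPiece-style vocabulary can be built with the tensorflow_text tooling. This is not necessarily what train.py does; the file name, vocab size, and reserved tokens below are placeholders, and subsampling the corpus (the .take() call) is the usual trick on limited hardware:

```python
import tensorflow as tf
from tensorflow_text.tools.wordpiece_vocab import (
    bert_vocab_from_dataset as bert_vocab,
)

# Placeholder corpus: one sentence per line. Subsample so the
# vocab-building pass stays cheap on limited hardware.
lines = tf.data.TextLineDataset("data.txt").take(100_000)

vocab = bert_vocab.bert_vocab_from_dataset(
    lines.batch(1000).prefetch(tf.data.AUTOTUNE),
    vocab_size=32_000,  # placeholder size
    reserved_tokens=["[PAD]", "[UNK]", "[BOS]", "[EOS]"],  # placeholders
    bert_tokenizer_params={"lower_case": False},  # keep Persian/Arabic script intact
)

# Write one token per line, the usual vocab file layout.
with open("my_vocab.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(vocab))
```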
wiki_en_vocab is built on the English Wikipedia dataset; I don't know how well it would work for Persian and Arabic, but you can try.
"##" works to indicate prefixes or suffixes
Thanks, but how can I use my own dataset instead of the OpenWebText dataset? When I run python train.py --model_dir=model --ds_name='data.txt' on my own data.txt file, I get the error: "ValueError: Parsing builder name string 'data.txt' failed. The builder name string must be of the following format: dataset_name[/config_name][:version][/kwargs]". What format must my dataset be in, and how do I use it? Regards
The model reads datasets from TensorFlow Datasets (https://www.tensorflow.org/datasets?hl=en), which is why passing a raw file path to --ds_name fails.
To use your own txt file, you must define your own input pipeline, or write a custom TFDS builder (https://www.tensorflow.org/datasets/add_dataset) so that --ds_name works as-is.
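For a quick experiment you can skip TFDS entirely and build a tf.data pipeline straight from the file. A minimal sketch, assuming the rest of train.py can consume any tf.data.Dataset of text batches (the path and batch size are placeholders):

```python
import tensorflow as tf

def text_file_dataset(path="data.txt", batch_size=8):
    """Stream raw lines from a local text file instead of a TFDS builder."""
    ds = tf.data.TextLineDataset(path)
    ds = ds.filter(lambda line: tf.strings.length(line) > 0)  # drop blank lines
    ds = ds.shuffle(buffer_size=10_000)
    ds = ds.batch(batch_size)
    return ds.prefetch(tf.data.AUTOTUNE)
```

You would still need to tokenize these batches into whatever format the training loop expects; the custom TFDS builder is the cleaner long-term route.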
I want to build my own GPT language model for Persian and Arabic. Can I use this repo for that? As you know, languages like Persian and Arabic have a different structure. Also, what about the wiki_en_vocab file? Why is "##" used before words in wiki_en_vocab? Regards