Training on tokens available in a text file: how can I achieve the best model?

Dear all,

I am new to the field of NLP. I found the transformers library, which works amazingly well for generating text.

I came across your post about how to train a model for a new language using transformers.

Based on that, I have a question. I already have a large set of programming-language tokens, gathered from many repositories and stored in a text file (.txt), separated by space characters. I would like to first train a tokenizer model, as you recommend, and then use the transformers script run_lm_finetuning to fine-tune the model, as you suggested.
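
To make the data format concrete, here is a minimal sketch of how such a file could be inspected before training a tokenizer. It assumes a UTF-8 file of whitespace-separated tokens; `count_tokens` and `vocab_covering` are hypothetical helper names for illustration, not part of transformers or ru_transformers. The frequency counts may help when picking a vocabulary size.

```python
from collections import Counter


def count_tokens(path):
    """Count whitespace-separated tokens in a UTF-8 text file."""
    counts = Counter()
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            # split() with no argument splits on any run of whitespace,
            # so spaces, tabs and newlines all act as separators.
            counts.update(line.split())
    return counts


def vocab_covering(counts, coverage=0.99):
    """Return the most frequent tokens that together reach `coverage`
    (a fraction between 0 and 1) of all token occurrences."""
    total = sum(counts.values())
    vocab, covered = [], 0
    for token, freq in counts.most_common():
        if total == 0 or covered / total >= coverage:
            break
        vocab.append(token)
        covered += freq
    return vocab
```

Calling `vocab_covering(count_tokens("tokens.txt"))` (the file name is a placeholder) gives a rough idea of how many distinct tokens account for almost all of the corpus.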

For this purpose, what changes do I need to make to the code, and how can I achieve the best model? Please advise.

mgrankin / ru_transformers
