minimaxir / aitextgen

A robust Python tool for text-based AI training and generation using GPT-2.
https://docs.aitextgen.io
MIT License

Documentation expansion - Trained set produces segments of sentences from the model #194

Open ephemeralfuture opened 2 years ago

ephemeralfuture commented 2 years ago

Hello. I hope this issue is not a duplicate – I have looked around and spotted similar issues (like the short-chunks-of-text issue) but have not found this one. I am really seeking advice here, because I seem to be doing something wrong rather than there being anything wrong with the program, so I would really appreciate the help. I guess this might also highlight some baby steps missing in the documentation, as I think I have read through everything by now and still feel a little lost on the training.

I am following the guide for training a GPT-2 model from scratch. So far, I seem to have successfully trained a model on a small dataset (about 5,000 words) as an experiment. The text is taken from some old newspaper articles and saved as a .txt file, similar to the Shakespeare-script example. So far, so good.
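For reference, the from-scratch recipe in the aitextgen documentation looks roughly like the sketch below. File names are placeholders (`input.txt` stands in for the training text; `aitextgen.tokenizer.json` is the file `train_tokenizer` writes by default), and the step counts are illustrative, not recommendations:

```python
from aitextgen import aitextgen
from aitextgen.TokenDataset import TokenDataset
from aitextgen.tokenizers import train_tokenizer
from aitextgen.utils import GPT2ConfigCPU

file_name = "input.txt"  # plain-text training file (placeholder name)

# Train a custom BPE tokenizer on the text;
# this writes aitextgen.tokenizer.json to the current directory.
train_tokenizer(file_name)
tokenizer_file = "aitextgen.tokenizer.json"

# A small GPT-2 config suitable for CPU training on tiny datasets.
config = GPT2ConfigCPU()
ai = aitextgen(tokenizer_file=tokenizer_file, config=config)

# Encode the text into fixed-size blocks of tokens.
data = TokenDataset(file_name, tokenizer_file=tokenizer_file, block_size=64)

# Train from scratch. Note: on a ~5,000-word corpus the model will
# memorize the text long before a step count like this completes.
ai.train(data, batch_size=8, num_steps=5000, generate_every=1000)

ai.generate(5, prompt="This court was held")
```

This is a sketch under the assumption that aitextgen is installed and a training file exists at the given path.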

It took me a little while to get this up and working, and I think I have understood the documentation well enough, although I am not very familiar with NLP systems so might be missing something. My understanding of what the tokeniser is doing, for example, is very limited.

I (hope I) have repeat-trained this data-set a few times, and have generated files of text after training. However, when the text is generated, I only seem to get sentences from within my data-set. e.g. if the input includes a paragraph that begins with the sentence "The news reports that today rain has fallen across the country." the output reads something like: "The news reports that today rain has falle". And this then repeats itself several times, before generating other similar results that are just chunks of sentences from somewhere in the data-set.

So my question: Is there any documentation on the formatting of text that you need to feed into the system, and does it seem like this is what I am doing wrong? Alternatively, is this just too small a data-set? I had expected it was small, but thought it would still successfully produce something, even if totally gibberish. But repeating sentences from within my data-set implies that I've gone wrong somewhere, and I'm struggling to find a way around this in the documentation.

Thank you for any help, and if you need any more description I'd be happy to try to provide.

Apologies if I am posting in the wrong place; please feel free to point me to a support forum if there is one.

tientr commented 2 years ago

When the training loss reaches 0, the model has memorized everything in the dataset, so the input prompt should be different from the dataset. A large dataset is necessary to generate anything impressive.

For text formatting, it all depends on your use case, but if you need a reference, you can format it as: <|endoftext|>sample 1<|endoftext|>sample 2<|endoftext|> Or use the line_by_line format.
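To make that concrete, here is a minimal sketch (plain Python, with made-up sample strings standing in for real articles) of producing the delimiter-separated format:

```python
# Hypothetical article texts standing in for real training samples.
samples = ["First article text.", "Second article text."]

# <|endoftext|> is GPT-2's document-separator token; placing it between
# samples tells the model where one document ends and the next begins.
EOT = "<|endoftext|>"
corpus = EOT + EOT.join(samples) + EOT

# corpus can now be written to the .txt file used for training.
print(corpus)
```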

ephemeralfuture commented 2 years ago

Thank you for the answer.

I now understand that I need to increase dataset size, and need to run the training until the loss is as close to 0 as possible. Thank you.

With regard to formatting – I am not sure I understand. Is the <|endoftext|> tag XML? At present, the document I have is a plain text file, which I have used because the input.txt sample file of Shakespeare is also plain text. I can reformat to any database format if this is needed, but I haven't really understood this from the documentation or from what you have written.

To explain by example: I am using news articles. These are split into large chunks of paragraph text, e.g. (this is one of about 100 articles, but I can get more):

  1. Newspaper
  2. INTERESTING TOWN TENANT'S CASES
  3. OCCUPIERS RIGHTS UNAFFECTED BY
  4. QUESTIONS OF REPAIRS
  5. This court was held on Tuesday before Messrs. D. Molloy (presiding) and P. Cowley
  6. HOUSE POSSESSION
  7. Dr. J Coolican was granted a decree for possession of a house held by James Mason, Garden Street, who, it was stated, owed 54 weeks' rent.
  8. Mr. P.J. Mulligan appeared for the applicant and the defendant did not appear.
  9. Mr. R. Burke applied for possession of a house held by Mrs. Mary Monaghan at Barnadarrig.
  10. Mr. Huggard appeared for the applicant, and Mr. P.J. Mulligan, on behalf of the Town Tenant's Association, defended the case.

The numbers here are just used to illustrate that this is line-by-line as it is stored in the document. Currently, this will produce results something like this:

(PROMPT): This court was held on Tuesday before Messrs. D. Molloy (pre
(PROMPT): This court was held on Tuesday before Messrs. D. Molloy (pre
(PROMPT): This court was held on Tuesday before Messrs. D. Molloy (pre
(PROMPT): Mr. R. Burke applied for possession of a house h
(etc.)

So something seems off about what I'm doing, and I just want to check: is it necessary to format this text with tags to mark the end of a line and the end of a text "chunk" (in this case, a news story), or should each line be separated by <|endoftext|>? Or have I again misunderstood?
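For what it's worth, one way to apply the chunk-per-article idea to text stored line by line is sketched below (with two hypothetical articles hard-coded as line lists; in practice they would be read from files). The line breaks inside each article are kept, and <|endoftext|> marks only the article boundaries:

```python
# Two hypothetical articles, each stored as a list of lines,
# like the numbered excerpt above.
article_1 = [
    "Newspaper",
    "INTERESTING TOWN TENANT'S CASES",
    "This court was held on Tuesday before Messrs. D. Molloy (presiding) and P. Cowley",
]
article_2 = [
    "ANOTHER HEADLINE",
    "Body text of a second article.",
]

# Rejoin each article into one chunk, then separate the chunks
# with GPT-2's document-separator token.
EOT = "<|endoftext|>"
chunks = ["\n".join(article) for article in [article_1, article_2]]
training_text = EOT.join(chunks)
print(training_text)
```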

Apologies again if this is not the right forum for this question. If it would be of any use to your documentation, I am happy to write this up as a use case once I manage to get it working.