karpathy / nanoGPT

The simplest, fastest repository for training/finetuning medium-sized GPTs.
MIT License

Summarization Task Special Token #267

Open nashcaps2255 opened 1 year ago

nashcaps2255 commented 1 year ago

Hi all, I trained a ~355M-parameter GPT from scratch; the model performs quite well at text entailment.

I finetuned the model to perform text summarization, structuring the data like so:

Text(s) to summarize <|summarize|> summary <|endoftext|>

It is my understanding that this finetuned model, at inference, when faced with text followed by the summarization special token, should at least somewhat understand the task. However, the model seems to be treating the summarization token as an end-of-text token: keeping the seed and other parameters constant, it generates the exact same 'summary' no matter what text comes before the <|summarize|>.

  1. Why would the model treat this token as an end-of-text token?
  2. Is there an additional step I should take other than including the token in the training data and adding it to allowed_special in sample.py? (A quick tokenizer check is sketched below.)
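For context on point 2, here is a quick check (a sketch, not code from the repo; it only assumes the stock GPT-2 tiktoken encoding that sample.py loads) of how that encoding treats the new token: <|summarize|> is not registered as a special token, so it is split into ordinary BPE pieces rather than mapped to one dedicated id.

```python
# Sketch: inspect how the stock GPT-2 tiktoken encoding handles the custom token.
import tiktoken

enc = tiktoken.get_encoding("gpt2")

# <|endoftext|> is a registered special token, so it maps to a single id (50256).
print(enc.encode("<|endoftext|>", allowed_special={"<|endoftext|>"}))

# <|summarize|> is not registered, so it becomes several ordinary BPE pieces.
ids = enc.encode("<|summarize|>", allowed_special={"<|endoftext|>"})
print(ids, [enc.decode([i]) for i in ids])
```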
sjm1992st commented 1 year ago

I think you could try adding your special token to the encode/decode code and then finetuning.

nashcaps2255 commented 1 year ago

Where would I add that, in train.py? I added <|summarize|> to sample.py under 'allowed_special'.

At a high level, I'm confused about why the GPT treats it exactly the same as an end-of-text token. You would think that if something were wrong with the encoding, it would treat it as a normal token, not exactly like the end-of-text token.

nashcaps2255 commented 1 year ago

Should each example have padding? Maybe that is the issue?

sjm1992st commented 1 year ago

> Where would I add that, in train.py? I added <|summarize|> to sample.py under 'allowed_special'.
>
> At a high level, I'm confused about why the GPT treats it exactly the same as an end-of-text token. You would think that if something were wrong with the encoding, it would treat it as a normal token, not exactly like the end-of-text token.

Did you add your special tokens before finetuning? You could add them in train.py after it loads 'meta.pkl'.
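One way to do that, if your pipeline uses GPT-2 BPE rather than a char-level meta.pkl, is sketched below: register <|summarize|> as an extra special token on top of the GPT-2 encoding (following the extension pattern from the tiktoken README, not anything already in nanoGPT) and use that encoding both when preparing the finetuning data and in sample.py. The id 50257 is an assumption: it is the first free id after GPT-2's 0..50256 vocab, so the model's token embedding must be at least that large (nanoGPT's from-scratch default pads vocab_size to 50304, which leaves room), and the new row starts out untrained until finetuning.

```python
# Sketch: extend the GPT-2 tiktoken encoding with a <|summarize|> special token.
# The new id (50257) is an assumed choice -- the first id past GPT-2's vocab.
import tiktoken

base = tiktoken.get_encoding("gpt2")
enc = tiktoken.Encoding(
    name="gpt2_summarize",
    pat_str=base._pat_str,
    mergeable_ranks=base._mergeable_ranks,
    special_tokens={**base._special_tokens, "<|summarize|>": 50257},
)

# With the extended encoding, the custom token maps to a single consistent id.
sample = "Text to summarize <|summarize|> summary <|endoftext|>"
ids = enc.encode(sample, allowed_special="all")
print(ids)
print(enc.decode(ids))
```

Encoding the finetuning data with this `enc`, and using the same encode/decode (with <|summarize|> in allowed_special) in sample.py, keeps <|summarize|> as one id instead of a run of ordinary sub-word pieces.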

nashcaps2255 commented 1 year ago

No, I didn't; I will try this.

I was under the (perhaps false) impression that with enough examples the GPT would 'learn' the token.

VatsaDev commented 1 year ago

@nashcaps2255

You are not under a false impression; you probably can do the above task, but it depends a lot on your model, hyperparameters, and dataset size. I have a conversational model based on nanoGPT, using the same base model as you (gpt2-medium), and I use the tokens <human>, <bot>, and <endOfText>.

The model doesn't really treat one token like another, although my tokens probably show up far more often than yours. Definitely try increasing your dataset size, and maybe try GPT2-XL. It's most likely a dataset-size problem: when I first started training on a small dataset, I had the same issue, but as I increased the dataset size, the problem disappeared.

joHussien commented 8 months ago

@nashcaps2255 I hope my message finds you well. I am working on a very similar application to what you were trying to do, but I am facing some problems. Were you able to solve yours and finetune the model for summarization? Thank you.

kbmmoran commented 7 months ago

Hey, different account, same person :)

But yes, I was. The issue was with my dataset: the data I was using wasn't enough of a 1-to-1 summary. Once I added outside data (i.e., a large summarization dataset) to the finetuning mix, the model was able to learn the task quite effectively.

So essentially what @VatsaDev said was correct: it was a dataset-size issue.