openai / gpt-2

Code for the paper "Language Models are Unsupervised Multitask Learners"
https://openai.com/blog/better-language-models/

enc.encoder["<|endoftext|>"] is wrong and nobody realizes it. #222

Open shawwn opened 4 years ago

shawwn commented 4 years ago

Relevant tweet chain:

https://twitter.com/theshawwn/status/1208169319223480322

https://twitter.com/theshawwn/status/1208171700057186304

Basically, you think you're prompting the model with <|endoftext|> (a single token, BPE id 50256), but the BPE encoder actually encodes the string <|endoftext|> as <| end of text |>, five separate tokens. It's completely different.
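
Here's a minimal sketch of the mismatch, assuming this repo's src/encoder.py is importable and a model has been downloaded; paths are illustrative, and older revisions of encoder.py expose get_encoder(model_name) without the models_dir argument:

```python
# Minimal sketch of the mismatch; assumes src/ is on the path and the
# 345M model has been downloaded (paths are illustrative).
import encoder  # this repo's src/encoder.py

enc = encoder.get_encoder('345M', models_dir='models')

single = enc.encoder['<|endoftext|>']  # the reserved token id, 50256
multi = enc.encode('<|endoftext|>')    # BPE of the literal string

print(single)             # 50256
print(multi)              # a list of several ordinary token ids, not [50256]
assert multi != [single]
```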

shawwn commented 4 years ago

My main question is, were the OpenAI models trained with <|endoftext|> (a single token separating each document), or <| end of text |>, which is how the BPE encoder generates it?

shawwn commented 4 years ago

Update: It turns out OpenAI trained their models with documents separated by the single-token <|endoftext|>, whereas most fine-tuning code is based on nshepperd's repo (https://github.com/nshepperd/gpt-2), which typically just runs text through the BPE encoder, and the BPE encoder produces <| end of text |> as five separate tokens.
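
If you're preparing your own fine-tuning data, here's a hedged sketch of the single-token approach: join documents with the reserved token id instead of BPE-encoding the literal string. encoder.get_encoder follows this repo's src/encoder.py; the file list and numpy output are illustrative.

```python
# Hedged sketch: separate documents with the single <|endoftext|> token id,
# the way the released models were trained. File names are hypothetical.
import numpy as np
import encoder  # this repo's src/encoder.py

enc = encoder.get_encoder('345M', models_dir='models')
eot = enc.encoder['<|endoftext|>']  # 50256, the true document separator

tokens = []
for path in ['doc_a.txt', 'doc_b.txt']:  # hypothetical corpus files
    with open(path, encoding='utf-8') as f:
        tokens.extend(enc.encode(f.read()))
    tokens.append(eot)  # one token between documents, not five

np.save('dataset.npy', np.array(tokens, dtype=np.int32))
```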

inspire22 commented 4 years ago

Thanks! Any suggestions on how to hack/patch the encoder to handle this properly? Or, if fine-tuning is sufficient, could we just use END or something as a separator? Would that even be a single token? Or is the use of multiple tokens really that bad? Using this token as a truncation stop does work, so at least it's being returned properly by generation...

maxiedaniels commented 4 years ago

Whoa I think I just ran into this issue. Would really appreciate any help!!

ErikUden commented 3 years ago

So how does one stop the <|endoftext|> token from being randomly generated after just one sentence? Surely this can't be what the developers intended, as it makes the "length" variable meaningless: truncating the text at the first <|endoftext|> yields outputs of random, and mostly too-short, lengths.

DaveXanatos commented 3 years ago

I wrote a patch such that if the output contains "<|endoftext|>", I just rerun the whole batch. The reason is that when <|endoftext|> shows up, everything following it usually has no relation to the input prompt.

For my conversational robots, I have it truncate everything from <|endoftext|> onward and then state "I feel I should have something more to say here, but I'm not sure how to proceed." Conversationally, it works most of the time. It's still an issue, but not for 99% of the people interacting with my robots, so... :)
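
A minimal sketch of the two workarounds described above (rerun the batch, or truncate at <|endoftext|> and fall back to a filler line); generate() and the retry count are placeholders, not this repo's API:

```python
# Hedged sketch of the workarounds above. generate() stands in for whatever
# sampling call you use; max_retries is an arbitrary choice.
EOT = '<|endoftext|>'
FILLER = ("I feel I should have something more to say here, "
          "but I'm not sure how to proceed.")

def sample_reply(prompt, generate, max_retries=3):
    for _ in range(max_retries):
        text = generate(prompt)
        if EOT not in text:
            return text                       # clean sample, use as-is
        head = text.split(EOT, 1)[0].strip()  # keep only text before EOT
        if head:
            return head + ' ' + FILLER        # truncate, then acknowledge
        # nothing usable before <|endoftext|>; rerun the whole batch
    return FILLER
```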

This is less of an issue with the 1558M model, I've found. What model are you using? I got this a LOT on the 345M model.

ErikUden commented 3 years ago

Yes, I've been using the 355M parameter model. I got this issue with every generated text!

DaveXanatos commented 3 years ago

I got my 345M model to a pretty good spot with the following parameters:

def interact_model(
    model_name='345M',    # checkpoint directory under models/
    seed=None,            # set an integer for reproducible sampling
    nsamples=1,           # total samples to generate
    batch_size=1,         # samples per forward pass
    length=140,           # number of tokens to generate per sample
    temperature=1.2,      # >1 flattens the next-token distribution
    top_k=48,             # sample only from the 48 most likely tokens
    top_p=0.7,            # nucleus sampling: smallest set with 0.7 total prob
    models_dir='models',  # where the downloaded models live
):

I still get those EOT tokens occasionally, but usually only on one out of 7 or 8 prompts.
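
For what it's worth, this repo's src/interactive_conditional_samples.py wraps interact_model with fire.Fire, so the same settings can be passed as command-line flags instead of editing the defaults; the invocation below is illustrative:

```python
# Illustrative invocation (run from the repo root):
#   python3 src/interactive_conditional_samples.py --model_name=345M \
#       --length=140 --temperature=1.2 --top_k=48 --top_p=0.7
```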