shawwn opened this issue 4 years ago
My main question is, were the OpenAI models trained with <|endoftext|> (a single token separating each document), or <| end of text |>, which is how the BPE encoder generates it?
Update: It turns out the answer is that OpenAI trained their models by separating texts using the single-token <|endoftext|>, whereas most fine-tuning code is based on nshepperd's repo (https://github.com/nshepperd/gpt-2), which usually just uses the BPE encoder, and the BPE encoder generates <| end of text |> as 5 tokens.
Thanks! Any suggestions on how to hack/patch the encoder to properly deal with this? Or, if fine-tuning is sufficient, could we just use END or something as a token? Would that even be a single token? Or is the use of multiple tokens really that bad? Using this token as a truncate stop is working, so it's at least being returned by generate properly...
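Whether a candidate delimiter like END comes out as a single token is easy to check by encoding it and counting the ids; a quick probe, assuming the same encoder module from the gpt-2 repo:

```python
import encoder  # src/encoder.py from the openai/gpt-2 repo

enc = encoder.get_encoder('345M', 'models')
for candidate in ['END', '<|endoftext|>']:
    ids = enc.encode(candidate)
    print(repr(candidate), '->', ids, f'({len(ids)} tokens)')
# Anything that comes back as more than one id is not a single-token separator.
```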
Whoa I think I just ran into this issue. Would really appreciate any help!!
So how does one stop the <|endoftext|> token from being randomly generated after just one sentence? Surely this can't be in the interest of the developers, as it makes the "length" variable meaningless: truncating the text at the first <|endoftext|> returns randomized, mostly too-short lengths.
I wrote a patch such that if the output contains "<|endoftext|>", I just rerun the whole batch. The reason being that when <|endoftext|> shows up, everything following it (usually) has no relation to the input prompt.
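A minimal sketch of that retry loop; `generate_batch` here is a hypothetical callable wrapping the repo's sampling code and returning decoded strings:

```python
def generate_until_clean(generate_batch, max_retries=5):
    # Rerun the whole batch until no sample contains the end-of-text marker.
    for _ in range(max_retries):
        samples = generate_batch()
        if not any('<|endoftext|>' in s for s in samples):
            return samples
    return samples  # give up after max_retries; the caller can truncate instead
```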
For my conversational robots, I have it truncate the output at the first <|endoftext|> and state "I feel I should have something more to say here, but I'm not sure how to proceed." Conversationally, it works most of the time. Still an issue, but not for 99% of the people interacting with my robots, so... :)
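A sketch of that truncate-plus-fallback approach, using the stock line from the post above; the function name is just for illustration:

```python
FALLBACK = ("I feel I should have something more to say here, "
            "but I'm not sure how to proceed.")

def truncate_reply(text, eot='<|endoftext|>'):
    # Keep only the text before the first end-of-text marker.
    reply = text.split(eot, 1)[0].strip()
    # If the model produced nothing usable, fall back to the stock line.
    return reply if reply else FALLBACK
```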
This is less of an issue if you're using the 1558M model, I've found. What model are you using? I got this a LOT on the 345M model.
Yes, I've been using the 355M parameter model. I got this issue with every generated text!
I got my 345M model to a pretty good spot with the following parameters:
```python
def interact_model(
    model_name='345M',
    seed=None,
    nsamples=1,
    batch_size=1,
    length=140,
    temperature=1.2,
    top_k=48,
    top_p=0.7,
    models_dir='models',
):
    ...  # body as in the repo's interactive_conditional_samples.py
```
I still get those EOT things occasionally, but usually only in one out of 7 or 8 prompts.
Relevant tweet chain:
https://twitter.com/theshawwn/status/1208169319223480322
https://twitter.com/theshawwn/status/1208171700057186304
Basically, you're prompting the model with <|endoftext|> (a single token with BPE value 50256 or whatever), but the BPE encoder encodes <|endoftext|> as <| end of text |>, five separate tokens. It's completely different.
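To see the mismatch directly, compare what the BPE encoder produces for the string with the single id stored in the vocabulary; a quick demo, assuming a downloaded 345M model and the repo's encoder module:

```python
import encoder  # src/encoder.py from the openai/gpt-2 repo

enc = encoder.get_encoder('345M', 'models')
print(enc.encode('<|endoftext|>'))   # several ordinary BPE tokens
print(enc.encoder['<|endoftext|>'])  # 50256: the single token the model was trained with
# Prompting with the first is nothing like prompting with the second.
```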