karpathy / minbpe

Minimal, clean code for the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization.
MIT License
9.08k stars · 839 forks

Need train_multi.py example to show use with multiple input files #47

Closed. gnp closed this issue 5 months ago.

gnp commented 7 months ago

Since there are significant concerns about handling <|endoftext|>, there should be an example program that shows how to properly prepare input text for that case, pass it to the train function, and verify that the trained tokenizer behaves as intended.

gnp commented 7 months ago

For example, test_tokenizer.py is written as if <|endoftext|> belongs at the start of each input document rather than at the end of each document or in between documents, which is surprising given the token's name.

And the example should show what transformations must be applied to the input documents before concatenating them, so that they are safe to use even if they contain the literal string <|endoftext|>.

Or, training could take a sequence of texts so that no parsing of a real special-token string is required; the virtual token would instead be inserted as part of the training iteration over the sequence of inputs.
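
A rough sketch of that idea (train_from_texts and merge are hypothetical names, not minbpe's API): pair statistics are gathered per document, so merges never span a document boundary and no special-token string ever has to be parsed out of the text.

```python
from collections import Counter

def merge(ids, pair, new_id):
    # replace every occurrence of `pair` in `ids` with `new_id`
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_from_texts(texts, vocab_size):
    # hypothetical trainer that takes a sequence of texts instead of one big string
    assert vocab_size >= 256
    docs = [list(t.encode("utf-8")) for t in texts]  # byte-level ids, one list per document
    merges = {}  # (id, id) -> new id
    for new_id in range(256, vocab_size):
        stats = Counter()
        for ids in docs:
            stats.update(zip(ids, ids[1:]))  # pairs counted per document only
        if not stats:
            break
        pair = max(stats, key=stats.get)
        merges[pair] = new_id
        docs = [merge(ids, pair, new_id) for ids in docs]
    return merges
```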

It is not clear from the examples how to safely use the tokenizer. Formalizing that would entail a codec that safely concatenates documents in a reversible way. The use of that codec would need to be integrated into the tokenizer's flow & configuration.

For example, prior art from SMTP's end-of-data convention could be used this way: any document line that starts with "." is escaped by doubling the dot, and a line consisting of just ".EOF" marks the end of each document; the decoder reverses the escaping, so every document round-trips byte-for-byte (a sketch is given below).

This approach opens up the possibility of other directives besides ".EOF" if desired. For example, instead of ".EOF" after or between documents, you could use ".BOT" before and ".EOT" after each document (with any text between ".EOT" and ".BOT" silently ignored) if full delimiting is desirable for model training or other reasons.
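
A minimal sketch of such a dot-stuffed codec, assuming a ".EOF" terminator line (encode_document and decode_stream are illustrative names, not part of minbpe):

```python
# SMTP-style "dot-stuffing": any input line that begins with "." is escaped by
# doubling the dot, and a line consisting of exactly ".EOF" terminates each
# document. Decoding reverses the escaping, so arbitrary document text
# (including the literal string <|endoftext|>) round-trips byte-for-byte.

def encode_document(text):
    lines = text.split("\n")
    stuffed = ["." + line if line.startswith(".") else line for line in lines]
    return "\n".join(stuffed) + "\n.EOF\n"

def decode_stream(stream):
    docs, current = [], []
    for line in stream.split("\n"):
        if line == ".EOF":
            docs.append("\n".join(current))
            current = []
        elif line.startswith("."):
            current.append(line[1:])  # un-stuff the doubled dot
        else:
            current.append(line)
    return docs  # any trailing text after the last .EOF is ignored

docs = ["first doc\n.starts with a dot\ncontains <|endoftext|> literally", "second doc"]
stream = "".join(encode_document(d) for d in docs)
assert decode_stream(stream) == docs
```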

The train_multi.py example should clearly show that the contents of each input file are treated as pure text, byte-for-byte identical to the original file, with no spurious instances of file text being interpreted as control tokens, and that the special token(s) are nonetheless inserted into the downstream output at the appropriate points.
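
One possible shape for such a train_multi.py, assuming the RegexTokenizer API shown in the repo's README (train, register_special_tokens, encode with allowed_special, save); the file glob, vocab size, and special-token id are placeholders:

```python
# Hypothetical train_multi.py: every input file is used as raw text (no
# special-token strings are ever parsed out of it), and the special token is
# inserted into the downstream id stream only as an integer.
import glob
from pathlib import Path

from minbpe import RegexTokenizer

files = sorted(glob.glob("data/*.txt"))  # placeholder input set
texts = [Path(f).read_text(encoding="utf-8") for f in files]

tokenizer = RegexTokenizer()
tokenizer.train("\n".join(texts), vocab_size=512, verbose=True)  # plain text only
tokenizer.register_special_tokens({"<|endoftext|>": 512})        # id just past the trained vocab

EOT = 512
ids = []
for text in texts:
    # allowed_special="none" means a literal "<|endoftext|>" inside a file is
    # encoded as ordinary bytes, never as the control token
    ids.extend(tokenizer.encode(text, allowed_special="none"))
    ids.append(EOT)  # the real special token goes in here, by id

tokenizer.save("multi")  # writes multi.model and multi.vocab
```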

karpathy commented 7 months ago

When you're training you usually don't care about special tokens; to create training data you'd directly insert the token ids as integers in between documents, instead of changing the text to contain <|endoftext|> or something like that.
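
In code, that looks roughly like this (build_training_ids and EOT_ID are illustrative names, not part of minbpe):

```python
EOT_ID = 100257  # whatever integer id the tokenizer reserves for <|endoftext|>

def build_training_ids(documents, tokenizer):
    ids = []
    for doc in documents:
        ids.extend(tokenizer.encode(doc))  # the text itself is never modified
        ids.append(EOT_ID)                 # the boundary is inserted directly as an int
    return ids
```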

gnp commented 7 months ago

This is in response to part of your video about the tokenizer. And it does affect tokenizer training and evaluation and causes "footguns", if I've understood your presentation correctly.

For example, if one were to train a tokenizer on the minbpe repo's *.md and *.py files, the presence of the <|endoftext|> character sequence in some of them results in spurious, incorrect tokens because of the issue you raised and that I'm responding to.
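
A small check of that scenario, assuming the RegexTokenizer API from the repo's README (the file choice, vocab size, and special-token id are placeholders):

```python
from minbpe import RegexTokenizer

# Train on one of the repo's own files; several of them contain the literal
# string <|endoftext|> in their text.
tokenizer = RegexTokenizer()
tokenizer.train(open("README.md", encoding="utf-8").read(), vocab_size=512)
tokenizer.register_special_tokens({"<|endoftext|>": 512})

line = 'print("<|endoftext|>")  # a line like this appears in the repo sources'
ids = tokenizer.encode(line, allowed_special="none")  # treated as ordinary bytes
assert tokenizer.decode(ids) == line
assert 512 not in ids  # the control token id never appears spuriously

# The default allowed_special="none_raise" would instead raise an AssertionError
# on this input, which is the footgun under discussion.
```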

The problems you raised arise between the tokenizer's implementation and training on one side, and input preparation and tokenizer use for evaluation on the other. This issue is about (a) requesting an example in the repo showing how to use the tokenizer as intended so as to avoid those problems, and (b) showing that different choices for how special tokens are represented in (prepared) text make the problem more tractable.