noamgat / lm-format-enforcer

Enforce the output format (JSON Schema, Regex etc) of a language model

Fixed the issue of being unable to handle transformer added/expanded model tokens #83

Closed · Qubitium closed this 3 months ago

Qubitium commented 4 months ago

For transformers, tokenizer.vocab_size excludes all tokens added via token expansion. The correct usage here is len(tokenizer).

ref: https://stackoverflow.com/questions/67412925/what-is-the-difference-between-lentokenizer-and-tokenizer-vocab-size
ref: https://github.com/huggingface/tokenizers/issues/900#issuecomment-1028784677

Without this PR, any new custom tokens added to the transformers model and subsequently trained will be invisible to lm-format-enforcer.
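
A minimal sketch of the distinction described above, assuming a standard Hugging Face transformers tokenizer (the model name and custom token are placeholders, not taken from this thread):

```python
# Illustrative only: tokenizer.vocab_size ignores added tokens, len(tokenizer) does not.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
print(tokenizer.vocab_size, len(tokenizer))  # 50257 50257

# Expand the vocabulary with a custom token
tokenizer.add_tokens(["<my_custom_token>"])
print(tokenizer.vocab_size, len(tokenizer))  # 50257 50258 -- vocab_size is unchanged
```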

Qubitium commented 4 months ago

@turboderp My PR only fixes the transformers tokenizer integration, as I am unfamiliar with the exllama tokenizer. Perhaps the same patch is also required for the exllama integration, depending on how the exllama tokenizer normalizes "vocab size"? I find the current transformers discrepancy a little strange.

Qubitium commented 3 months ago

@noamgat Please review this bug fix. Thanks.

noamgat commented 3 months ago

Thanks for the contribution!

JoshC8C7 commented 2 months ago

Just a warning that this change now prohibits using models whose vocabulary size (normal + added tokens) is larger than the model's embedding size, i.e. a model whose embeddings haven't been resized and retrained to output the added tokens. Not a common case (although it is my own, as I've got extra post-tokenization, pre-inference steps), but worth noting nonetheless.
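
A minimal sketch of the mismatch being described, assuming typical transformers usage (the model name and custom token are placeholders):

```python
# Illustrative only: added tokens make len(tokenizer) exceed the model's
# embedding matrix when the embeddings are never resized/retrained.
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

tokenizer.add_tokens(["<my_custom_token>"])  # tokenizer grows by one token

vocab_with_added = len(tokenizer)                              # 50258
embedding_rows = model.get_input_embeddings().num_embeddings   # still 50257
print(vocab_with_added, embedding_rows)

# Resizing would normally reconcile the two, but a deliberately un-resized
# model (as in the comment above) now hits the stricter size handling:
# model.resize_token_embeddings(len(tokenizer))
```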