Open gabe-l-hart opened 1 week ago
Thanks for the updates and context! I'd be interested in seeing what your working implementation is for converting out of HF's tokenizer lib.
Sure! This is what I have for tiktoken_converter.py
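At its core, the conversion re-serializes the BPE vocab into the base64-encoded "token rank" lines that tiktoken expects. A minimal sketch of that idea (assuming a byte-level BPE tokenizer.json like llama's, and not the exact contents of that file):

```python
# Minimal sketch of the conversion idea (not the full converter): re-serialize a
# byte-level BPE vocab from tokenizer.json into tiktoken's "base64(token) rank" lines.
import base64
import json


def bytes_to_unicode():
    """GPT-2 style byte <-> printable-unicode mapping used by byte-level BPE vocabs."""
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))


def convert(tokenizer_json_path: str, output_path: str) -> None:
    with open(tokenizer_json_path, encoding="utf-8") as f:
        vocab = json.load(f)["model"]["vocab"]  # token string -> rank

    unicode_to_byte = {c: b for b, c in bytes_to_unicode().items()}
    with open(output_path, "w") as out:
        for token, rank in sorted(vocab.items(), key=lambda kv: kv[1]):
            token_bytes = bytes(unicode_to_byte[ch] for ch in token)
            out.write(f"{base64.b64encode(token_bytes).decode()} {rank}\n")
```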
The main gap is around handling the pretokenizer. In tokenizers, there are a number of pretokenizer types that either take multiple regexes evaluated in sequence (later expressions evaluated on the chunks found by previous splits), or are just different classes with different splitting logic (I haven't fully delved into this yet).
The other piece that is not yet portable is the addition of special tokens other than those used by the llama* models. The tokenizer.model format seems to only encode the vocab (ranks) and, from what I can tell, doesn't have a way to include the special tokens.
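To make that concrete: with the Python tiktoken API, the split regex and special tokens are passed in code when building the Encoding, since the tokenizer.model file itself only carries the ranks (the regex and special token below are placeholders):

```python
# Illustration (not torchchat code): the tiktoken file only carries ranks, so the
# split regex and special tokens must be provided programmatically.
from tiktoken import Encoding
from tiktoken.load import load_tiktoken_bpe

mergeable_ranks = load_tiktoken_bpe("tokenizer.model")
enc = Encoding(
    name="custom",
    pat_str=r"\s?\w+|\s?[^\w\s]+|\s+",  # placeholder split regex; real models hard-code theirs
    mergeable_ranks=mergeable_ranks,
    special_tokens={"<|custom_special|>": len(mergeable_ranks)},  # hypothetical special token
)
```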
Draft PR up: https://github.com/pytorch/torchchat/pull/1261
I've noted some more details on the open investigation questions in the Discussion section of the PR.
@Jack-Khuu I've been digging into the landscape of the C++ code a bit. It looks like, in addition to supporting this in torchchat, we'd need to also extend the tokenizer functionality in executorch, which seems to be logically equivalent but implemented differently. I think the set of issues in both codebases is similar, though:
1. The pre-tokenizer regex is hard-coded rather than configurable
2. The special tokens are hard-coded to those used by the llama* models
3. No support in the tokenizer.model format for either custom regex(es) or special tokens

The underlying guts of the decode implementation in the two tiktoken.cpp implementations would actually not be terribly hard to update to support 1-3, but doing so would definitely break the 1:1 compatibility with the original tiktoken implementation. Similarly, it wouldn't be terribly difficult to add special parsing logic for additional metadata fields in the tokenizer.model format, but that would also break compatibility with the true tiktoken format.
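Just to illustrate what that would mean (a purely hypothetical extension, not something proposed in the PR), any extra metadata line in the file would be rejected by a stock tiktoken loader that expects only base64/rank pairs:

```python
# Hypothetical illustration only: prepending a JSON metadata line to tokenizer.model
# could carry the regex and special tokens, but a stock tiktoken loader that expects
# only "base64(token) rank" lines would no longer be able to read the file.
import base64
import json

metadata = {"pat_str": r"\s?\w+|\s?[^\w\s]+|\s+", "special_tokens": ["<|custom_special|>"]}
with open("tokenizer.model.extended", "w") as f:
    f.write("#meta " + json.dumps(metadata) + "\n")         # extra, non-standard line
    f.write(f"{base64.b64encode(b'hello').decode()} 0\n")    # normal tiktoken rank line
```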
Given this, I think we could go one of two ways:
1. Extend tiktoken in both projects to be a generic regex/special-token tokenizer
2. Directly support the tokenizer.json format from tokenizers, which would require adding json parsing support (e.g. vendoring a copy of nlohmann/json)

Given the compatibility concerns, my initial preference would be for (2), but I want to kick off the conversation since either one would be a pretty significant change.
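For a sense of what (2) entails, these are roughly the tokenizer.json sections a native loader would need to understand (field names are from the tokenizers serialization format):

```python
# Rough look at the tokenizer.json pieces a native loader for option (2) would need;
# field names follow the Huggingface tokenizers serialization format.
import json

with open("tokenizer.json", encoding="utf-8") as f:
    cfg = json.load(f)

print(cfg["pre_tokenizer"])        # splitting logic: Split / ByteLevel / Sequence / ...
print(cfg["added_tokens"][:3])     # special/added tokens, with ids and flags
print(len(cfg["model"]["vocab"]),  # BPE vocab (token -> rank)
      len(cfg["model"]["merges"]))  # and merge rules
```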
Thanks for the details and analysis, I'll hop over to the PR to comment
🚀 The feature, motivation and pitch
The request is to extend the tokenizer module in torchchat to support tokenizers that use the Huggingface tokenizers library. There are many models out there that use tokenizers which won't be able to run in torchchat until they can be loaded and run either via the tokenizers library directly or via a conversion to tiktoken or sentencepiece.
Alternatives
It may be possible to convert a tokenizers tokenizer to a tiktoken tokenizer. I have a working implementation of this for the llama tokenizer.json model; however, other models that use different tokenizers configurations do not work (in particular Granite Code).
Additional context
This issue is a piece of the puzzle for adding support for Granite Code 3b/8b, which use the llama architecture in transformers but take advantage of several pieces of the architecture that are not currently supported by torchchat. The work-in-progress for Granite Code can be found on my fork: https://github.com/gabe-l-hart/torchchat/tree/GraniteCodeSupport.
I have a less fully fleshed-out working version of this that I plan to put up as a Draft PR for discussion. I am not intimately familiar with the algorithmic differences between tiktoken and the various tokenizers pieces (in particular the pretokenizers). My branch has a python implementation that simply wraps tokenizers, but I have not yet tried to export Granite Code to other formats, where I suspect it would break without a corresponding C++ implementation. I plan to investigate this further soon!
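For reference, the wrapping itself is thin; a minimal sketch (the interface torchchat actually expects is an assumption here) looks something like:

```python
# Minimal sketch of wrapping the Huggingface tokenizers library behind a simple
# encode/decode interface; the interface torchchat actually expects is an assumption.
from typing import List

from tokenizers import Tokenizer


class HFTokenizerWrapper:
    """Thin wrapper around a tokenizer.json-based tokenizer."""

    def __init__(self, tokenizer_json_path: str):
        self._tok = Tokenizer.from_file(tokenizer_json_path)

    def encode(self, text: str) -> List[int]:
        # add_special_tokens=False: leave bos/eos handling to the caller in this sketch
        return self._tok.encode(text, add_special_tokens=False).ids

    def decode(self, ids: List[int]) -> str:
        return self._tok.decode(ids)
```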
RFC (Optional)
No response