taylorgoolsby closed this issue 3 months ago
Hi @taylorgoolsby! CodeGenTokenizer
is already supported: https://github.com/huggingface/swift-transformers/blob/24605a8c0cc974bec5b94a6752eb687bae77db31/Sources/Tokenizers/Tokenizer.swift#L254
Is it not working for you?
I'm not familiar with Swift. This looks like a stub: CodeGenTokenizer appears to simply be aliased to the BPE tokenizer.
Like many other tokenizers, CodeGenTokenizer uses the BPE tokenization model. The specific vocabulary and other model parameters are retrieved from configuration files downloaded from the Hub when you instantiate the tokenizer for your model. For example, you could instantiate it for Phi-2 using code like the following:
let tokenizer = try await AutoTokenizer.from(pretrained: "microsoft/phi-2")
That would download the tokenization configuration files from https://huggingface.co/microsoft/phi-2/tree/main (tokenizer_config.json, tokenizer.json, vocab.json, ...) and create the tokenizer from them.
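To make the round trip concrete, here is a minimal sketch of using the resulting tokenizer to encode and decode text. It assumes the `Tokenizers` module from swift-transformers, whose `Tokenizer` protocol exposes `encode(text:)` and `decode(tokens:)`; the sample prompt is illustrative, and the call requires network access to download the configuration files on first use.

```swift
import Tokenizers  // from the swift-transformers package

// Instantiating downloads tokenizer_config.json, tokenizer.json,
// vocab.json, ... from https://huggingface.co/microsoft/phi-2
let tokenizer = try await AutoTokenizer.from(pretrained: "microsoft/phi-2")

// Encode a prompt into token ids, then decode back to a string
let ids = tokenizer.encode(text: "def fibonacci(n):")
let roundTrip = tokenizer.decode(tokens: ids)
```

Since the BPE merges and vocabulary come from the downloaded files rather than Swift code, the same path works for any model whose `tokenizer_config.json` names a supported tokenizer class.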
That should work, let me know if that's not the case.
Also note that there are still some precision issues when converting the Phi-2 model to Core ML. But the tokenizer should work.
Closing for now, feel free to reopen if tokenization does not work.
Primarily to be used with Phi-2