huggingface / swift-transformers

Swift Package to implement a transformers-like API in Swift
Apache License 2.0

Tokenizer models: CodeGenTokenizer #66

taylorgoolsby closed 3 months ago

taylorgoolsby commented 3 months ago

This is primarily for use with Phi-2.

pcuenca commented 3 months ago

Hi @taylorgoolsby! CodeGenTokenizer is already supported: https://github.com/huggingface/swift-transformers/blob/24605a8c0cc974bec5b94a6752eb687bae77db31/Sources/Tokenizers/Tokenizer.swift#L254

Is it not working for you?

taylorgoolsby commented 3 months ago

I'm not familiar with Swift code. This looks like a stub, with CodeGenTokenizer currently defined as equivalent to BPE.

pcuenca commented 3 months ago

Like many other tokenizers, CodeGenTokenizer uses the BPE tokenization model. The specific vocabulary and other model parameters are read from configuration files that are downloaded from the Hub when you instantiate the tokenizer for your model. For example, you could instantiate it for Phi-2 using code like the following:

let tokenizer = try await AutoTokenizer.from(pretrained: "microsoft/phi-2")

That would download the tokenization configuration files from https://huggingface.co/microsoft/phi-2/tree/main (tokenizer_config.json, tokenizer.json, vocab.json, ...) and create the tokenizer from them.
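On the point that the class looks like a stub: an empty subclass can be a complete tokenizer when all of the behaviour lives in a shared BPE base class and only the configuration differs per model. Here is a minimal, self-contained sketch of that pattern. Every name in it (`ToyBPEConfig`, `ToyBPETokenizer`, `ToyCodeGenTokenizer`, `Pair`) is hypothetical and made up for illustration; this is a toy, not the library's implementation, and the merge table stands in for what would normally come from the downloaded configuration files.

```swift
// Toy illustration only: all names here are hypothetical, not library API.

/// An adjacent pair of symbols that a merge rule joins.
struct Pair: Hashable {
    let left: String
    let right: String
}

/// Stand-in for the data a real tokenizer reads from its config files.
struct ToyBPEConfig {
    let merges: [Pair]          // merge rules, highest priority first
    let vocab: [String: Int]    // token string -> token id
}

class ToyBPETokenizer {
    let config: ToyBPEConfig
    private let ranks: [Pair: Int]

    init(config: ToyBPEConfig) {
        self.config = config
        var ranks: [Pair: Int] = [:]
        for (index, pair) in config.merges.enumerated() {
            ranks[pair] = index
        }
        self.ranks = ranks
    }

    /// Repeatedly merges the adjacent pair with the lowest rank
    /// until no known merge rule applies.
    func tokenize(_ word: String) -> [String] {
        var symbols = word.map { String($0) }
        while symbols.count > 1 {
            var best: (index: Int, rank: Int)? = nil
            for i in 0..<(symbols.count - 1) {
                let pair = Pair(left: symbols[i], right: symbols[i + 1])
                if let rank = ranks[pair], rank < (best?.rank ?? Int.max) {
                    best = (index: i, rank: rank)
                }
            }
            guard let merge = best else { break }
            let merged = symbols[merge.index] + symbols[merge.index + 1]
            symbols.replaceSubrange(merge.index...(merge.index + 1), with: [merged])
        }
        return symbols
    }

    /// Maps tokens to ids, skipping tokens missing from the toy vocabulary.
    func encode(_ word: String) -> [Int] {
        tokenize(word).compactMap { config.vocab[$0] }
    }
}

// Empty subclass: it inherits the initializer and all behaviour.
// What makes it "CodeGen" is the config it is constructed with.
final class ToyCodeGenTokenizer: ToyBPETokenizer {}

let toyConfig = ToyBPEConfig(
    merges: [Pair(left: "d", right: "e"), Pair(left: "de", right: "f")],
    vocab: ["def": 0, "y": 1]
)
let toyTokenizer = ToyCodeGenTokenizer(config: toyConfig)
```

With this toy configuration, `toyTokenizer.tokenize("defy")` merges `d` + `e` first and then `de` + `f`, leaving `def` and `y`. The subclass body stays empty because nothing about the algorithm changes between models.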

That should work; let me know if it doesn't.
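One way to check is to encode a string and decode it back. The sketch below assumes the swift-transformers package has been added as a dependency, that the `Tokenizers` module is imported, and that the device has network access to the Hub. `roundTrip` is a hypothetical helper written for this example, not something the library provides.

```swift
import Tokenizers  // module from the swift-transformers package

/// Hypothetical helper: loads a tokenizer from the Hub, encodes `text`,
/// and decodes the resulting ids back into a string.
func roundTrip(
    _ text: String,
    model: String = "microsoft/phi-2"
) async throws -> (ids: [Int], decoded: String) {
    let tokenizer = try await AutoTokenizer.from(pretrained: model)
    let ids = tokenizer.encode(text: text)        // text -> token ids
    let decoded = tokenizer.decode(tokens: ids)   // token ids -> text
    return (ids: ids, decoded: decoded)
}
```

From an async context, `let result = try await roundTrip("def add(a, b): return a + b")` followed by printing `result.ids` and `result.decoded` shows whether the tokenizer loaded and whether the decoded text resembles the input. If loading throws, the error is the most useful detail to include when reopening.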

Also note that there are still some precision issues when converting the Phi-2 model to Core ML, but the tokenizer should work.

Closing for now, feel free to reopen if tokenization does not work.