Closed mzbac closed 3 months ago
Hi @mzbac! That's a good point. Yes, the current API is mostly focused on downloading from the Hub via the `.from(pretrained:)` static method: https://github.com/huggingface/swift-transformers/blob/2eea3158b50ac7e99c9b5d4df60359daed9b832c/Sources/Tokenizers/Tokenizer.swift#L249. After the first download the tokenizer files are cached locally, but I agree there should be an easy way to load from a local folder. Until that feature exists, you can load the tokenizer configuration files yourself and then invoke this version of the loader.
Some of the project's unit tests go through this route, so you can see how it's done: https://github.com/huggingface/swift-transformers/blob/main/Tests/TokenizersTests/TokenizerTests.swift#L117. In that test the configuration is also loaded from the Hub via `LanguageModelConfigurationFromHub`, but you could read the files from disk instead. Referring to this constructor, `tokenizerConfig` should be loaded from the contents of `tokenizer_config.json` (such as this one), which you'd have locally on disk. Similarly, `tokenizerData` should be read from a `tokenizer.json` file, like this one.
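To make the suggestion above concrete, here is a minimal sketch of loading both JSON files from a local folder and passing them to `AutoTokenizer.from(tokenizerConfig:tokenizerData:)`. It assumes the `Config` dictionary wrapper from the Hub module and that loader signature; the `loadConfig` helper and the `/path/to/local/model` folder are hypothetical names for illustration, not part of the library:

```swift
import Foundation
import Hub        // provides the `Config` dictionary wrapper
import Tokenizers // provides `AutoTokenizer`

// Hypothetical helper: read a JSON file from disk into a `Config`.
func loadConfig(from url: URL) throws -> Config {
    let data = try Data(contentsOf: url)
    let json = try JSONSerialization.jsonObject(with: data, options: [])
    guard let dictionary = json as? [NSString: Any] else {
        throw NSError(domain: "TokenizerLoading", code: 1)
    }
    return Config(dictionary)
}

// Assumed local directory containing the two files mentioned above:
// tokenizer_config.json and tokenizer.json.
let modelFolder = URL(fileURLWithPath: "/path/to/local/model")
let tokenizerConfig = try loadConfig(
    from: modelFolder.appendingPathComponent("tokenizer_config.json"))
let tokenizerData = try loadConfig(
    from: modelFolder.appendingPathComponent("tokenizer.json"))

// Invoke the non-Hub loader directly, bypassing any network access.
let tokenizer = try AutoTokenizer.from(
    tokenizerConfig: tokenizerConfig, tokenizerData: tokenizerData)
let ids = tokenizer.encode(text: "Hello world")
```

The key design point is that the loader only needs the two parsed configurations, so where they come from (Hub cache, bundled resources, or an arbitrary folder on disk) is entirely up to the caller.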
Thanks for the detailed explanation. I will play around a bit and see if I can make it work :)
Maybe I missed something, but I am currently working on a Swift MLX server implementation following the examples at https://github.com/ml-explore/mlx-swift-examples/blob/main/Libraries/LLM/Tokenizer.swift#L7. From my understanding, the tokenizer configuration has to be loaded via the Hub API. Is there a way to load the tokenizer from a local folder instead of fetching it from the Hugging Face repo?