huggingface / swift-transformers

Swift Package to implement a transformers-like API in Swift
Apache License 2.0

Support loading tokenizer from local folder #76

Closed · mzbac closed this issue 3 months ago

mzbac commented 3 months ago

Maybe I missed something, but I am currently working on a Swift MLX server implementation, following the example at https://github.com/ml-explore/mlx-swift-examples/blob/main/Libraries/LLM/Tokenizer.swift#L7. From my understanding, the tokenizer configuration has to be loaded via the Hub API. Is there a way to load the tokenizer from a local folder instead of fetching it from the Hugging Face repo?

pcuenca commented 3 months ago

Hi @mzbac! That's a good point. Yes, the current API is mostly focused on downloading from the Hub via the `.from(pretrained:)` static method: https://github.com/huggingface/swift-transformers/blob/2eea3158b50ac7e99c9b5d4df60359daed9b832c/Sources/Tokenizers/Tokenizer.swift#L249. After the first download, the tokenizer files are cached locally, but I agree there should be an easy way to load from a local folder. Until that feature exists, you can load the tokenizer configuration files yourself and then invoke this version of the loader.
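
For context, the Hub-based path referenced above looks roughly like this (a minimal sketch; the model id is just a placeholder):

```swift
import Tokenizers

// Downloads the tokenizer files from the Hub (or reuses the local cache)
// and builds a tokenizer from them.
let tokenizer = try await AutoTokenizer.from(pretrained: "t5-base")  // placeholder model id
let tokens = tokenizer.encode(text: "Hello, world!")
```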

Some of the project's unit tests go through this route, so you can see how it's done: https://github.com/huggingface/swift-transformers/blob/main/Tests/TokenizersTests/TokenizerTests.swift#L117. In this test, the configuration is also loaded from the Hub via `LanguageModelConfigurationFromHub`, but you could read the files from disk instead. Referring to this constructor, `tokenizerConfig` should be loaded from the contents of `tokenizer_config.json`, such as this one, which you'd have locally on disk. Similarly, `tokenizerData` should be read from a `tokenizer.json` file, like this one.
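
To make the workaround concrete, a local-folder version could look something like the sketch below. It assumes the `Config(_:)` initializer from the `Hub` module and the `AutoTokenizer.from(tokenizerConfig:tokenizerData:)` entry point exercised by the test above; the folder path and the `loadConfig` helper are placeholders for this example.

```swift
import Foundation
import Hub
import Tokenizers

/// Hypothetical helper: reads a JSON file from disk and wraps it in a `Config`.
func loadConfig(at url: URL) throws -> Config {
    let data = try Data(contentsOf: url)
    let json = try JSONSerialization.jsonObject(with: data)
    guard let dictionary = json as? [NSString: Any] else {
        throw NSError(domain: "TokenizerLoading", code: 0,
                      userInfo: [NSLocalizedDescriptionKey: "Expected a JSON dictionary"])
    }
    return Config(dictionary)
}

// Folder that already contains tokenizer_config.json and tokenizer.json (placeholder path).
let folder = URL(fileURLWithPath: "/path/to/local/model")

let tokenizerConfig = try loadConfig(at: folder.appendingPathComponent("tokenizer_config.json"))
let tokenizerData = try loadConfig(at: folder.appendingPathComponent("tokenizer.json"))

// Build the tokenizer without touching the network.
let tokenizer = try AutoTokenizer.from(tokenizerConfig: tokenizerConfig, tokenizerData: tokenizerData)
let tokens = tokenizer.encode(text: "Hello from a local tokenizer!")
```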

mzbac commented 3 months ago

Thanks for the detailed explanation. I will play around a bit and see if I can make it work :)