huggingface / swift-transformers

Swift Package to implement a transformers-like API in Swift
Apache License 2.0
536 stars, 46 forks

Ensure added tokens are supported #92

Closed (pcuenca closed 2 months ago)

DanThePutzer commented 2 months ago

+1 on this one. I'm trying to get Phi-3 working on iOS, and the tokenizer does not seem to recognise the added_tokens in the config, specifically the <|end|> token, so the model doesn't know when to stop.

Test code:

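import Tokenizers

func testTokenizer(testString: String) async throws {
    // Download the tokenizer configuration from the Hub and build the tokenizer.
    let tokenizer = try await AutoTokenizer.from(pretrained: "mlx-community/Phi-3-mini-128k-instruct-4bit")
    // Encode the input string into token ids.
    let inputIds = tokenizer(testString)

    // Decode each id on its own to see how the input was split.
    for token in inputIds {
        let decoded = tokenizer.decode(tokens: [token])
        print(decoded)
    }
}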

Output:

<s>
hello
<
|
end
|
>

As can be seen in the tokenizer.json on Hugging Face, <|end|> is listed in added_tokens, but the tokenizer does not seem to honor it and instead splits <|end|> into separate tokens (<, |, end, |, >).
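
To illustrate the behaviour I would expect, here is a minimal sketch of one common way to honor added tokens: split the input on them before the regular tokenization runs. This is not how swift-transformers is implemented. `Segment` and `splitOnAddedTokens` are hypothetical names made up for this example, and the list of added tokens is assumed to have already been read from the added_tokens entries of tokenizer.json.

```swift
/// One piece of the input: either an added token that should map directly to
/// its own id, or ordinary text for the regular tokenization pipeline.
/// (Hypothetical type, for illustration only.)
struct Segment: Equatable {
    let text: String
    let isAddedToken: Bool
}

/// Hypothetical helper (not part of swift-transformers): splits `text` so that
/// every occurrence of an added token becomes its own segment.
func splitOnAddedTokens(_ text: String, addedTokens: [String]) -> [Segment] {
    // Try longer tokens first so overlapping tokens resolve to the longest match.
    let candidates = addedTokens.filter { !$0.isEmpty }.sorted { $0.count > $1.count }
    var segments: [Segment] = []
    var pending = ""                 // ordinary text accumulated so far
    var index = text.startIndex

    while index < text.endIndex {
        let rest = text[index...]
        if let match = candidates.first(where: { rest.starts(with: $0) }) {
            // Flush the ordinary text seen before the added token.
            if !pending.isEmpty {
                segments.append(Segment(text: pending, isAddedToken: false))
                pending = ""
            }
            segments.append(Segment(text: match, isAddedToken: true))
            index = text.index(index, offsetBy: match.count)
        } else {
            pending.append(text[index])
            index = text.index(after: index)
        }
    }
    if !pending.isEmpty {
        segments.append(Segment(text: pending, isAddedToken: false))
    }
    return segments
}

// Usage: "<|end|>" stays in one piece instead of being split into <, |, end, |, >.
for segment in splitOnAddedTokens("hello<|end|>", addedTokens: ["<|end|>"]) {
    print(segment.isAddedToken ? "added: \(segment.text)" : "text:  \(segment.text)")
}
```

With something like this in place, segments marked as added tokens would be looked up directly in the vocabulary, and only the remaining text segments would go through the normal pre-tokenizer and model.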