+1 on this one. I'm trying to get Phi-3 working on iOS, and the tokenizer does not seem to recognise the added_tokens in the config, specifically the <|end|> token, so the model doesn't know when to stop.
Test code:
    import Tokenizers

    func testTokenizer(testString: String) async throws {
        // Load the tokenizer config for the Phi-3 checkpoint from the Hub
        let tokenizer = try await AutoTokenizer.from(pretrained: "mlx-community/Phi-3-mini-128k-instruct-4bit")
        let inputIds = tokenizer(testString)
        // Decode each id on its own to see how the input was split
        for token in inputIds {
            let decoded = tokenizer.decode(tokens: [token])
            print("\(decoded)")
        }
    }
Output:
<s>
hello
<
|
end
|
>
As can be seen in the tokenizer.json on Hugging Face, <|end|> is listed in added_tokens, but the tokenizer does not seem to honor it and splits <|end|> into separate tokens.
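For reference, here is a rough sketch of the behavior I would expect: the input is first split on the added-token strings, so that each added token maps to a single id and only the text in between goes through the normal tokenization. This is not the library's implementation; `Segment` and `splitOnAddedTokens` are hypothetical names made up for illustration, and the input "hello<|end|>" is my assumption based on the output above.

```swift
import Foundation

/// A piece of the input: either an added token kept verbatim, or ordinary text.
/// (Hypothetical type, for illustration only.)
enum Segment: Equatable {
    case added(String)
    case text(String)
}

/// Splits `input` so that every occurrence of an added token becomes its own segment.
/// (Hypothetical helper, not part of any library.)
func splitOnAddedTokens(_ input: String, addedTokens: [String]) -> [Segment] {
    // Try longer tokens first so overlapping entries resolve to the longest match.
    let candidates = addedTokens.filter { !$0.isEmpty }.sorted { $0.count > $1.count }
    var segments: [Segment] = []
    var pending = ""
    var index = input.startIndex

    while index < input.endIndex {
        let rest = input[index...]
        if let match = candidates.first(where: { rest.starts(with: $0) }) {
            // Flush the ordinary text collected so far, then emit the added token whole.
            if !pending.isEmpty {
                segments.append(.text(pending))
                pending = ""
            }
            segments.append(.added(match))
            index = input.index(index, offsetBy: match.count)
        } else {
            pending.append(input[index])
            index = input.index(after: index)
        }
    }
    if !pending.isEmpty {
        segments.append(.text(pending))
    }
    return segments
}

let segments = splitOnAddedTokens("hello<|end|>", addedTokens: ["<|end|>"])
print(segments)
```

With a pre-split like this, "hello" would go through the regular tokenizer and <|end|> would be looked up directly as one token, instead of being broken into <, |, end, |, >.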