Open davidkoski opened 7 months ago
Seen using commit 03d86ac.
See also: https://github.com/PreternaturalAI/mlx-swift-chat/issues/8
The tokenizer for stabilityai/stablelm-2-zephyr-1_6b has a configuration like this:
    "pre_tokenizer": {
      "type": "Sequence",
      "pretokenizers": [
        {
          "type": "Split",
          "pattern": {
            "Regex": "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}| ?[^\\s\\p{L}\\p{N}]+[\r\n]*|\\s*[\r\n]+|\\s+(?!\\S)|\\s+"
          },
          "behavior": "Removed",
          "invert": true
        },
which ends up here:
    class SplitPreTokenizer: PreTokenizer {
        ...
        func preTokenize(text: String) -> [String] {
            guard let pattern = pattern else { return [text] }
            return pattern.split(text, invert: invert)
        }
Given the input string "Why did the chicken cross the road? " it returns an array containing a single empty string:

    (lldb) p pattern.split(text, invert: true)
    ([String]) 1 value {
      [0] = ""
    }
I observed that if invert is false, it gives something that looks reasonable to my eyes:

    (lldb) p pattern.split(text, invert: false)
    ([String]) 10 values {
      [0] = "Why"
      [1] = " did"
      [2] = " the"
      [3] = " chicken"
      [4] = " cross"
      [5] = " the"
      [6] = " road"
      [7] = "?"
      [8] = " "
      [9] = ""
    }
I am not sure what the behavior is supposed to be here -- I wonder if the behavior of invert might be ... inverted? I think the configuration is correct, because the Python tokenizer behaves correctly.
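To make the question concrete, here is a minimal sketch of what I would expect "Removed" plus invert to mean. It rests on an assumption about the Hugging Face semantics that I have not verified against the reference implementation: with invert false the regex matches are delimiters and get dropped, and with invert true the matches themselves are kept and the gaps between them are dropped. `splitRemoved` is a hypothetical helper written only for illustration, it is not the swift-transformers API, and it uses simplified patterns rather than the one from the config.

```swift
import Foundation

/// Hypothetical helper (not part of swift-transformers).
/// Sketches "Removed" behavior under the ASSUMED semantics:
///   invert == false: regex matches are delimiters; keep the text between them.
///   invert == true:  regex matches are the pieces; drop the text between them.
func splitRemoved(_ text: String, regex pattern: String, invert: Bool) -> [String] {
    guard let regex = try? NSRegularExpression(pattern: pattern) else { return [text] }
    let ns = text as NSString
    let matches = regex.matches(in: text, range: NSRange(location: 0, length: ns.length))

    var pieces: [String] = []
    var cursor = 0  // end of the previous match, in UTF-16 units
    for match in matches {
        let r = match.range
        if r.length == 0 { continue }  // ignore empty matches
        if invert {
            // Keep the match itself.
            pieces.append(ns.substring(with: r))
        } else if r.location > cursor {
            // Keep the gap that precedes this match.
            pieces.append(ns.substring(with: NSRange(location: cursor, length: r.location - cursor)))
        }
        cursor = r.location + r.length
    }
    if !invert && cursor < ns.length {
        // Keep whatever follows the last match.
        pieces.append(ns.substring(from: cursor))
    }
    return pieces
}

let sample = "Why did the chicken?"

// The pattern describes the tokens, so invert: true keeps the matches.
let kept = splitRemoved(sample, regex: " ?[A-Za-z]+|[?]", invert: true)
print(kept)

// The pattern describes the separators, so invert: false keeps the gaps.
let gaps = splitRemoved(sample, regex: "\\s+", invert: false)
print(gaps)
```

If that reading is right, a token-describing pattern like the one in the config should yield the words when invert is true, which is the opposite of what I am seeing.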
Thanks for the report @davidkoski, I'll take a look!