Open ViRo3 opened 7 months ago
This looks a bit like:
What version (hash) of swift-transformers
are you using?
74b9421
OK, that is the current head of main and includes that fix. Oh, interesting -- that particular prompt does reproduce it for me!
Here is the token in question:
(lldb) p model.convertIdToToken(235345)
(String?) nil
thought it looks like it is defined in the tokenizer.json:
"#": 235345,
The problem is that there are two (ish) #'s:
egrep '122661|235345' tokenizer.json
"#": 122661,
"#": 235345,
though one of them has some extra unicode, in particular a ZERO WIDTH NO-BREAK SPACE:
egrep '122661|235345' tokenizer.json | od -c
0000000 " 357 273 277 # " : 1 2
0000020 2 6 6 1 , \n " # " :
0000040 2 3 5 3 4 5 , \n
0000051
It looks like the strings map to the same value on read and the tokenizer model loses the entry for 235345.
@ViRo3 could install the latest swift-transformers from GH and let us know if the issue persists.
Btw this might be a huggingface swift-transformers issue and not MLX.
@ViRo3 could install the latest swift-transformers from GH and let us know if the issue persists.
Btw this might be a huggingface swift-transformers issue and not MLX.
I have checked and it persists and is a swift-transformer issue so filed an issue there too.
I had a thought about how this might be fixed -- we might be able to make something along these lines:
struct CodePointString : StringProtocol {
let value: String
static func ==(lhs: CodePointString, rhs: CodePointString) {
// this isn't actually comparable but this is the idea
lhs.utf16 == rhs.utf16
}
func hash(hasher: inout Hasher) {
hasher.combine(value.utf16)
}
...
}
Then we load the config as [CodePointString:Any]
. Something like this would let us treat the strings as code points (which is what python, etc. do) rather than unicode strings. Maybe :-)
st.txt is the crash log.
Model : Gemma 2B Quantized Prompt : "Write code to boot a raspberry pico"