LLMEval crashing due to force unwrapping of optional in token decode for Gemma

ml-explore / mlx-swift-examples

Examples using MLX Swift

MIT License

1k stars 107 forks source link

LLMEval crashing due to force unwrapping of optional in token decode for Gemma #50

Open ViRo3 opened 7 months ago

ViRo3 commented 7 months ago

st.txt is the crash log.

Model : Gemma 2B Quantized Prompt : "Write code to boot a raspberry pico"

davidkoski commented 7 months ago

This looks a bit like:

What version (hash) of swift-transformers are you using?

ViRo3 commented 7 months ago

74b9421

davidkoski commented 7 months ago

OK, that is the current head of main and includes that fix. Oh, interesting -- that particular prompt does reproduce it for me!

Here is the token in question:

(lldb) p model.convertIdToToken(235345)
(String?) nil

thought it looks like it is defined in the tokenizer.json:

      "#": 235345,

The problem is that there are two (ish) #'s:

egrep '122661|235345' tokenizer.json
      "#": 122661,
      "#": 235345,

though one of them has some extra unicode, in particular a ZERO WIDTH NO-BREAK SPACE:

egrep '122661|235345' tokenizer.json | od -c
0000000                            " 357 273 277   #   "   :       1   2
0000020    2   6   6   1   ,  \n                           "   #   "   :
0000040        2   3   5   3   4   5   ,  \n                            
0000051

It looks like the strings map to the same value on read and the tokenizer model loses the entry for 235345.

Blaizzy commented 7 months ago

@ViRo3 could install the latest swift-transformers from GH and let us know if the issue persists.

Btw this might be a huggingface swift-transformers issue and not MLX.

ViRo3 commented 7 months ago

@ViRo3 could install the latest swift-transformers from GH and let us know if the issue persists.

Btw this might be a huggingface swift-transformers issue and not MLX.

I have checked and it persists and is a swift-transformer issue so filed an issue there too.

davidkoski commented 7 months ago

I had a thought about how this might be fixed -- we might be able to make something along these lines:

struct CodePointString : StringProtocol {
    let value: String

    static func ==(lhs: CodePointString, rhs: CodePointString) {
        // this isn't actually comparable but this is the idea
        lhs.utf16 == rhs.utf16
    }

    func hash(hasher: inout Hasher) {
        hasher.combine(value.utf16)
    }

    ...
}

Then we load the config as [CodePointString:Any]. Something like this would let us treat the strings as code points (which is what python, etc. do) rather than unicode strings. Maybe :-)