huggingface / swift-transformers

Swift Package to implement a transformers-like API in Swift
Apache License 2.0

SplitPreTokenizer with invert true returning array with empty string #55

Open davidkoski opened 7 months ago

davidkoski commented 7 months ago

Seen using commit 03d86ac. See also: https://github.com/PreternaturalAI/mlx-swift-chat/issues/8

The tokenizer for stabilityai/stablelm-2-zephyr-1_6b has a configuration like this:

  "pre_tokenizer": {
    "type": "Sequence",
    "pretokenizers": [
      {
        "type": "Split",
        "pattern": { 
          "Regex": "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}| ?[^\\s\\p{L}\\p{N}]+[\r\n]*|\\s*[\r\n]+|\\s+(?!\\S)|\\s+"
        },
        "behavior": "Removed",
        "invert": true
      },

which ends up here:

class SplitPreTokenizer: PreTokenizer {
...
    func preTokenize(text: String) -> [String] {
        guard let pattern = pattern else { return [text] }
        return pattern.split(text, invert: invert)
    }

Given the input string "Why did the chicken cross the road? " it returns an array containing a single empty string:

(lldb) p pattern.split(text, invert: true)
([String]) 1 value {
  [0] = ""
}

I observed that if invert were false, it gives something that looks reasonable to my eyes:

(lldb) p pattern.split(text, invert: false)
([String]) 10 values {
  [0] = "Why"
  [1] = " did"
  [2] = " the"
  [3] = " chicken"
  [4] = " cross"
  [5] = " the"
  [6] = " road"
  [7] = "?"
  [8] = " "
  [9] = ""
}

I am not sure what the behavior is supposed to be here. I wonder if the behavior of invert might be ... inverted? I think the configuration is correct because the Python tokenizer behaves correctly.
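For reference, here is a minimal sketch of the semantics I would expect from "behavior": "Removed", assuming that invert: true means the regex matches are kept as the pieces and the text between them is dropped, while invert: false drops the matches and keeps the gaps. The helper name splitSketch is hypothetical and is not part of swift-transformers; it uses NSRegularExpression directly and silently skips empty pieces, which may or may not be what the library should do.

```swift
import Foundation

// Hypothetical helper illustrating the assumed "Removed" semantics.
// This is NOT the swift-transformers implementation.
func splitSketch(_ text: String, pattern: String, invert: Bool) -> [String] {
    // Fall back to the whole text if the pattern does not compile.
    guard let regex = try? NSRegularExpression(pattern: pattern) else { return [text] }
    let fullRange = NSRange(text.startIndex..<text.endIndex, in: text)

    var pieces: [String] = []
    var cursor = text.startIndex

    for match in regex.matches(in: text, range: fullRange) {
        guard let range = Range(match.range, in: text) else { continue }
        if invert {
            // Assumption: with invert, the matches themselves are the pieces.
            if !range.isEmpty { pieces.append(String(text[range])) }
        } else if cursor < range.lowerBound {
            // Without invert, keep the gap that precedes this match.
            pieces.append(String(text[cursor..<range.lowerBound]))
        }
        cursor = range.upperBound
    }

    // Without invert, keep any text left after the last match.
    if !invert, cursor < text.endIndex {
        pieces.append(String(text[cursor...]))
    }
    return pieces
}

// Small check with a simple pattern.
print(splitSketch("Why did", pattern: #"\w+"#, invert: true))   // ["Why", "did"]
print(splitSketch("Why did", pattern: #"\w+"#, invert: false))  // [" "]

// The pattern from the tokenizer configuration above, written as a raw string.
let stablelmPattern = #"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"#
print(splitSketch("Why did the chicken cross the road? ", pattern: stablelmPattern, invert: true))
```

If this reading of invert is right, the invert: true call should yield the word-like pieces rather than a single empty string, which would suggest the two branches are swapped somewhere in the current split implementation.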

pcuenca commented 6 months ago

Thanks for the report @davidkoski, I'll take a look!