skeskinen / bert.cpp

ggml implementation of BERT
MIT License

subword `#` should be an option. #33

Open FFengIll opened 1 year ago

FFengIll commented 1 year ago

For BERT, many models use `##` as the subword symbol, but not all: some popular BERT-based models define their own subword symbol.

For example, in e5 the symbol is `▁` (U+2581):

```python
>>> a = '▁'
>>> a.encode('utf-8')
b'\xe2\x96\x81'
```
FFengIll commented 1 year ago

Furthermore, no rule forces the use of `##`.

FFengIll commented 1 year ago

In the model, the subword symbol is usually called `replacement` or `continuing_subword_prefix`. It actually appears in `tokenizer.json`.
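A minimal sketch of reading the prefix from the tokenizer config instead of hardcoding `##`, assuming the HuggingFace `tokenizer.json` layout where WordPiece models store a `continuing_subword_prefix` field under `"model"` (the file path is a placeholder):

```python
import json

def get_subword_prefix(path="tokenizer.json"):
    """Return the continuing-subword prefix declared in tokenizer.json,
    or None if the model does not declare one (e.g. Unigram models like
    e5, which use the '\u2581' word-boundary marker instead)."""
    with open(path, encoding="utf-8") as f:
        tok = json.load(f)
    return tok.get("model", {}).get("continuing_subword_prefix")
```

For a WordPiece tokenizer this returns `"##"`; for e5's Unigram tokenizer the field is absent, which is itself a signal to switch to the `▁` convention.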

skeskinen commented 1 year ago

Hi, I was wondering about the subword rules too, with regard to https://github.com/skeskinen/bert.cpp/pull/31. I remember trying to get the tokens from the tokenizer, like you did in the PR, but I also remember running into some issue with the subwords when I tried.

Does the code in #31 handle subwords? Do you have an idea of how to handle models like e5?

Also, unrelated, but a thought I had earlier: it would be nice to convert test_tokenizer.cpp to Python and run the tests against the reference tokenizers.

FFengIll commented 1 year ago

@skeskinen No, #31 only makes the vocab file optional (because it may be missing).

This issue is a separate problem with subwords (I found it because I hit too many unknown tokens when using e5).

Below are some token samples from BERT-based models.

In m3e, the subword prefix is `##`, like in many BERT models:

```json
"##a": 8139,
"03": 8140,
"09": 8141,
"08": 8142,
"28": 8143,
"##2": 8144,
```

In e5, the subword marker is `▁`, since they trained a new tokenizer (below is an excerpt from tokenizer.json):

      [
        "▁si",
        -7.355116367340088
      ],
      [
        "▁ja",
        -7.370460510253906
      ],
      [
        "▁za",
        -7.37307596206665
      ],
      [
        "▁v",
        -7.385393142700195
      ],
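The two vocab samples above follow opposite conventions: WordPiece (m3e) prefixes *continuation* pieces with `##`, while SentencePiece/Unigram (e5) prefixes *word-initial* pieces with `▁`, which stands for a preceding space. A small sketch of how detokenization differs under the two conventions (hypothetical helper names, not code from bert.cpp):

```python
def join_wordpiece(pieces, prefix="##"):
    """WordPiece style: a piece starting with '##' continues the
    previous word; any other piece starts a new space-separated word."""
    out = ""
    for p in pieces:
        if p.startswith(prefix):
            out += p[len(prefix):]
        else:
            out += (" " if out else "") + p
    return out

def join_sentencepiece(pieces, marker="\u2581"):
    """SentencePiece style: concatenate everything, then turn each
    '\u2581' marker back into a space."""
    return "".join(pieces).replace(marker, " ").strip()
```

For instance, `["play", "##ing"]` and `["▁play", "ing"]` both rejoin to `playing`, which is why a tokenizer hardwired to `##` produces unknown tokens on an e5 vocab.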
FFengIll commented 1 year ago

For now, I do not have a good solution for this issue, so I have not implemented a PR for it. Maybe we need more research and discussion.

cgisky1980 commented 1 year ago

> For now, I do not have a good solution for this issue, so I have not implemented a PR for it. Maybe we need more research and discussion.

Keep it up! We need cross-platform Chinese and English embeddings; the multilingual version of E5 is a good choice.