Open FFengIll opened 1 year ago
Furthermore, there is no rule that forces the use of `##`. In the model, the subword symbol is always referred to as `replacement` or `continuing_subword_prefix`. It actually shows up in tokenizer.json.
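Since the prefix is declared in tokenizer.json, it can be read out directly. Here is a minimal Python sketch; the file path and the exact JSON layout (a `model.continuing_subword_prefix` field, as in Hugging Face WordPiece tokenizers) are assumptions, and the function name is made up for illustration:

```python
import json

def get_subword_prefix(path):
    # Return the continuing-subword prefix declared in tokenizer.json,
    # or None if the model section does not declare one. WordPiece
    # models store it as model.continuing_subword_prefix; other model
    # types (e.g. Unigram) may encode the marker in the tokens instead.
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    model = data.get("model", {})
    return model.get("continuing_subword_prefix")
```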
Hi, I was wondering about the subword rules too, with regard to https://github.com/skeskinen/bert.cpp/pull/31. I remember trying to get the tokens from the tokenizer, like you did in the PR, but I also remember having some issues with the subwords when I tried this.
Does the code in #31 handle subwords? Do you have an idea on how to handle models like e5?
Also, unrelated, but a thought I had earlier: it would be nice to convert test_tokenizer.cpp to Python and run the tests against the reference tokenizers.
@skeskinen no, #31 only makes the vocab unnecessary (because it may be missing).
This issue is a separate problem about subwords (I found it because I hit too many unknown tokens when using e5).
Below are some token samples from bert-based models.
In m3e, the subword prefix is `##`, like in many BERT models:
"##a": 8139,
"03": 8140,
"09": 8141,
"08": 8142,
"28": 8143,
"##2": 8144,
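To make the `##` convention concrete: a token starting with `##` continues the previous word, and anything else starts a new word. A hypothetical Python helper (the function name is made up for illustration) that rebuilds text from WordPiece-style tokens:

```python
def join_wordpiece(tokens, prefix="##"):
    # Rebuild text from WordPiece tokens: a token starting with the
    # continuation prefix is glued to the previous word; any other
    # token starts a new space-separated word.
    out = []
    for tok in tokens:
        if tok.startswith(prefix):
            out.append(tok[len(prefix):])
        else:
            if out:
                out.append(" ")
            out.append(tok)
    return "".join(out)

# join_wordpiece(["play", "##ing", "games"]) -> "playing games"
```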
In e5, the subword symbol is `▁`, since they trained a new tokenizer (below is a partial copy from tokenizer.json):
[
"▁si",
-7.355116367340088
],
[
"▁ja",
-7.370460510253906
],
[
"▁za",
-7.37307596206665
],
[
"▁v",
-7.385393142700195
],
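Note that `▁` (U+2581) works the other way around from `##`: it marks the start of a new word (it replaces the space) rather than a continuation. A minimal sketch, assuming SentencePiece-style tokens; the helper name is made up for illustration:

```python
def join_sentencepiece(tokens):
    # Rebuild text from SentencePiece-style tokens: U+2581 marks a
    # word boundary and stands in for the space that preceded it.
    text = "".join(tokens).replace("\u2581", " ")
    return text.lstrip(" ")

# join_sentencepiece(["\u2581si", "mple"]) -> "simple"
```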
For now, I do not have a good idea for this issue, so I have not implemented a PR for it. Maybe we need to do more research and discuss.
Keep it up! A cross-platform Chinese-English embedding model is needed, and the multilingual version of E5 is quite good.
For BERT, many models use `##` as the subword symbol, but not all. Some popular bert-based models define their own subword symbol. For example, in e5 the symbol is `▁`.
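One possible direction for handling both conventions generically: guess the marker from the vocab itself by counting token prefixes. This is only a heuristic sketch with a hypothetical function name, not a proposed implementation:

```python
def guess_subword_marker(vocab):
    # Heuristic: count how many tokens start with each known candidate
    # marker ("##" for WordPiece, U+2581 for SentencePiece-style) and
    # return the dominant one, or None if neither appears.
    candidates = ("##", "\u2581")
    counts = {c: sum(1 for t in vocab if t.startswith(c)) for c in candidates}
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None
```

A marker detected this way would still need cross-checking against `continuing_subword_prefix` in tokenizer.json when that field is present.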