iamlemec / bert.cpp

GGML implementation of BERT model with Python bindings and quantization.
MIT License

Add unit tests #9

Open snowyu opened 7 months ago

snowyu commented 7 months ago

The default test model is bge-large-zh-v1.5; see TEST_MODEL_NAME in tests/CMakeLists.txt to change it.

The test prompts are in test_prompts.txt.

Run the unit tests with: `make -C build test`

It will:

  1. Generate the HF reference tokens for comparison in hf_tokenized_ids.txt (requires Python with AutoTokenizer installed); see the sketch after this list
  2. Compare the tokens generated by bert_cpp against hf_tokenized_ids.txt
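
A minimal sketch of what step 1 could look like (model and file names follow the defaults above; the actual script under tests/ may differ):

```python
# Sketch: dump HF token ids for each prompt so the C++ test can compare against them.
from transformers import AutoTokenizer

MODEL_NAME = "BAAI/bge-large-zh-v1.5"  # matches the TEST_MODEL_NAME default

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

with open("test_prompts.txt", encoding="utf-8") as f:
    prompts = [line.rstrip("\n") for line in f if line.strip()]

with open("hf_tokenized_ids.txt", "w", encoding="utf-8") as out:
    for prompt in prompts:
        ids = tokenizer.encode(prompt)  # includes special tokens ([CLS]/[SEP])
        out.write(" ".join(map(str, ids)) + "\n")
```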
iamlemec commented 7 months ago

@snowyu Sorry, I was just putting in the new testing infrastructure! Check it out in tests. I think it'll make it a lot easier to fill out the test suite. Feel free to add any additional strings or tests you think are needed.

This is all through the Python API. I'm not too worried about it being different from the underlying C++. It's a really thin wrapper, for tokenization especially. And if something breaks, it'll show up as a difference from huggingface anyway.
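
For illustration, a tokenization check through the Python API might look roughly like this (the `BertModel` class, `tokenize` method, and model path are assumptions, not necessarily the actual bindings):

```python
# Hypothetical sketch of a Python-API tokenization test; names are illustrative.
from transformers import AutoTokenizer
from bert_cpp import BertModel  # assumed binding name

hf_tok = AutoTokenizer.from_pretrained("BAAI/bge-base-en-v1.5")
model = BertModel("models/bge-base-en-v1.5-f16.gguf")  # assumed GGUF path

def check(text):
    # the C++ tokenizer should match the HF reference ids exactly
    assert model.tokenize(text) == hf_tok.encode(text), text

for text in ["hello world", "Ich verstehe nur Bahnhof", "你好世界"]:
    check(text)
```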

And incidentally, these tests all work for bge-base-en-v1.5 and bge-base-zh-v1.5 but they fail for bge-m3, so I guess something with pre-tokenization or vocab loading is off there.

snowyu commented 7 months ago

I thought that converting text to tokens is only necessary for embedding (bert_forward). If you are using Python, JS, or Rust, you can use the official tokenizer directly, so it is not needed in C++. This situation will continue until llama.cpp fully implements HF's dynamic tokenizer.

The current problem is that the GGUF format of llama.cpp lacks the configuration required by the tokenizer. See: https://github.com/ggerganov/llama.cpp/pull/5143

Regarding bge-m3's tokenizer: it does special processing in the normalizer and again in the subsequent post_processor; see its tokenizer.json.

iamlemec commented 7 months ago

We still need to support tokenization on the cpp side of things. People aren't always going to want to use the Python interface, if they're doing inference on edge devices for instance. In principle, we could add a use_hf option to the Python tokenizer interface, since we already require the library for GGUF conversion.
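
A rough sketch of what such a `use_hf` switch could look like (the function name, signature, and `model.tokenize` call are illustrative only, not the existing API):

```python
# Hypothetical sketch of a use_hf escape hatch on the tokenizer interface.
from transformers import AutoTokenizer

def tokenize(model, text, use_hf=False, hf_repo=None):
    if use_hf:
        # reuse the HF tokenizer we already depend on for GGUF conversion
        tok = AutoTokenizer.from_pretrained(hf_repo)
        return tok.encode(text)
    # otherwise fall through to the C++ tokenizer via the bindings
    return model.tokenize(text)
```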

I'm not sure putting the full tokenizer config in the GGUF file is the best option here. In the end, the difficult part is implementing the tokenizer logic in C++. As we get more coverage, we can include tokenizer params that reflect that in the GGUF and try to do as much pre-processing in Python land on conversion. Realistically, there aren't that many tokenizer types out there. I think we should err on the side of keeping things as simple as possible. That said, if there was a straightforward tokenizer.cpp library out there, I'd be happy to make it a dependency!
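
For example, a conversion-time sketch along these lines (the `tokenizer.ggml.*` key names are illustrative, not an agreed convention):

```python
# Hypothetical sketch: during GGUF conversion, stash the tokenizer settings we can
# actually handle as key/value pairs so the C++ side can read them later.
import json
from gguf import GGUFWriter

writer = GGUFWriter("bge-base-en-v1.5-f16.gguf", arch="bert")
with open("tokenizer.json", encoding="utf-8") as f:
    tok = json.load(f)

normalizer = tok.get("normalizer") or {}
writer.add_bool("tokenizer.ggml.lowercase", bool(normalizer.get("lowercase", False)))
# pre-process in Python land: serialize only the pieces the C++ tokenizer understands
writer.add_string("tokenizer.ggml.pre_tokenizer", json.dumps(tok.get("pre_tokenizer")))

# ... vocab and tensors are added here by the existing conversion code ...
writer.write_header_to_file()
writer.write_kv_data_to_file()
writer.write_tensors_to_file()
writer.close()
```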

As for bge-m3, yeah, it has a "precompiled_charsmap" in there. Looking at transformers.js, I think they actually hard-code a default behavior there and don't use the actual charsmap. But the Rust implementation in HF tokenizers doesn't seem too bad. It would be nice to know if the approach used in transformers.js works in almost all cases, particularly bge-m3.
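
If that reading of transformers.js is right, the fallback amounts to something like plain Unicode normalization instead of interpreting the charsmap blob (an assumption, for illustration only):

```python
# Guess at a hard-coded fallback for a Precompiled normalizer: NFKC normalization
# rather than applying the precompiled_charsmap. Assumption for illustration only.
import unicodedata

def precompiled_fallback(text: str) -> str:
    return unicodedata.normalize("NFKC", text)
```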

snowyu commented 7 months ago

> We still need to support tokenization on the cpp side of things.

Yes, indeed. But we need a workaround for now.

> I'm not sure putting the full tokenizer config in the GGUF file is the best option here.

What would the best option be? We need a way to move forward now. If GGUF does not embed the tokenizer-related parameters, how do we proceed? Whether the tokenizer is implemented in C++ or taken from another language, the parameters are required. If the necessary parameters are missing and we still have to use GGUF, the only ugly way is to copy "tokenizer.json" and "tokenizer_config.json" into the directory where the GGUF file is located, as sketched below.
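
A sketch of that workaround, assuming the json files have been copied next to the GGUF:

```python
# Load the HF tokenizer from the directory containing the GGUF file, relying on
# tokenizer.json / tokenizer_config.json having been copied there.
import os
from transformers import AutoTokenizer

def load_sidecar_tokenizer(gguf_path):
    model_dir = os.path.dirname(os.path.abspath(gguf_path))
    return AutoTokenizer.from_pretrained(model_dir)  # reads the copied json files
```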

Therefore, I think letting the tokenizer parameters exist in GGUF is the first step, and implementing the tokenizer is the second step. In fact, some parameters are already in the GGUF file, and you are already using them.

Going back to bge-m3, you will find that parameters are still missing and it cannot be tokenized. If you use sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2, you will find that the pre_tokenizer array parameter is missing too.

I don't understand why it would be a problem to write tokenizer parameters into the GGUF now, even if they are temporarily unused.

Btw, transformers.js has the Precompiled normalizer, and bge-m3 works well on transformers.js. See https://github.com/xenova/transformers.js/blob/41f98b761f183b1cd27d6cf461e5843e0af8caf7/src/tokenizers.js#L2234