mtkresearch / generative-fusion-decoding

Generative Fusion Decoding (GFD) is a novel framework for integrating Large Language Models (LLMs) into multi-modal text recognition systems like ASR and OCR, improving performance and efficiency by enabling seamless fusion without requiring re-training.
https://arxiv.org/abs/2405.14259
Apache License 2.0

tokenizer bug #4

Open qweszxc7410 opened 5 days ago

qweszxc7410 commented 5 days ago

I changed the llm_model_path to 'yentinglin/Llama-3-Taiwan-8B-Instruct', and then this error occurred. It seems that the Llama-3-Taiwan-8B-Instruct tokenizer.json does not contain "<0xE8>". GFD is byte-based. Is it possible to fix this, or is something else the main cause? Thanks

======================================== [0] asr_score=-0.80078125, llm_score=-6.636518955230713,fuse_score=-1.9679287910461427, 各位

Traceback (most recent call last):
  File "/home/ubuntu/A10California/generative-fusion-decoding/benchmarks/run_single_file.py", line 40, in <module>
    main()
  File "/home/ubuntu/A10California/generative-fusion-decoding/benchmarks/run_single_file.py", line 33, in main
    result = model.get_transcription(args.audio_file_path)
  File "/home/ubuntu/A10California/generative-fusion-decoding/venv/lib/python3.10/site-packages/gfd-0.0.1-py3.10.egg/gfd/gfd.py", line 117, in get_transcription
  File "/home/ubuntu/A10California/generative-fusion-decoding/venv/lib/python3.10/site-packages/gfd-0.0.1-py3.10.egg/gfd/gfd.py", line 245, in _get_transcription
  File "/home/ubuntu/A10California/generative-fusion-decoding/venv/lib/python3.10/site-packages/gfd-0.0.1-py3.10.egg/gfd/tokenizer.py", line 75, in tokenize_from_byte
KeyError: b'\xe8'
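The KeyError comes from looking up a raw byte in a vocabulary that has no byte-fallback tokens. A quick illustrative sketch of what GFD's byte lookup expects (the helper `has_byte_fallback` and the toy vocab fragments are my own, not part of GFD):

```python
# GFD's byte tokenization assumes the vocabulary contains
# SentencePiece-style byte-fallback tokens such as "<0xE8>".
# Llama 3 uses a BPE vocabulary without those tokens, so the
# byte lookup fails with KeyError: b'\xe8'.

def has_byte_fallback(vocab: dict) -> bool:
    """Return True if all 256 byte-fallback tokens are present."""
    return all(f"<0x{b:02X}>" in vocab for b in range(256))

# Mistral/Breeze-style vocab fragment (byte-fallback tokens present):
mistral_like = {f"<0x{b:02X}>": b + 3 for b in range(256)}
# Llama-3-style vocab fragment (plain BPE entries, no byte tokens):
llama3_like = {"hello": 0, "世": 1, "界": 2}

print(has_byte_fallback(mistral_like))  # True
print(has_byte_fallback(llama3_like))   # False
```

Running the same check on the real vocabulary (e.g. `AutoTokenizer.from_pretrained(...).get_vocab()`) shows whether a candidate model is compatible before wiring it into GFD.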

Splend1d commented 5 days ago

Hi,

Thank you for your interest in this project! Currently, only Breeze and Mistral models are supported (please refer to the "Warning" section). The reason is that the algorithm needs a "byte tokenization" method. Different tokenizers represent tokens in different ways, and we have not found a way to systematically patch this feature for all models, so we chose to support only Mistral and Breeze.

In short, this is not a bug, and we probably will not patch it soon, as the list of models is endless. But we encourage you to do so! It is not that complicated: all you have to do is create a custom tokenizer that supports the byte functions we have implemented.
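For a custom tokenizer, note that BPE models like Llama 3 do still encode arbitrary bytes, just via GPT-2's byte-to-unicode table rather than "<0xNN>" tokens. A hedged sketch of how a byte lookup could be adapted (the names `bytes_to_unicode` and `byte_to_token_string` are illustrative, not GFD's API):

```python
# Hypothetical byte lookup for BPE tokenizers like Llama 3, which store
# raw bytes via GPT-2's byte-to-unicode mapping instead of "<0xNN>" tokens.

def bytes_to_unicode():
    """GPT-2 style mapping from byte values to printable unicode chars."""
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:  # remap non-printable bytes to chars above 255
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

BYTE_TO_CHAR = bytes_to_unicode()

def byte_to_token_string(raw: bytes) -> str:
    """Map raw bytes to the string form a GPT-2/Llama-3 BPE vocab stores."""
    return "".join(BYTE_TO_CHAR[b] for b in raw)

print(byte_to_token_string(b"\xe8"))  # "è"
```

A replacement for `tokenize_from_byte` could then look up these mapped strings in the BPE vocabulary instead of "<0xNN>" entries, though matching GFD's scoring semantics would need care.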

Best, Jeff

qweszxc7410 commented 4 days ago

Thank you for your explanation.