turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Support for Huggingface Fast Tokenizers #188

Closed bibekyess closed 10 months ago

bibekyess commented 10 months ago

Hello! I found some Llama 2 models that use the fast tokenizer provided by the Hugging Face tokenizers library rather than the SentencePiece package used by regular Llama models, for instance beomi/llama-2-ko-7b. It seems the current ExLlamaV2Tokenizer only supports the SentencePiece tokenizer, which requires tokenizer.model. Could you please add support for Hugging Face tokenizers as well? I tried changing tokenizer.py to accomplish this, and it worked well in exllama, but in exllamav2 it only works sometimes; other times it gives the following errors.

ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/home/bibekyess/anaconda3/envs/exllamav2-env/lib/python3.11/site-packages/uvicorn/protocols/http/h11_impl.py", line 408, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bibekyess/anaconda3/envs/exllamav2-env/lib/python3.11/site-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bibekyess/anaconda3/envs/exllamav2-env/lib/python3.11/site-packages/fastapi/applications.py", line 1106, in __call__
    await super().__call__(scope, receive, send)
  File "/home/bibekyess/anaconda3/envs/exllamav2-env/lib/python3.11/site-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/bibekyess/anaconda3/envs/exllamav2-env/lib/python3.11/site-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/home/bibekyess/anaconda3/envs/exllamav2-env/lib/python3.11/site-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/home/bibekyess/anaconda3/envs/exllamav2-env/lib/python3.11/site-packages/starlette/middleware/cors.py", line 83, in __call__
    await self.app(scope, receive, send)
  File "/home/bibekyess/anaconda3/envs/exllamav2-env/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
    raise exc
  File "/home/bibekyess/anaconda3/envs/exllamav2-env/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File "/home/bibekyess/anaconda3/envs/exllamav2-env/lib/python3.11/site-packages/fastapi/middleware/asyncexitstack.py", line 20, in __call__
    raise e
  File "/home/bibekyess/anaconda3/envs/exllamav2-env/lib/python3.11/site-packages/fastapi/middleware/asyncexitstack.py", line 17, in __call__
    await self.app(scope, receive, send)
  File "/home/bibekyess/anaconda3/envs/exllamav2-env/lib/python3.11/site-packages/starlette/routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File "/home/bibekyess/anaconda3/envs/exllamav2-env/lib/python3.11/site-packages/starlette/routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File "/home/bibekyess/anaconda3/envs/exllamav2-env/lib/python3.11/site-packages/starlette/routing.py", line 66, in app
    response = await func(request)
               ^^^^^^^^^^^^^^^^^^^
  File "/home/bibekyess/anaconda3/envs/exllamav2-env/lib/python3.11/site-packages/fastapi/routing.py", line 274, in app
    raw_response = await run_endpoint_function(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bibekyess/anaconda3/envs/exllamav2-env/lib/python3.11/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bibekyess/exllama_v2_sandbox/tabbyAPI/main.py", line 188, in generate_completion
    response_text = model_container.generate(data.prompt, **data.to_gen_params())
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bibekyess/exllama_v2_sandbox/tabbyAPI/model.py", line 230, in generate
    reponse = "".join(gen)
              ^^^^^^^^^^^^
  File "/home/bibekyess/exllama_v2_sandbox/tabbyAPI/model.py", line 363, in generate_gen
    chunk, eos, tokens = self.generator.stream()
                         ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bibekyess/exllama_v2_sandbox/exllamav2/exllamav2/generator/streaming.py", line 155, in stream
    next_token, new_text = self._catch_utf8(next_token, new_text)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bibekyess/exllama_v2_sandbox/exllamav2/exllamav2/generator/streaming.py", line 218, in _catch_utf8
    id_to_ord = self.tokenizer.get_id_to_ord_list()
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bibekyess/exllama_v2_sandbox/exllamav2/exllamav2/tokenizer.py", line 275, in get_id_to_ord_list
    match = self.ord_exp.match(p)
            ^^^^^^^^^^^^^^^^^^^^^
TypeError: expected string or bytes-like object, got 'int'
Response: 247 tokens generated in 7.81 seconds (31.63 T/s, context 9 tokens)
INFO:     127.0.0.1:47692 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     127.0.0.1:34096 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/home/bibekyess/anaconda3/envs/exllamav2-env/lib/python3.11/site-packages/uvicorn/protocols/http/h11_impl.py", line 408, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bibekyess/anaconda3/envs/exllamav2-env/lib/python3.11/site-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bibekyess/anaconda3/envs/exllamav2-env/lib/python3.11/site-packages/fastapi/applications.py", line 1106, in __call__
    await super().__call__(scope, receive, send)
  File "/home/bibekyess/anaconda3/envs/exllamav2-env/lib/python3.11/site-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/bibekyess/anaconda3/envs/exllamav2-env/lib/python3.11/site-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/home/bibekyess/anaconda3/envs/exllamav2-env/lib/python3.11/site-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/home/bibekyess/anaconda3/envs/exllamav2-env/lib/python3.11/site-packages/starlette/middleware/cors.py", line 83, in __call__
    await self.app(scope, receive, send)
  File "/home/bibekyess/anaconda3/envs/exllamav2-env/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
    raise exc
  File "/home/bibekyess/anaconda3/envs/exllamav2-env/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File "/home/bibekyess/anaconda3/envs/exllamav2-env/lib/python3.11/site-packages/fastapi/middleware/asyncexitstack.py", line 20, in __call__
    raise e
  File "/home/bibekyess/anaconda3/envs/exllamav2-env/lib/python3.11/site-packages/fastapi/middleware/asyncexitstack.py", line 17, in __call__
    await self.app(scope, receive, send)
  File "/home/bibekyess/anaconda3/envs/exllamav2-env/lib/python3.11/site-packages/starlette/routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File "/home/bibekyess/anaconda3/envs/exllamav2-env/lib/python3.11/site-packages/starlette/routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File "/home/bibekyess/anaconda3/envs/exllamav2-env/lib/python3.11/site-packages/starlette/routing.py", line 66, in app
    response = await func(request)
               ^^^^^^^^^^^^^^^^^^^
  File "/home/bibekyess/anaconda3/envs/exllamav2-env/lib/python3.11/site-packages/fastapi/routing.py", line 274, in app
    raw_response = await run_endpoint_function(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bibekyess/anaconda3/envs/exllamav2-env/lib/python3.11/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bibekyess/exllama_v2_sandbox/tabbyAPI/main.py", line 188, in generate_completion
    response_text = model_container.generate(data.prompt, **data.to_gen_params())
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bibekyess/exllama_v2_sandbox/tabbyAPI/model.py", line 230, in generate
    reponse = "".join(gen)
              ^^^^^^^^^^^^
  File "/home/bibekyess/exllama_v2_sandbox/tabbyAPI/model.py", line 363, in generate_gen
    chunk, eos, tokens = self.generator.stream()
                         ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bibekyess/exllama_v2_sandbox/exllamav2/exllamav2/generator/streaming.py", line 155, in stream
    next_token, new_text = self._catch_utf8(next_token, new_text)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bibekyess/exllama_v2_sandbox/exllamav2/exllamav2/generator/streaming.py", line 220, in _catch_utf8
    b = id_to_ord[t]
        ~~~~~~~~~^^^
IndexError: list index out of range

Interestingly, inference sometimes succeeds and sometimes fails. So I would like to request official support for Hugging Face Fast Tokenizers.
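
For illustration, here is a minimal sketch of the kind of fallback I have in mind, loading tokenizer.json with the Hugging Face tokenizers library when tokenizer.model is absent. The helper name and paths are hypothetical; this is not existing ExLlamaV2 code:

# Illustrative sketch only: load a HF fast tokenizer when tokenizer.model is absent.
# Requires the "tokenizers" package (pip install tokenizers).
import os
from tokenizers import Tokenizer

def load_fast_tokenizer(model_dir: str) -> Tokenizer:
    # Hypothetical helper: prefer tokenizer.json (the HF fast tokenizer format).
    tokenizer_json = os.path.join(model_dir, "tokenizer.json")
    if not os.path.isfile(tokenizer_json):
        raise FileNotFoundError("No tokenizer.json found; fall back to SentencePiece")
    return Tokenizer.from_file(tokenizer_json)

tok = load_fast_tokenizer("/path/to/llama-2-ko-7b")  # local directory containing tokenizer.json
ids = tok.encode("Hello!").ids   # token IDs from the fast tokenizer
text = tok.decode(ids)           # round-trip back to text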

Thank you! :)

bibekyess commented 10 months ago

Maybe the issue arises because of TabbyAPI. I created my own FastAPI server and I am not facing such issues.

turboderp commented 10 months ago

I've been working on this for most of the day. I have HF tokenizers working more or less; there are just a few kinks to iron out because the EXL2 tokenizer does more than just wrap around SentencePiece. There are also, apparently, bugs in the HF Tokenizer implementation that I have to work around. But getting there.

bibekyess commented 10 months ago

Great to hear that! Thank you! :)

turboderp commented 10 months ago

It should be working now. I've mainly focused on deepseek, and HF tokenization is a deep, deep rabbit hole, so I'll probably need to test a lot more models to fix various edge cases.

turboderp commented 10 months ago

It could be. How are you using the model?

turboderp commented 10 months ago

I know the system prompt is inconsistent across the deepseek models. You could try ExUI, which I've confirmed works well, at least with my own conversions of 67B-chat and the built-in deepseek prompt format.

turboderp commented 10 months ago

Okay, so I was able to test that model, and it seems to be fine. The issue you're having is probably that the model was finetuned with a RoPE scaling factor of 4, and ExLlamaV2 doesn't (yet?) automatically read that from the config. But if you run the chat example with -rs 4 it should work. It also seems to be a bit sensitive to repetition penalty, so I would lower it from the default (1.15) to something like -repp 1.05.
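
For reference, this is roughly what -rs 4 corresponds to when loading the model from Python. The scale_pos_emb attribute name is my assumption for the config field behind that flag, so check it against your ExLlamaV2 version:

# Rough sketch: apply linear RoPE scaling when loading a model from Python.
# scale_pos_emb is assumed to be the config field behind the chat example's -rs flag.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer

config = ExLlamaV2Config()
config.model_dir = "/path/to/model-exl2"  # placeholder path to the converted model
config.prepare()
config.scale_pos_emb = 4.0                # same effect as -rs 4 (assumed attribute name)

model = ExLlamaV2(config)
model.load()
tokenizer = ExLlamaV2Tokenizer(config)
cache = ExLlamaV2Cache(model)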

turboderp commented 10 months ago

As for running multiple queries on an empty context, I guess it would be a simple feature to add, but the chatbot isn't really meant to be cluttered with too many funky features. ExUI is more advanced, with sessions, model loading/unloading, notepad mode, etc.

turboderp commented 10 months ago

Well, like I said it's a simple thing, so I just added it, cause why not. Run the chatbot with --amnesia and it will forget the context after each response.

turboderp commented 10 months ago

Yes, there are two buttons I need to click. Ahem.. try it again. (:

turboderp commented 10 months ago

> If you give it some programming task where it needs to generate two long functions the output starts glitching after 50 lines of code or so:

This could be related to the RoPE scaling. If the model was converted without that setting, the calibration is going to be very off.

turboderp commented 10 months ago

From the experiments I and others have done, the calibration dataset doesn't ultimately matter that much. But it's probably a good idea to have some code in there, if nothing else then to make sure all the "coding tokens" and their embeddings are accounted for.

As for presets, no. I don't really believe in samplers as a way to fix bad predictions from language models. So the default is just top-K+top-P, with a slight repetition penalty (which is probably a bit too high by default), and everything else is just there because people have requested it. Locally typical sampling has some good theory behind it, I guess?
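
As a rough illustration of that default, a top-K + top-P setup with a mild repetition penalty; the attribute names are my best guess at the sampler settings object and may differ between versions:

# Rough sketch of a conservative sampling setup: top-K + top-P plus a mild
# repetition penalty. Attribute names are assumptions; verify against your version.
from exllamav2.generator import ExLlamaV2Sampler

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.8
settings.top_k = 50
settings.top_p = 0.8
settings.token_repetition_penalty = 1.05  # milder than the 1.15 default discussed above
settings.typical = 0.0                    # locally typical sampling disabled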

turboderp commented 10 months ago

Cleaning up some issues, and this one is technically completed.