turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Support GPT2 tokenizer for CausalLM 72b #196

Closed · CyberTimon closed this issue 10 months ago

CyberTimon commented 10 months ago

Hello! When trying to load CausalLM 72b with exllamav2 to quantize it or run inference, you get this error:

  File "/home/ubuntu/git/exllamav2/exllamav2/tokenizer.py", line 67, in __init__
    else: raise FileNotFoundError("No supported tokenizer found.")
FileNotFoundError: No supported tokenizer found.

But you can load the tokenizer vocab as a GPT2 tokenizer with AutoTokenizer. Given the recent Deepseek tokenizer addition, I don't think it would be hard to also support this one, or am I wrong?
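Something like this should be enough to confirm it, as a minimal sketch using the transformers package (the repo ID is taken from the links below):

from transformers import AutoTokenizer

# AutoTokenizer reads tokenizer_config.json, sees "tokenizer_class": "GPT2Tokenizer",
# and builds a GPT2 tokenizer from vocab.json (and merges.txt).
tok = AutoTokenizer.from_pretrained("CausalLM/72B-preview")
print(tok.encode("Hello, world!"))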

Kind regards, Timon Käch

Ref: https://huggingface.co/CausalLM/72B-preview/discussions/2 CausalLM-72B: https://huggingface.co/CausalLM/72B-preview

turboderp commented 10 months ago

I don't know since I can't look at the files. If there's a tokenizer.json file, ExLlamaV2 should now be able to read that and use it as long as the Tokenizers package is installed. And if the model has been successfully Llamafied, everything should just work.

If it still has the same TikToken model file as the original Qwen model, that may take some additional work. I can investigate once I'm able to access the model. (No, I'm not clicking the button. 😛)

CyberTimon commented 10 months ago

Sure, thanks for the answer. The model has a tokenizer_config.json file which looks like this:

{
  "add_prefix_space": false,
  "bos_token": "<|endoftext|>",
  "tokenizer_class": "GPT2Tokenizer",
  "clean_up_tokenization_spaces": true,
  "eos_token": "<|endoftext|>",
  "model_max_length": 1000000000000000019884624838656,
  "unk_token": "<|endoftext|>"
}
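(As far as I can tell, that huge model_max_length is just int(1e30), the placeholder value transformers writes when a tokenizer has no real length limit, so nothing is wrong there.)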

It also has a vocab.json, which is quite big and has the following structure:

{
  "!": 0,
  "\"": 1,
  "#": 2,
  "$": 3,
  "%": 4,
  "&": 5,
  "'": 6,
  ...

Does this look like something that can be loaded? Thank you 😄
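For what it's worth, the Tokenizers package can apparently build a GPT2-style byte-level BPE tokenizer directly from these files. A rough sketch, assuming the repo also ships the usual merges.txt next to vocab.json:

from tokenizers import ByteLevelBPETokenizer

# GPT2-style byte-level BPE needs both the vocabulary and the merge rules.
tok = ByteLevelBPETokenizer("vocab.json", "merges.txt")
enc = tok.encode("Hello, world!")
print(enc.ids)
print(enc.tokens)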

CyberTimon commented 10 months ago

Does this error just happen because it only searches for tokenizer.json / tokenizer.model, and after adding vocab.json support it should work? I'm still downloading the model at the moment, which is why I can't test it yet, but I know this error will pop up, since other people have hit it too.

turboderp commented 10 months ago

It loads the tokenizer.model file using SentencePiece or tokenizer.json using Tokenizers. It's likely that Tokenizers can also work with a vocab.json file, maybe initializing from some of the information in tokenizer_config.json? Hard to say. It's not like any of this is really standardized.
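Roughly, the current detection works like this (a paraphrased sketch, not the actual exllamav2 source; the final comment only marks where a vocab.json branch could slot in):

import os

def load_tokenizer(model_dir):
    # Prefer a SentencePiece model file if one exists.
    spm_path = os.path.join(model_dir, "tokenizer.model")
    if os.path.exists(spm_path):
        from sentencepiece import SentencePieceProcessor
        return SentencePieceProcessor(model_file=spm_path)
    # Otherwise fall back to a Hugging Face Tokenizers file.
    hf_path = os.path.join(model_dir, "tokenizer.json")
    if os.path.exists(hf_path):
        from tokenizers import Tokenizer
        return Tokenizer.from_file(hf_path)
    # A GPT2-style branch (vocab.json + merges.txt via ByteLevelBPETokenizer)
    # could slot in here; illustration only, not implemented.
    raise FileNotFoundError("No supported tokenizer found.")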

CyberTimon commented 10 months ago

I will try it once it has finished downloading. Also, check here if you want to see how the GPT2 tokenizer works: https://huggingface.co/gpt2/tree/main. CausalLM 72b has basically the same tokenizer config and vocab.

CyberTimon commented 10 months ago

Closing, as the model has other issues that have nothing to do with this repo.