simonw / ttok

Count and truncate text based on tokens
Apache License 2.0

Support for other models via Hugging Face tokenizers #9

Open simonw opened 11 months ago

simonw commented 11 months ago

Refs:

simonw commented 11 months ago

This works already:

$ ttok 'hello world' -m 'hf:TheBloke/Llama-2-70B-fp16'
3
$ ttok 'hello world' -m 'hf:TheBloke/Llama-2-70B-fp16' --encode
1 22172 3186
$ ttok 'hello world' -m 'hf:TheBloke/Llama-2-70B-fp16' --truncate 2
hello

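A note on why `--truncate 2` prints only `hello`: the `--encode` output has three IDs for two words, which suggests the first ID (1) is a beginning-of-sequence marker that counts toward the total but decodes to nothing. That matches the Llama 2 convention, but treat it as an assumption here. The sketch below is a toy stand-in, not the real tokenizer: the two-word vocabulary and the `encode`, `decode`, and `truncate` helpers are hypothetical, and only the IDs are copied from the output above.

```python
# Stand-in tokenizer. The IDs come from the --encode output above;
# the two-word vocabulary is a hypothetical table, not the real one.
BOS_ID = 1  # assumed beginning-of-sequence marker
VOCAB = {"hello": 22172, "world": 3186}
ID_TO_WORD = {token_id: word for word, token_id in VOCAB.items()}


def encode(text):
    """Return token IDs for whitespace-separated words, prefixed with BOS."""
    return [BOS_ID] + [VOCAB[word] for word in text.split()]


def decode(token_ids):
    """Turn IDs back into text, skipping the BOS marker."""
    return " ".join(ID_TO_WORD[i] for i in token_ids if i != BOS_ID)


def truncate(text, max_tokens):
    """Keep only the first max_tokens tokens, then decode them."""
    return decode(encode(text)[:max_tokens])


ids = encode("hello world")
print(len(ids))                       # 3
print(" ".join(str(i) for i in ids))  # 1 22172 3186
print(truncate("hello world", 2))     # hello
```

Under that assumption, a budget of 2 tokens is spent on the marker plus `hello`, so `world` is dropped.
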
simonw commented 11 months ago

This error could be nicer:

$ ttok 'hello world' -m 'hf:TheBloke/Llama-2-70B-fp1621'
Traceback (most recent call last):
  File "/Users/simon/.local/share/virtualenvs/ttok-WqiqFHFP/lib/python3.11/site-packages/huggingface_hub/utils/_errors.py", line 261, in hf_raise_for_status
    response.raise_for_status()
  File "/Users/simon/.local/share/virtualenvs/ttok-WqiqFHFP/lib/python3.11/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://huggingface.co/TheBloke/Llama-2-70B-fp1621/resolve/main/tokenizer.json

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/simon/.local/share/virtualenvs/ttok-WqiqFHFP/bin/ttok", line 33, in <module>
    sys.exit(load_entry_point('ttok', 'console_scripts', 'ttok')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/simon/.local/share/virtualenvs/ttok-WqiqFHFP/lib/python3.11/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/simon/.local/share/virtualenvs/ttok-WqiqFHFP/lib/python3.11/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/Users/simon/.local/share/virtualenvs/ttok-WqiqFHFP/lib/python3.11/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/simon/.local/share/virtualenvs/ttok-WqiqFHFP/lib/python3.11/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/simon/Dropbox/Development/ttok/ttok/cli.py", line 80, in cli
    hf_tokenizer = tokenizers.Tokenizer.from_pretrained(model[3:])
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/simon/.local/share/virtualenvs/ttok-WqiqFHFP/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/simon/.local/share/virtualenvs/ttok-WqiqFHFP/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 1195, in hf_hub_download
    metadata = get_hf_file_metadata(
               ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/simon/.local/share/virtualenvs/ttok-WqiqFHFP/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/simon/.local/share/virtualenvs/ttok-WqiqFHFP/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 1541, in get_hf_file_metadata
    hf_raise_for_status(r)
  File "/Users/simon/.local/share/virtualenvs/ttok-WqiqFHFP/lib/python3.11/site-packages/huggingface_hub/utils/_errors.py", line 293, in hf_raise_for_status
    raise RepositoryNotFoundError(message, response) from e
huggingface_hub.utils._errors.RepositoryNotFoundError: 404 Client Error. (Request ID: Root=1-64fccbd6-6d87f16a3a8346f22a18dff0;f1db440b-2202-43a4-a1e1-2b8792fcb813)

Repository Not Found for url: https://huggingface.co/TheBloke/Llama-2-70B-fp1621/resolve/main/tokenizer.json.
Please make sure you specified the correct `repo_id` and `repo_type`.
If you are trying to access a private or gated repo, make sure you are authenticated.
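
One possible shape for a nicer error, as a minimal sketch rather than the actual fix: wrap the `from_pretrained` call from `cli.py` and re-raise a one-line message. `TokenizerLoadError`, `load_hf_tokenizer`, and `failing_loader` are hypothetical names used for illustration. The loader is passed in as a parameter so the example runs without network access; in ttok it would be `tokenizers.Tokenizer.from_pretrained`, and since the traceback shows the CLI is built on click, the error would more naturally be raised as `click.ClickException` so that click prints the message without a traceback.

```python
class TokenizerLoadError(Exception):
    """Hypothetical error type carrying a one-line, user-facing message."""


def load_hf_tokenizer(model, loader):
    """Load the tokenizer for an 'hf:<repo_id>' model string.

    loader is any callable that takes a repo ID and returns a tokenizer;
    in ttok it would be tokenizers.Tokenizer.from_pretrained.
    """
    repo_id = model[3:]  # strip the "hf:" prefix, as cli.py does
    try:
        return loader(repo_id)
    except Exception as ex:
        # Broad catch on purpose: which exception type surfaces may depend
        # on the installed tokenizers / huggingface_hub versions.
        raise TokenizerLoadError(
            f"Could not load a tokenizer for '{repo_id}' from Hugging Face. "
            "Check the repo ID, and authenticate if the repo is private or gated."
        ) from ex


def failing_loader(repo_id):
    # Stand-in for the real loader so the sketch runs offline.
    raise RuntimeError(f"404 Client Error: Not Found for {repo_id}")


try:
    load_hf_tokenizer("hf:TheBloke/Llama-2-70B-fp1621", failing_loader)
except TokenizerLoadError as ex:
    print(f"Error: {ex}")
```

Chaining with `from ex` keeps the original exception available for debugging while the user sees only the short message.
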