openai / tiktoken

tiktoken is a fast BPE tokeniser for use with OpenAI's models.
MIT License

Thread Panic when decoding token id 100256 and others with cl100k_base tokenizer #47

Status: Closed (minimaxir closed this issue 1 month ago)

minimaxir commented 1 year ago

Code example:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
enc.decode([100256])  # triggers the panic shown in the trace

Trace:

thread '<unnamed>' panicked at 'no entry found for key', src/lib.rs:210:37
---------------------------------------------------------------------------
PanicException                            Traceback (most recent call last)
/var/folders/m9/s4s3bdq96pn3dk13fbgpw6rm0000gn/T/ipykernel_9548/1299473396.py in
      1 enc = tiktoken.get_encoding("cl100k_base")
----> 2 enc.decode([100256])

/usr/local/lib/python3.9/site-packages/tiktoken/core.py in decode(self, tokens, errors)
    237         ```
    238         """
--> 239         return self._core_bpe.decode_bytes(tokens).decode("utf-8", errors=errors)
    240 
    241     def decode_single_token_bytes(self, token: int) -> bytes:

PanicException: no entry found for key

This also reproduces for token ids 100261 through 100275.

Even if these token ids are intentionally unassigned, decoding them should not cause a panic.

dbl001 commented 1 year ago

I get the same exception.

ults of the COVID-2. For this results. In the first-19 to the results of the study, the COVID-19, and a study, as the pandemic, the first-19 and the first to the first-CoV--19 and a same, we also been been been a significant. A. It is
---------------
thread '<unnamed>' panicked at 'no entry found for key', src/lib.rs:155:37
stack backtrace:
   0:        0x105835d42 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h8d94e552d95b28cc
   1:        0x105849f6a - core::fmt::write::h421d4212716e9716
   2:        0x105833bac - std::io::Write::write_fmt::hdc28b71c2d62dad8
   3:        0x105835b0a - std::sys_common::backtrace::print::he11eab6b959c3b5b
   4:        0x105836ee6 - std::panicking::default_hook::{{closure}}::ha68ba8cbe26bbbe3
   5:        0x105836c37 - std::panicking::default_hook::h5cf85224a4df5bc6
   6:        0x10583762d - std::panicking::rust_panic_with_hook::hed342721bf9addfa
   7:        0x1058373e3 - std::panicking::begin_panic_handler::{{closure}}::h3d9af89e51f2fba9
   8:        0x1058361d8 - std::sys_common::backtrace::__rust_end_short_backtrace::hfb9719355016e93f
   9:        0x1058370ad - _rust_begin_unwind
  10:        0x10585af43 - core::panicking::panic_fmt::h1965fc2159be50bb
  11:        0x10584911b - core::panicking::panic_display::h841c2aac0ae11b23
  12:        0x1058490cc - core::panicking::panic_str::ha2b2b46922a69871
  13:        0x10585af09 - core::option::expect_failed::h5dc600f0ba669ad7
  14:        0x1057739e4 - _tiktoken::CoreBPE::_decode_native::hf970f41e2ffb103d
  15:        0x10576624b - pyo3::marker::Python::allow_threads::h9399c4884f71c380
  16:        0x10577705d - _tiktoken::CoreBPE::decode_bytes::hac2ea10696677c55
  17:        0x10576e572 - std::panicking::try::hdddd1e2b25b9d596
  18:        0x10577816e - _tiktoken::_::<impl _tiktoken::CoreBPE>::__pymethod_decode_bytes__::h7364fbad820d3301
  19:        0x1017d9ecf - _method_vectorcall_FASTCALL_KEYWORDS
  20:        0x1018e83ae - __PyEval_EvalFrameDefault
  21:        0x1017ca7f6 - __PyFunction_Vectorcall
  22:        0x1018e83ae - __PyEval_EvalFrameDefault
  23:        0x1017ca7f6 - __PyFunction_Vectorcall
  24:        0x1019107db - _call_function
  25:        0x1018e1d84 - __PyEval_EvalFrameDefault
  26:        0x1018ddb91 - __PyEval_Vector
  27:        0x101966460 - _run_mod
  28:        0x101966225 - _pyrun_file
  29:        0x101965d76 - __PyRun_SimpleFileObject
  30:        0x10196569f - __PyRun_AnyFileObject
  31:        0x10198a978 - _pymain_run_file_obj
  32:        0x10198a305 - _pymain_run_file
  33:        0x101989b38 - _pymain_run_python
  34:        0x101989975 - _Py_RunMain
  35:        0x101762598 - _main
  36:     0x7ff809a49310 - <unknown>
Traceback (most recent call last):
  File "/Users/davidlaxer/nanoGPT/sample.py", line 93, in <module>
    print(decode(y[0].tolist()))
  File "/Users/davidlaxer/nanoGPT/sample.py", line 79, in <lambda>
    decode = lambda l: enc.decode(l)
  File "/Users/davidlaxer/anaconda3/envs/AI-Feynman/lib/python3.10/site-packages/tiktoken/core.py", line 239, in decode
    return self._core_bpe.decode_bytes(tokens).decode("utf-8", errors=errors)
pyo3_runtime.PanicException: no entry found for key

I'm running 'nanoGPT'

https://github.com/karpathy/nanoGPT

% RUST_BACKTRACE=full  python sample.py --out_dir=out --device='cpu' --compile=False

My error is in a list of 501 tokens. I'm not sure which one(s) are causing the exception.
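
One way to narrow down which ids in a long list are responsible is to probe each token on its own. The sketch below is a hypothetical helper, not part of tiktoken: `find_undecodable` and its `decode_one` parameter are names introduced here for illustration. It assumes the per-token decoder raises `KeyError` for an unknown id, which is what the docstring of `Encoding.decode_single_token_bytes` describes; verify that against the installed version.

```python
from typing import Callable, Iterable, List, Tuple


def find_undecodable(
    tokens: Iterable[int],
    decode_one: Callable[[int], bytes],
) -> List[Tuple[int, int]]:
    """Return (position, token_id) pairs for which decode_one raises KeyError.

    decode_one is any callable mapping a single token id to bytes.
    Assumption: it signals an unknown id with KeyError.
    """
    bad = []
    for pos, tok in enumerate(tokens):
        try:
            decode_one(tok)
        except KeyError:
            # Record where in the list the unknown id occurred.
            bad.append((pos, tok))
    return bad
```

With a tiktoken encoding `enc`, the call would look like `find_undecodable(y[0].tolist(), enc.decode_single_token_bytes)`. Probing one token at a time avoids the batch `decode` path that panics in the trace above.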

wzjin2017 commented 3 months ago

Any updates on this exception?

hauntsaninja commented 1 month ago

On tiktoken 0.8, this raises an ordinary Python exception (KeyError) instead of a panic.
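
Since the failure surfaces as a `KeyError`, callers who sample from a model and may hit unassigned ids can decode leniently instead of crashing. This is a minimal sketch under assumptions: `decode_lenient`, `decode_one`, and `placeholder` are hypothetical names, and the per-token decoder is assumed to raise `KeyError` for an unknown id.

```python
from typing import Callable, Iterable


def decode_lenient(
    tokens: Iterable[int],
    decode_one: Callable[[int], bytes],
    placeholder: bytes = b"",
) -> str:
    """Decode token ids to text, substituting placeholder for unknown ids.

    Assumption: decode_one raises KeyError when the id has no entry.
    """
    chunks = []
    for tok in tokens:
        try:
            chunks.append(decode_one(tok))
        except KeyError:
            chunks.append(placeholder)
    # Join the raw bytes first: a single token may hold only part of a
    # multi-byte UTF-8 character, so per-token decoding to str would corrupt text.
    return b"".join(chunks).decode("utf-8", errors="replace")
```

With a tiktoken encoding `enc`, this could be called as `decode_lenient(ids, enc.decode_single_token_bytes)`. Whether silently dropping unknown ids is acceptable depends on the application; passing a visible `placeholder` makes the gaps easier to spot.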