openai / tiktoken

tiktoken is a fast BPE tokeniser for use with OpenAI's models.
MIT License

Panic (stack overflow) when encoding a certain string #245

Open Crazytieguy opened 10 months ago

Crazytieguy commented 10 months ago

Hi, I'm getting a panic when trying to encode the attached file with the gpt-4 tokenizer. This is from the AMPS dataset that was published along with the MATH dataset. Backtrace:


called `Result::unwrap()` on an `Err` value: RuntimeError(StackOverflow)
stack backtrace:
   0: rust_begin_unwind
             at /rustc/79e9716c980570bfd1f666e3b16ac583f0168962/library/std/src/panicking.rs:597:5
   1: core::panicking::panic_fmt
             at /rustc/79e9716c980570bfd1f666e3b16ac583f0168962/library/core/src/panicking.rs:72:14
   2: core::result::unwrap_failed
             at /rustc/79e9716c980570bfd1f666e3b16ac583f0168962/library/core/src/result.rs:1652:5
   3: _tiktoken::CoreBPE::_encode_native
   4: _tiktoken::_::<impl _tiktoken::CoreBPE>::__pymethod_encode__
   5: pyo3::impl_::trampoline::fastcall_with_keywords
   6: _PyEval_EvalFrameDefault
             at /tmp/python-build.20230808162458.7883/Python-3.11.4/Python/ceval.c:5258:29
   7: _PyEval_EvalFrame
             at /tmp/python-build.20230808162458.7883/Python-3.11.4/./Include/internal/pycore_ceval.h:73:16
   8: _PyEval_Vector
             at /tmp/python-build.20230808162458.7883/Python-3.11.4/Python/ceval.c:6439:24
   9: _PyFunction_Vectorcall
             at /tmp/python-build.20230808162458.7883/Python-3.11.4/Objects/call.c:393:16
  10: _PyObject_VectorcallTstate
             at /tmp/python-build.20230808162458.7883/Python-3.11.4/./Include/internal/pycore_call.h:92:11
  11: method_vectorcall
             at /tmp/python-build.20230808162458.7883/Python-3.11.4/Objects/classobject.c:89:18
  12: do_call_core
             at /tmp/python-build.20230808162458.7883/Python-3.11.4/Python/ceval.c:7357:12
  13: _PyEval_EvalFrameDefault
             at /tmp/python-build.20230808162458.7883/Python-3.11.4/Python/ceval.c:5381:22
  14: _PyEval_EvalFrame
             at /tmp/python-build.20230808162458.7883/Python-3.11.4/./Include/internal/pycore_ceval.h:73:16
  15: _PyEval_Vector
             at /tmp/python-build.20230808162458.7883/Python-3.11.4/Python/ceval.c:6439:24
  16: _PyFunction_Vectorcall
             at /tmp/python-build.20230808162458.7883/Python-3.11.4/Objects/call.c:393:16
  17: do_call_core
             at /tmp/python-build.20230808162458.7883/Python-3.11.4/Python/ceval.c:7357:12
  18: _PyEval_EvalFrameDefault
             at /tmp/python-build.20230808162458.7883/Python-3.11.4/Python/ceval.c:5381:22
  19: _PyEval_EvalFrame
             at /tmp/python-build.20230808162458.7883/Python-3.11.4/./Include/internal/pycore_ceval.h:73:16
  20: _PyEval_Vector
             at /tmp/python-build.20230808162458.7883/Python-3.11.4/Python/ceval.c:6439:24
  21: _PyFunction_Vectorcall
             at /tmp/python-build.20230808162458.7883/Python-3.11.4/Objects/call.c:393:16
  22: _PyObject_VectorcallTstate
             at /tmp/python-build.20230808162458.7883/Python-3.11.4/./Include/internal/pycore_call.h:92:11
  23: method_vectorcall
             at /tmp/python-build.20230808162458.7883/Python-3.11.4/Objects/classobject.c:67:20
  24: thread_run
             at /tmp/python-build.20230808162458.7883/Python-3.11.4/./Modules/_threadmodule.c:1092
  25: pythread_wrapper
             at /tmp/python-build.20230808162458.7883/Python-3.11.4/Python/thread_pthread.h:241:5
  26: <unknown>
  27: <unknown>
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

l0rinc commented 9 months ago

I've checked every single JSON file in the MATH and AMPS pretraining datasets, and all of them pass this test:

import os

import tiktoken


def test_math_roundtrips():
    enc = tiktoken.get_encoding("cl100k_base")
    base_dir = '.../Downloads/amps'

    # Walk the dataset directory and round-trip every JSON file through the tokenizer.
    for dirpath, _, filenames in os.walk(base_dir):
        for filename in filenames:
            if filename.endswith(".json"):
                print(f'Checking {filename}!')

                with open(os.path.join(dirpath, filename), 'rb') as f:
                    content = f.read().decode('utf-8')
                    encoded_content = enc.encode(content)
                    decoded_content = enc.decode(encoded_content)

                    assert content == decoded_content, f"Roundtrip mismatch for {filename}"

Can you please attach the file and a simple reproducer that fails?

hauntsaninja commented 9 months ago

Here's a simple repro of the problem:

import tiktoken
enc = tiktoken.get_encoding("r50k_base")
enc.encode("^" * 1000000)

There's a stack overflow in the Rust regex library we use. I haven't yet gotten a chance to see if it's easy to fix the Rust (fancy-)regex library, but one workaround would be to raise a more specific exception, catch it, and fall back to using the Python regex library to split the string as done in this private method: https://github.com/openai/tiktoken/blob/main/tiktoken/core.py#L360
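
A minimal sketch of that workaround, assuming the native encoder raised a dedicated exception instead of panicking. `RegexStackOverflow`, `encode_with_fallback`, and the callables passed in are hypothetical names for illustration, not tiktoken's API. In tiktoken itself the split pattern would be the encoding's own, which needs the third-party `regex` module rather than the stdlib `re` used here.

```python
import re
from typing import Callable, List


class RegexStackOverflow(RuntimeError):
    """Hypothetical error a native encoder could raise instead of panicking."""


def encode_with_python_split(
    text: str,
    pattern: "re.Pattern",
    encode_piece: Callable[[str], List[int]],
) -> List[int]:
    """Split the text in Python, then encode each piece separately."""
    tokens: List[int] = []
    for match in pattern.finditer(text):
        tokens.extend(encode_piece(match.group()))
    return tokens


def encode_with_fallback(
    text: str,
    native_encode: Callable[[str], List[int]],
    pattern: "re.Pattern",
    encode_piece: Callable[[str], List[int]],
) -> List[int]:
    """Try the fast native path first; fall back to splitting in Python."""
    try:
        return native_encode(text)
    except RegexStackOverflow:
        # Assumption: the native side signals the overflow with this exception.
        return encode_with_python_split(text, pattern, encode_piece)
```

This only helps if the failure surfaces as an exception that Python can catch, which is the part that would have to change on the Rust side.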

l0rinc commented 9 months ago

Fixed it in my recently pushed PRs for cl100k_base (see https://github.com/openai/tiktoken/pull/234 and https://github.com/openai/tiktoken/pull/239), and backported the possessive quantifiers to the legacy encoding in https://github.com/openai/tiktoken/pull/258

Cherry-picking the PRs on top of each other not only passes, but is quite fast as well:

[image]

fedor57 commented 4 months ago

Hi, I get the same crash when trying to tokenize this web page:

https://www.mathauditor.com/2147483647-in-english.html

It contains the Roman numeral representation of the number 2147483647, which has about 2,147,483 consecutive letters M in a single source token.
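
To see where that count comes from, here is a small sketch that builds the numeral, assuming the page writes thousands as repeated M with no overbar notation. `to_roman` is an illustrative helper, not something taken from the page or from tiktoken.

```python
def to_roman(n: int) -> str:
    """Convert a positive integer to a Roman numeral, repeating M for thousands."""
    if n <= 0:
        raise ValueError("n must be positive")
    table = [
        (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
        (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
        (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
    ]
    parts = []
    for value, symbol in table:
        # divmod gives how many times this symbol repeats and what is left over.
        count, n = divmod(n, value)
        parts.append(symbol * count)
    return "".join(parts)


numeral = to_roman(2147483647)
run_of_m = numeral.count("M")  # 2147483647 // 1000 thousands, each written as one M
```

The numeral has no spaces or punctuation to split on, so the whole run reaches the tokenizer as one piece.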

thijs-hakkenberg commented 2 months ago

I have the same issue. According to ChatGPT, this is why we cannot catch the exception in Python: _The error message you are seeing is because a Rust panic unwinding to Python is not an actual Python exception, and thus cannot be caught by a standard Python exception handler. It completely aborts the Python interpreter.

This is an issue with the Rust-Python interoperability. A Python program can't catch Rust panics because Rust panics are designed to unwind the stack, cleaning up as they go, until they reach the application boundary, at which point the application aborts. In this case, the application boundary is the Rust-Python boundary, so the panic unwinds the Rust stack, crosses the boundary, and causes the Python interpreter to abort.

However, latest PyO3 versions have a feature that allows converting Rust panics into Python exceptions, but it's opt-in and has to be enabled in the Rust library. The Python dependency using PyO3 can then be rerun so that Rust panics will become catchable Python RuntimeError exceptions.

Unfortunately, it seems like the tiktoken library doesn't use a new enough PyO3 version or has this feature disabled, and so it doesn't convert Rust panics into Python exceptions. Therefore, you can't catch the panic in Python code and the Python interpreter aborts.

For now, you could try to ensure your code never causes a panic in tiktoken. For instance, by checking properties of the input before passing it to tiktoken methods that might cause a panic.

Otherwise, this is a pretty hard issue to work around from Python. The best way to resolve it would be to open an issue on the repository of the library causing the issue or ask the maintainer to upgrade the PyO3 dependency and enable panic conversion._

l0rinc commented 2 months ago

You could apply the PRs mentioned in https://github.com/openai/tiktoken/issues/245#issuecomment-1937894067 and build a custom tiktoken version that supports big tokens. @hauntsaninja, do you think we could merge some of those PRs instead?

fedor57 commented 2 months ago

Here is what I did: by testing, I found that a token length of about 200,000 characters is safe. Then, before calling tiktoken, I split the input into batches, called tiktoken separately on each batch, and concatenated the token arrays. Decoding the concatenated tokens still gives back the original string. The result is theoretically suboptimal for LLM use, but in practice it makes no difference.

Before that, I was searching for runs of 200,000 characters without spaces and inserting a batch break only where needed, but I later deleted that code because it wasn't truly useful.
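
A rough sketch of that batching approach, assuming fixed-size character batches. `encode_in_batches` is a hypothetical helper name, and the 200,000 default is only the value reported above. The safe limit probably depends on the machine and thread stack size, so treat it as something to measure.

```python
from typing import Callable, List


def encode_in_batches(
    text: str,
    encode: Callable[[str], List[int]],
    batch_size: int = 200_000,
) -> List[int]:
    """Encode text in fixed-size character batches and concatenate the tokens."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    tokens: List[int] = []
    for start in range(0, len(text), batch_size):
        # Slicing a Python str works on code points, so no character is cut in half.
        tokens.extend(encode(text[start:start + batch_size]))
    return tokens
```

With tiktoken this would be called as `encode_in_batches(text, enc.encode)`, where `enc` comes from `tiktoken.get_encoding(...)`. Tokens at a batch boundary may differ from what a single call would produce, which is the "theoretically suboptimal" part.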