Open Crazytieguy opened 10 months ago
I've checked every single JSON file in the MATH and AMPS pretraining datasets, and all of them pass this test.
import os

import tiktoken


def test_math_roundtrips():
    enc = tiktoken.get_encoding("cl100k_base")
    base_dir = '.../Downloads/amps'
    # Walk the dataset and check that every JSON file survives encode -> decode.
    for dirpath, _, filenames in os.walk(base_dir):
        for filename in filenames:
            if filename.endswith(".json"):
                print(f'Checking {filename}!')
                with open(os.path.join(dirpath, filename), 'rb') as f:
                    content = f.read().decode('utf-8')
                encoded_content = enc.encode(content)
                decoded_content = enc.decode(encoded_content)
                assert content == decoded_content, f"Roundtrip mismatch for {filename}"
Can you please attach the file and a simple reproducer that fails?
Here's a simple repro of the problem:
import tiktoken
enc = tiktoken.get_encoding("r50k_base")
enc.encode("^" * 1000000)
There's a stack overflow in the Rust regex library we use. I haven't yet gotten a chance to see if it's easy to fix the Rust (fancy-)regex library, but one workaround would be to raise a more specific exception, catch it, and fall back to using the Python regex library to split the string as done in this private method: https://github.com/openai/tiktoken/blob/main/tiktoken/core.py#L360
I fixed it in my recently pushed PRs for cl100k_base (see https://github.com/openai/tiktoken/pull/234 and https://github.com/openai/tiktoken/pull/239), and backported the possessive quantifiers to the legacy encoding in https://github.com/openai/tiktoken/pull/258
With the PRs cherry-picked on top of each other, the repro not only passes but is also quite fast.
Hi, I have the same crash when trying to tokenize this web page:
https://www.mathauditor.com/2147483647-in-english.html
It contains the Roman numeral representation of the number 2147483647, which has about 2,147,483 consecutive letters M in a single source token.
I have the same issue. According to ChatGPT, this is why we cannot catch the exception in Python: _The error message you are seeing is because a Rust panic unwinding to Python is not an actual Python exception, and thus cannot be caught by a standard Python exception handler. It completely aborts the Python interpreter.
This is an issue with the Rust-Python interoperability. A Python program can't catch Rust panics because Rust panics are designed to unwind the stack, cleaning up as they go, until they reach the application boundary, at which point the application aborts. In this case, the application boundary is the Rust-Python boundary, so the panic unwinds the Rust stack, crosses the boundary, and causes the Python interpreter to abort.
However, latest PyO3 versions have a feature that allows converting Rust panics into Python exceptions, but it's opt-in and has to be enabled in the Rust library. The Python dependency using PyO3 can then be rerun so that Rust panics will become catchable Python RuntimeError exceptions.
Unfortunately, it seems like the tiktoken library doesn't use a new enough PyO3 version or has this feature disabled, and so it doesn't convert Rust panics into Python exceptions. Therefore, you can't catch the panic in Python code and the Python interpreter aborts.
For now, you could try to ensure your code never causes a panic in tiktoken. For instance, by checking properties of the input before passing it to tiktoken methods that might cause a panic.
Otherwise, this is a pretty hard issue to work around from Python. The best way to resolve it would be to open an issue on the repository of the library causing the issue or ask the maintainer to upgrade the PyO3 dependency and enable panic conversion._
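A rough sketch of the kind of input check suggested above, assuming the problematic inputs are very long runs of one repeated character (as in the `"^" * 1000000` repro). The helper names and the `MAX_RUN` threshold are hypothetical and not part of tiktoken; the threshold is an arbitrary placeholder that would need tuning:

```python
from itertools import groupby

# Hypothetical threshold, not a documented tiktoken limit; tune it for your build.
MAX_RUN = 100_000


def longest_run(text: str) -> int:
    """Return the length of the longest run of a single repeated character."""
    return max((sum(1 for _ in group) for _, group in groupby(text)), default=0)


def is_probably_safe(text: str, max_run: int = MAX_RUN) -> bool:
    """Heuristic pre-check: flag inputs that contain a very long repeated-character run."""
    return longest_run(text) <= max_run
```

This is only a heuristic: it would flag the repro string, but other inputs that the splitting regex treats as one huge piece may not be caught.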
You could add the PRs mentioned in https://github.com/openai/tiktoken/issues/245#issuecomment-1937894067 and build a custom TikToken version that supports big tokens. @hauntsaninja, do you think we could merge some of those PRs instead?
What I did: I tested and found that a token length of about 200000 characters is safe. Then, before calling tiktoken, I split the input into batches, called tiktoken separately on each batch, and concatenated the token arrays. The result is still correct if you decode the tokenized string. It is theoretically suboptimal for LLM use, but in practice it makes no difference.
Before that I was searching for runs of 200000 characters without spaces, to insert a batch break where needed. I later deleted that code because it was not truly useful.
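A minimal sketch of that batching workaround, assuming the tokenizer is passed in as a plain callable (for example the `encode` method of a tiktoken encoding). The function name `encode_in_batches` and the default `batch_chars` are illustrative, not part of tiktoken:

```python
from typing import Callable, List


def encode_in_batches(
    text: str,
    encode: Callable[[str], List[int]],
    batch_chars: int = 200_000,  # assumed-safe size taken from the comment above
) -> List[int]:
    """Encode text in fixed-size character batches and concatenate the token lists."""
    if batch_chars <= 0:
        raise ValueError("batch_chars must be positive")
    tokens: List[int] = []
    for start in range(0, len(text), batch_chars):
        # Each slice is encoded independently, so no single call sees a huge piece.
        tokens.extend(encode(text[start:start + batch_chars]))
    return tokens
```

With tiktoken installed, usage would look something like `encode_in_batches(text, tiktoken.get_encoding("cl100k_base").encode)`. Because batches are cut at fixed character offsets, tokens at a batch boundary can differ from what a single call would produce, but decoding the concatenated tokens should still return the original text.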
Hi, I'm getting a panic when trying to encode the attached file with the gpt-4 tokenizer. This is from the AMPS dataset that was published along with the MATH dataset. Backtrace: