jimmc414 / 1filellm

Specify a GitHub or local repo, GitHub pull request, arXiv or Sci-Hub paper, YouTube transcript, or documentation URL on the web, and it is scraped into a text file and copied to the clipboard for easier LLM ingestion.
MIT License

Tiktoken core encoding error #13

Closed dickiesanders closed 3 months ago

dickiesanders commented 4 months ago

Disallowing or allowing special characters does not appear to work.


```
called `Result::unwrap()` on an `Err` value: RuntimeError(StackOverflow)
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "/Users/richard.sanders/Documents/ulterior/1filellm/onefilellm.py", line 605, in <module>
    main()
  File "/Users/richard.sanders/Documents/ulterior/1filellm/onefilellm.py", line 592, in main
    compressed_token_count = get_token_count(compressed_text)
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/richard.sanders/Documents/ulterior/1filellm/onefilellm.py", line 236, in get_token_count
    tokens = enc.encode(text, disallowed_special=disallowed_special)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/richard.sanders/Documents/ulterior/1filellm/.venv/lib/python3.11/site-packages/tiktoken/core.py", line 124, in encode
    return self._core_bpe.encode(text, allowed_special)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: RuntimeError(StackOverflow)
```

I suggest adding chunking to the `get_token_count` process so very large inputs are encoded in smaller pieces.
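One possible sketch of that chunking suggestion (hypothetical; `count_tokens_chunked`, the injectable `encode` callable, and the chunk size are all assumptions, not code from this repo): encode the text in fixed-size slices so each call into the tiktoken Rust core stays small, at the cost of a slightly approximate count when a token straddles a slice boundary.

```python
# Hypothetical sketch of chunked token counting; not the project's implementation.
def count_tokens_chunked(text: str, encode, chunk_size: int = 100_000) -> int:
    """Count tokens by encoding fixed-size slices of `text`.

    `encode` is any callable that returns a sequence of tokens for a string,
    e.g. a tiktoken encoder's `encode` method. Encoding per slice keeps each
    call small, but a token split across a slice boundary may be counted
    twice, so the total is an approximation.
    """
    total = 0
    for start in range(0, len(text), chunk_size):
        total += len(encode(text[start:start + chunk_size]))
    return total

# Usage with tiktoken (requires `pip install tiktoken`):
#   import tiktoken
#   enc = tiktoken.get_encoding("cl100k_base")
#   n = count_tokens_chunked(big_text, lambda s: enc.encode(s, disallowed_special=()))
```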
jimmc414 commented 4 months ago

Can you provide the URL you were using?

jimmc414 commented 3 months ago

Merged. Thanks for your work!