knuddelsgmbh / jtokkit

JTokkit is a Java tokenizer library designed for use with OpenAI models.
https://jtokkit.knuddels.de/
MIT License
516 stars 38 forks

Added Encoding.calcCharCountForTokens method #81

Closed: dimafa closed this 3 months ago

dimafa commented 5 months ago

A very common use case for token counting is chunking a long text so that each chunk fits in a model's context window. To use the jtokkit library efficiently for this purpose, we need to be able to count the number of characters that correspond to a given token count. I added an Encoding.calcCharCountForTokens method that does that. Please review and accept the pull request if it makes sense.
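To illustrate the use case, here is a minimal, self-contained sketch of the idea. It is not the code from this pull request and it does not call jtokkit: `CharCountSketch`, its `tokenize` method, and the `chunk` helper are hypothetical names, and the tokenizer is a toy stand-in in which each token is one word plus its leading whitespace. Only the method name `calcCharCountForTokens` is taken from the description above; its signature here is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: NOT the jtokkit implementation and NOT the PR's code.
final class CharCountSketch {

    // Toy stand-in for a real tokenizer (assumption): each token is a run of
    // whitespace followed by a run of non-whitespace characters.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int start = i;
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            tokens.add(text.substring(start, i));
        }
        return tokens;
    }

    // Number of leading characters of `text` covered by its first `maxTokens` tokens.
    static int calcCharCountForTokens(String text, int maxTokens) {
        int chars = 0;
        int used = 0;
        for (String token : tokenize(text)) {
            if (used >= maxTokens) {
                break;
            }
            chars += token.length();
            used++;
        }
        return chars;
    }

    // Splits `text` into consecutive chunks of at most `maxTokensPerChunk` tokens each.
    static List<String> chunk(String text, int maxTokensPerChunk) {
        if (maxTokensPerChunk <= 0) {
            throw new IllegalArgumentException("maxTokensPerChunk must be positive");
        }
        List<String> chunks = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            // Every non-empty remainder yields at least one token, so n > 0 here.
            int n = calcCharCountForTokens(text.substring(pos), maxTokensPerChunk);
            chunks.add(text.substring(pos, pos + n));
            pos += n;
        }
        return chunks;
    }
}
```

With this toy tokenizer, `CharCountSketch.chunk("one two three four five", 2)` splits the text into `"one two"`, `" three four"`, and `" five"`. The sketch re-tokenizes the remaining text for every chunk, which is simple but quadratic in the worst case. A real implementation inside the library could presumably avoid that by tracking character offsets during a single encoding pass.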

tox-p commented 5 months ago

Hey, we are currently optimizing the existing implementation, which will probably make your PR out of date once those changes are merged. Since those changes are rather high-impact, I would like to get the optimizations merged before integrating your proposal.

I will get back to you once the optimizations are done.