knuddelsgmbh / jtokkit

JTokkit is a Java tokenizer library designed for use with OpenAI models.
https://jtokkit.knuddels.de/
MIT License
554 stars 42 forks

Is there a way to only load the encoding needed? #23

Closed blackdiz closed 1 year ago

blackdiz commented 1 year ago

Hey there, thanks for your hard work. We're interested in using this library on mobile, but we noticed that the initialization process takes some time. We dug into the code and saw that DefaultEncodingRegistry.initializeDefaultEncodings() loads all the encodings. We only require the r50k_base.tiktoken encoding, so is there a way to load just that one and speed up the initialization?

tox-p commented 1 year ago

Currently not, but I would be open to adding such functionality.

Would a new LazyEncodingRegistry fit your needs? It would not initialize any encodings on construction; instead, each encoding would be loaded lazily on the first getEncoding call for that encoding.
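The lazy behaviour proposed here could look roughly like the sketch below. It is a minimal illustration under assumptions, not JTokkit's code: `LazyRegistrySketch`, its nested `Encoding` stand-in, the string keys, and the `loadCount` counter are all hypothetical names chosen for the example. Nothing is loaded at construction, and each encoding is loaded at most once, on the first request for it.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical sketch: these names are illustrative, not JTokkit's actual API.
final class LazyRegistrySketch {

    // Stand-in for a loaded encoding (e.g. the parsed contents of r50k_base.tiktoken).
    static final class Encoding {
        final String name;

        Encoding(String name) {
            this.name = name;
        }
    }

    // Cache of encodings that have already been loaded; empty after construction.
    private final Map<String, Encoding> loaded = new ConcurrentHashMap<>();

    // The (assumed expensive) loading step, supplied by the caller.
    private final Function<String, Encoding> loader;

    // Counts how often the loader actually ran, to make the laziness observable.
    final AtomicInteger loadCount = new AtomicInteger();

    LazyRegistrySketch(Function<String, Encoding> loader) {
        this.loader = loader;
    }

    // Loads the encoding on first request and returns the cached instance afterwards.
    Encoding getEncoding(String name) {
        return loaded.computeIfAbsent(name, n -> {
            loadCount.incrementAndGet();
            return loader.apply(n);
        });
    }

    public static void main(String[] args) {
        LazyRegistrySketch registry = new LazyRegistrySketch(Encoding::new);
        registry.getEncoding("r50k_base");
        registry.getEncoding("r50k_base");
        System.out.println(registry.loadCount.get()); // prints 1
    }
}
```

`ConcurrentHashMap.computeIfAbsent` is used so that concurrent first calls for the same encoding still trigger only one load; a plain `HashMap` with an explicit check would do for single-threaded use.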

blackdiz commented 1 year ago

Sure, that would be much appreciated!

tox-p commented 1 year ago

I am currently a little bit busy :) Could you open a PR with the change? If I am not mistaken, it should be pretty straightforward: extract the common functionality of DefaultEncodingRegistry into an AbstractEncodingRegistry, rename DefaultEncodingRegistry to EagerEncodingRegistry, create the LazyEncodingRegistry alongside it, and expose it via a new newLazyEncodingFactory in the Encodings class.
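The refactoring outlined in this comment might be structured roughly as follows. This is a simplified sketch under assumptions, not the code that was eventually merged: encodings are represented as plain strings, `DEFAULT_ENCODINGS` is a stand-in list, the `loads` counter exists only to make the behaviour visible, and `newEagerEncodingFactory` is a hypothetical counterpart to the `newLazyEncodingFactory` name proposed above.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative only: simplified stand-ins, not JTokkit's real classes or signatures.
final class RegistryRefactorSketch {

    // Stand-in for the set of encodings the library ships with.
    static final List<String> DEFAULT_ENCODINGS = List.of("r50k_base", "p50k_base", "cl100k_base");

    // Common functionality shared by the eager and the lazy registry.
    abstract static class AbstractEncodingRegistry {
        private final Map<String, String> encodings = new ConcurrentHashMap<>();

        // Counts actual load operations (for demonstration only).
        final AtomicInteger loads = new AtomicInteger();

        // Loads and stores an encoding unless it is already registered.
        protected void register(String name) {
            encodings.computeIfAbsent(name, n -> {
                loads.incrementAndGet();
                return "encoding:" + n; // placeholder for the real, expensive load
            });
        }

        // Returns the registered encoding, or null if none is registered under that name.
        public String getEncoding(String name) {
            return encodings.get(name);
        }
    }

    // Former "default" behaviour: every encoding is loaded at construction time.
    static final class EagerEncodingRegistry extends AbstractEncodingRegistry {
        EagerEncodingRegistry() {
            DEFAULT_ENCODINGS.forEach(this::register);
        }
    }

    // New behaviour: an encoding is loaded only when it is first requested.
    static final class LazyEncodingRegistry extends AbstractEncodingRegistry {
        @Override
        public String getEncoding(String name) {
            if (DEFAULT_ENCODINGS.contains(name)) {
                register(name);
            }
            return super.getEncoding(name);
        }
    }

    // Factory methods, standing in for the ones that would live in the Encodings class.
    static AbstractEncodingRegistry newEagerEncodingFactory() {
        return new EagerEncodingRegistry();
    }

    static AbstractEncodingRegistry newLazyEncodingFactory() {
        return new LazyEncodingRegistry();
    }
}
```

With this split, callers that need only one encoding pay for one load, while existing callers can keep the eager behaviour unchanged.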

blackdiz commented 1 year ago

OK, I don't have any experience contributing to open-source projects, but I'll give it a try.

tox-p commented 1 year ago

Thanks for the implementation :blush: This feature has been released as part of 0.5.0 and should soon be available on Maven Central.