knuddelsgmbh / jtokkit

JTokkit is a Java tokenizer library designed for use with OpenAI models.
https://jtokkit.knuddels.de/
MIT License
518 stars, 38 forks

Way to only include encodings needed #29

Open phazei opened 1 year ago

phazei commented 1 year ago

I saw this closed issue earlier: https://github.com/knuddelsgmbh/jtokkit/issues/23 This request is different.

I was thinking about this in terms of the files that get bundled into a project. If I'm only going to use one of the encodings, having the others adds an unneeded 2 MB to my project.

It would be nice to have builds along these lines:

implementation 'com.knuddels:jtokkit-full:0.5.0'
implementation 'com.knuddels:jtokkit-bare:0.5.0'
implementation 'com.knuddels:jtokkit-enc100k:0.5.0'
implementation 'com.knuddels:jtokkit-enc50k:0.5.0'

Here "full" would be the current artifact. Otherwise, someone could combine "bare" with only the encodings they need, like so:

implementation 'com.knuddels:jtokkit-bare:0.5.0' // always required
implementation 'com.knuddels:jtokkit-enc100k:0.5.0' // optional, as many individually as desired

Just an idea that would be nice.

tox-p commented 1 year ago

Hey, thanks for your suggestion! :) I thought about this when writing the library. For example, tiktoken uses a different mechanism: it loads the required vocabulary files on demand over HTTP and then caches them locally. To avoid requiring JTokkit users to specify a storage adapter, I opted for the classpath approach.

Your approach would also work nicely without needing local storage, but it would considerably increase the complexity of maintaining and using the library. There is a simple contract that the EncodingRegistry follows: getEncoding(EncodingType) guarantees to return a valid encoding. This contract would no longer hold with your proposed change.
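To make the contract concern concrete, here is a minimal sketch. All types in it (`EncodingKind`, `TokenEncoding`, `BundledRegistry`, `ModularRegistry`) are hypothetical stand-ins invented for illustration, not JTokkit's actual classes. It assumes that a split-artifact registry could only know about the encodings whose modules are installed, so its lookup would have to return an `Optional` (or throw) instead of always returning a value.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.Optional;

public class RegistryContractSketch {

    // Hypothetical stand-ins for illustration; these are NOT JTokkit's real types.
    enum EncodingKind { ENC_50K, ENC_100K }

    interface TokenEncoding {
        String name();
    }

    // Models the current "full" artifact: every vocabulary is bundled,
    // so a lookup can promise a non-null result for every enum constant.
    static final class BundledRegistry {
        private final Map<EncodingKind, TokenEncoding> encodings = new EnumMap<>(EncodingKind.class);

        BundledRegistry() {
            for (EncodingKind kind : EncodingKind.values()) {
                encodings.put(kind, kind::name);
            }
        }

        TokenEncoding getEncoding(EncodingKind kind) {
            return encodings.get(kind); // populated for all constants in the constructor
        }
    }

    // Models the proposed split artifacts: only installed encodings are known,
    // so the lookup has to admit that the requested encoding may be missing.
    static final class ModularRegistry {
        private final Map<EncodingKind, TokenEncoding> encodings = new EnumMap<>(EncodingKind.class);

        ModularRegistry(EncodingKind... installed) {
            for (EncodingKind kind : installed) {
                encodings.put(kind, kind::name);
            }
        }

        Optional<TokenEncoding> getEncoding(EncodingKind kind) {
            return Optional.ofNullable(encodings.get(kind));
        }
    }

    public static void main(String[] args) {
        BundledRegistry full = new BundledRegistry();
        System.out.println(full.getEncoding(EncodingKind.ENC_50K).name()); // ENC_50K

        ModularRegistry modular = new ModularRegistry(EncodingKind.ENC_100K);
        System.out.println(modular.getEncoding(EncodingKind.ENC_100K).isPresent()); // true
        System.out.println(modular.getEncoding(EncodingKind.ENC_50K).isPresent());  // false
    }
}
```

In the modular variant every caller has to handle the missing case, which is the extra usage complexity referred to above.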

JTokkit is more targeted toward server use than client use. In that case, I don't think optimizing for jar size yields a meaningful benefit.

Therefore, I wonder if making that change would be beneficial overall. Let's leave this issue open and see if there is any additional interest in this feature. I would especially be interested in hearing about real-world problems where the increased file size would be a blocker for using JTokkit in production.

phazei commented 1 year ago

I'm specifically using this in an Android app to count tokens before sending messages and to keep track of the total tokens used. It will also be used to calculate and manage the message history size. There's no backend, so library size is important for my use case. I just put out the first release a couple of days ago: https://github.com/phazei/dynamicGPTChat
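
For readers unfamiliar with this kind of use, the history-management part could look roughly like the sketch below. It is an illustration under assumptions, not code from the app: `trimToBudget` is a hypothetical helper, and the token counter is passed in as a function so that a real tokenizer can be plugged in. The word-based counter in `main` is only a placeholder and does not produce real token counts.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.ToIntFunction;

public class HistoryBudgetSketch {

    // Hypothetical helper: keeps the most recent messages whose combined
    // token count fits within maxTokens, preserving chronological order.
    static List<String> trimToBudget(List<String> messages, int maxTokens,
                                     ToIntFunction<String> tokenCounter) {
        Deque<String> kept = new ArrayDeque<>();
        int used = 0;
        // Walk from newest to oldest and stop at the first message that no longer fits.
        for (int i = messages.size() - 1; i >= 0; i--) {
            String message = messages.get(i);
            int cost = tokenCounter.applyAsInt(message);
            if (used + cost > maxTokens) {
                break;
            }
            kept.addFirst(message);
            used += cost;
        }
        return new ArrayList<>(kept);
    }

    public static void main(String[] args) {
        // Placeholder counter: whitespace-separated words, NOT real tokens.
        // A real app would delegate to the tokenizer's counting call here.
        ToIntFunction<String> wordCounter =
                s -> s.trim().isEmpty() ? 0 : s.trim().split("\\s+").length;

        List<String> history = List.of("hello there", "how are you today", "fine thanks");
        System.out.println(trimToBudget(history, 6, wordCounter));
        // prints [how are you today, fine thanks]
    }
}
```

Because the counter is injected, the trimming logic stays independent of which encoding is bundled with the app.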