Status: Open. Opened by phazei 1 year ago.
Hey, thanks for your suggestion! :) I thought about this when writing the library. For example, tiktoken uses a different mechanism: it loads the required vocabulary files on demand via HTTP requests and then caches them locally. To avoid requiring JTokkit users to specify a storage adapter, I opted for the classpath approach.
Your approach would also work nicely without needing local storage, but it would considerably increase the complexity of maintaining and using the library. There is a simple contract that the EncodingRegistry follows: getEncoding(EncodingType) guarantees to return a valid encoding. This contract would no longer hold with your proposed change.
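To make the contract concern concrete, here is a minimal sketch of the difference. The types below (ContractSketch, BundledRegistry, ModularRegistry, and the local EncodingType and Encoding) are hypothetical stand-ins written for illustration, not JTokkit's actual classes. The assumption is that a modular build could ship without some vocabularies, so a lookup would have to be allowed to come back empty.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical stand-ins for illustration only; these are NOT JTokkit's real classes.
class ContractSketch {
    enum EncodingType { CL100K_BASE, P50K_BASE }

    interface Encoding { String name(); }

    // Bundled design: every enum constant is registered up front,
    // so getEncoding can promise a non-null result.
    static final class BundledRegistry {
        private final Map<EncodingType, Encoding> encodings = new EnumMap<>(EncodingType.class);

        BundledRegistry() {
            for (EncodingType type : EncodingType.values()) {
                encodings.put(type, type::name);
            }
        }

        Encoding getEncoding(EncodingType type) {
            return encodings.get(type);
        }
    }

    // Modular design (assumed): only the installed encodings are registered,
    // so the return type has to admit that an encoding may be missing.
    static final class ModularRegistry {
        private final Map<EncodingType, Encoding> encodings = new EnumMap<>(EncodingType.class);

        ModularRegistry(EncodingType... installed) {
            for (EncodingType type : installed) {
                encodings.put(type, type::name);
            }
        }

        Optional<Encoding> getEncoding(EncodingType type) {
            return Optional.ofNullable(encodings.get(type));
        }
    }

    public static void main(String[] args) {
        BundledRegistry bundled = new BundledRegistry();
        System.out.println(bundled.getEncoding(EncodingType.P50K_BASE).name());

        ModularRegistry modular = new ModularRegistry(EncodingType.CL100K_BASE);
        System.out.println(modular.getEncoding(EncodingType.CL100K_BASE).isPresent());
        System.out.println(modular.getEncoding(EncodingType.P50K_BASE).isPresent());
    }
}
```

In the modular variant every caller has to handle the empty case, which is the added usage complexity mentioned above.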
JTokkit is targeted more toward server use than client use. In a server setting, I don't think optimizing for jar size yields a meaningful benefit.
Therefore, I wonder if making that change would be beneficial overall. Let's leave this issue open and see if there is any additional interest in this feature. I would especially be interested in hearing about real-world problems where the increased file size would be a blocker for using JTokkit in production.
I'm specifically using this in an Android app to calculate tokens before sending messages and to keep track of the total tokens used. It will also be used to calculate and manage message history size. There's no backend, so library size is important for my use case. I just put out the first release a couple of days ago. https://github.com/phazei/dynamicGPTChat
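For context, managing history size by token count could look roughly like the sketch below. It is not taken from the app: HistoryTrimSketch and trimToBudget are hypothetical names, and the token counter is passed in as a plain function. In real use it would presumably be backed by JTokkit's Encoding.countTokens; here a whitespace word count stands in so the example runs without any dependency.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;
import java.util.function.ToIntFunction;

// Hypothetical helper for illustration; not part of JTokkit or the app.
class HistoryTrimSketch {

    // Walks the history from newest to oldest and keeps messages
    // until adding the next one would exceed the token budget.
    static List<String> trimToBudget(List<String> messages, int maxTokens,
                                     ToIntFunction<String> countTokens) {
        Deque<String> kept = new ArrayDeque<>();
        int total = 0;
        for (int i = messages.size() - 1; i >= 0; i--) {
            int cost = countTokens.applyAsInt(messages.get(i));
            if (total + cost > maxTokens) {
                break;
            }
            kept.addFirst(messages.get(i)); // preserve chronological order
            total += cost;
        }
        return new ArrayList<>(kept);
    }

    public static void main(String[] args) {
        // Stand-in counter (assumption): one token per whitespace-separated word.
        ToIntFunction<String> wordCount = s -> s.trim().split("\\s+").length;
        List<String> history = Arrays.asList("hello there", "how are you today", "fine thanks");
        System.out.println(trimToBudget(history, 6, wordCount));
    }
}
```

With the stand-in counter and a budget of 6, the oldest message is dropped and the two most recent ones are kept.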
I saw this closed issue earlier: https://github.com/knuddelsgmbh/jtokkit/issues/23 This request is different.
I was thinking about this in terms of including the files in the project. If I'm only going to use one of the encodings, having the others adds an unneeded 2 MB to my project.
Having builds along these lines:
    implementation 'com.knuddels:jtokkit-full:0.5.0'
    implementation 'com.knuddels:jtokkit-bare:0.5.0'
    implementation 'com.knuddels:jtokkit-enc100k:0.5.0'
    implementation 'com.knuddels:jtokkit-enc50k:0.5.0'
Here "full" would be the current artifact. Alternatively, someone could combine the bare artifact with only the encodings they need, like so:
    implementation 'com.knuddels:jtokkit-bare:0.5.0' // always required
    implementation 'com.knuddels:jtokkit-enc100k:0.5.0' // optional, as many individually as desired
Just an idea that would be nice to have.