knuddelsgmbh / jtokkit

JTokkit is a Java tokenizer library designed for use with OpenAI models.
https://jtokkit.knuddels.de/
MIT License

CPU usage 300% #6

Closed PlexPt closed 1 year ago

PlexPt commented 1 year ago

code

  ExecutorService service = Executors.newFixedThreadPool(100);
  for (int i = 0; i < 100; i++) {
      service.execute(() -> tokens());
  }

  private static void tokens() {
      EncodingRegistry registry = Encodings.newDefaultEncodingRegistry();
      Encoding enc = registry.getEncoding(EncodingType.CL100K_BASE);
      int tokens = enc.encode("Long text..... 600+ words").size();
  }
PlexPt commented 1 year ago

Implementing the same code with tiktoken does not have this problem.

tox-p commented 1 year ago

I fail to see what the "problem" is. You are creating 100 platform threads, so of course your CPU will spike accordingly for a CPU-bound task like encoding.

The same happens in tiktoken, but I assume you used encode instead of the multithreaded encode_batch. Take a look at the benchmark directory to see the parameters for a fair comparison.

PlexPt commented 1 year ago

I noticed that my EncodingRegistry is not created as a singleton, which causes huge additional performance overhead.

tox-p commented 1 year ago

You are creating a new encoding registry for every task, which leads to the vocabularies being ingested for every thread. The EncodingRegistry (as well as the Encoding, by the way) is fully thread-safe, so just create it once and share the instance across all of your threads.
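
For reference, a minimal sketch of that shared-instance setup (the class name, the 100-thread pool, and the sample text simply mirror the snippet above; the import paths assume the standard JTokkit packages):

  import com.knuddels.jtokkit.Encodings;
  import com.knuddels.jtokkit.api.Encoding;
  import com.knuddels.jtokkit.api.EncodingRegistry;
  import com.knuddels.jtokkit.api.EncodingType;

  import java.util.concurrent.ExecutorService;
  import java.util.concurrent.Executors;

  public class SharedEncodingExample {
      // Created once for the whole application; both types are thread-safe
      private static final EncodingRegistry REGISTRY = Encodings.newDefaultEncodingRegistry();
      private static final Encoding ENCODING = REGISTRY.getEncoding(EncodingType.CL100K_BASE);

      public static void main(String[] args) {
          ExecutorService service = Executors.newFixedThreadPool(100);
          for (int i = 0; i < 100; i++) {
              service.execute(SharedEncodingExample::tokens);
          }
          service.shutdown();
      }

      private static void tokens() {
          // Reuses the shared Encoding; no vocabulary ingestion per task
          int tokens = ENCODING.encode("Long text..... 600+ words").size();
      }
  }

With this layout the vocabulary is loaded exactly once at class initialization, and the per-task work is just the encode call itself.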