knuddelsgmbh / jtokkit

JTokkit is a Java tokenizer library designed for use with OpenAI models.
https://jtokkit.knuddels.de/
MIT License

How to split a long text of 9999 tokens into 1000-token chunks? #1

Closed entroxu closed 1 year ago

entroxu commented 1 year ago

How can I split a long text of 9999 tokens into chunks of 1000 tokens each?

tox-p commented 1 year ago

This depends heavily on your use case. For example, you may want to split on paragraph boundaries so that each chunk retains the meaning of its paragraphs. Take a look at the OpenAI Cookbook or the chatgpt-retrieval-plugin for some inspiration on how to chunk.
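As a rough illustration of splitting on paragraph boundaries, here is a minimal sketch. It is not taken from the OpenAI Cookbook or the chatgpt-retrieval-plugin. The class name `ParagraphChunker`, the blank-line paragraph rule, and the `countTokens` parameter are all assumptions made for this example; with JTokkit you would probably pass something like `enc::countTokens`, while the demo below uses a plain word counter as a stand-in.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToIntFunction;

public class ParagraphChunker {

    /**
     * Groups whole paragraphs into chunks of roughly at most maxTokens tokens.
     * Paragraphs are assumed to be separated by blank lines.
     * A single paragraph larger than maxTokens is kept as its own oversized chunk.
     * The paragraph separator added when joining is not counted, so the real
     * token count of a chunk can be slightly higher than the tracked sum.
     */
    public static List<String> chunk(String text, int maxTokens, ToIntFunction<String> countTokens) {
        List<String> chunks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int currentTokens = 0;

        for (String paragraph : text.split("\\n\\s*\\n")) {
            String p = paragraph.trim();
            if (p.isEmpty()) {
                continue;
            }
            int n = countTokens.applyAsInt(p);

            // Close the current chunk if this paragraph would exceed the budget.
            if (currentTokens > 0 && currentTokens + n > maxTokens) {
                chunks.add(current.toString());
                current.setLength(0);
                currentTokens = 0;
            }
            if (current.length() > 0) {
                current.append("\n\n");
            }
            current.append(p);
            currentTokens += n;
        }

        // Emit whatever is left after the last paragraph.
        if (current.length() > 0) {
            chunks.add(current.toString());
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Stand-in token counter (whitespace-separated words), NOT a real tokenizer.
        ToIntFunction<String> wordCounter = s -> s.trim().split("\\s+").length;
        String text = "one two three\n\nfour five\n\nsix seven eight nine\n\nten";
        for (String c : chunk(text, 5, wordCounter)) {
            System.out.println("--- chunk ---");
            System.out.println(c);
        }
    }
}
```

Keeping the token counter as a parameter means the grouping logic can be tried out without a tokenizer on the classpath.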

A naive first approach is to chunk the token list without regard for text boundaries. Be aware that this can cut sentences, or even words, in half, so it may not yield optimal results for semantic search and similar uses.

final var enc = registry.getEncoding(...);
final var tokens = enc.encode(myText);
final var temporaryList = new ArrayList<Integer>();
for (final var token : tokens) {
    if (temporaryList.size() >= 1000) {
        // do something with your 1000 token chunk
        temporaryList.clear();
    }

    temporaryList.add(token);
}

// do something with the remainder of tokens in temporaryList
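The same naive idea can be wrapped in a small reusable helper that returns all chunks at once. This is only a sketch: the class name `TokenChunker` and its `chunk` method are made up for this example and are not part of JTokkit. It assumes the encoded tokens are available as a `List<Integer>`, and the demo uses dummy token IDs in place of `enc.encode(myText)`.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenChunker {

    /** Splits token IDs into consecutive chunks of at most chunkSize tokens. */
    public static List<List<Integer>> chunk(List<Integer> tokens, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<List<Integer>> chunks = new ArrayList<>();
        for (int start = 0; start < tokens.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, tokens.size());
            // Copy the sublist so each chunk is independent of the input list.
            chunks.add(new ArrayList<>(tokens.subList(start, end)));
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Stand-in for enc.encode(myText): 9999 dummy token IDs.
        List<Integer> tokens = new ArrayList<>();
        for (int i = 0; i < 9999; i++) {
            tokens.add(i);
        }
        List<List<Integer>> chunks = chunk(tokens, 1000);
        int lastSize = chunks.get(chunks.size() - 1).size();
        // prints "10 chunks, last has 999 tokens"
        System.out.println(chunks.size() + " chunks, last has " + lastSize + " tokens");
    }
}
```

Each chunk can then be turned back into text with the encoding's decode method, if you need strings rather than token IDs. Note that a chunk cut at an arbitrary token may not decode to clean text at its edges.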

entroxu commented 1 year ago

Thank you. I'm now translating this file into Java code: https://github.com/openai/chatgpt-retrieval-plugin/blob/main/services/chunks.py

def get_text_chunks(……