How to split a long text of 9999 tokens into 1000-token chunks?

entroxu commented 1 year ago

How can I split a long text of 9999 tokens into text chunks of 1000 tokens each?

tox-p commented 1 year ago

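This depends heavily on your use case. For example, you may want to split on paragraph boundaries so that each chunk keeps the meaning of its paragraph. Take a look at the OpenAI Cookbook or the chatgpt-retrieval-plugin for some inspiration on how to chunk:

https://github.com/openai/openai-cookbook/blob/main/examples/Embedding_long_inputs.ipynb

https://github.com/openai/chatgpt-retrieval-plugin/blob/main/services/chunks.py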

A naive first approach could be simply chunking the tokens without regard for text boundaries. But be aware that this may not yield optimal results when used for semantic search or similar.

final var enc = registry.getEncoding(...);
final var tokens = enc.encode(myText);
final var temporaryList = new ArrayList<Integer>();
for (final var token : tokens) {
    if (temporaryList.size() >= 1000) {
        // do something with your 1000 token chunk
        temporaryList.clear();
    }

    temporaryList.add(token);
}

// do something with the remainder of tokens in temporaryList
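
The loop above can also be packaged as a small reusable helper. The following is only a minimal sketch, assuming the encoded tokens are available as a List<Integer> (which is how the loop above iterates over them). TokenChunker and chunk are hypothetical names chosen for illustration and are not part of JTokkit.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenChunker {

    // Splits token ids into consecutive chunks of at most chunkSize tokens.
    // The last chunk holds the remainder and may be shorter.
    // Hypothetical helper for illustration, not a JTokkit API.
    public static List<List<Integer>> chunk(List<Integer> tokens, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<List<Integer>> chunks = new ArrayList<>();
        for (int start = 0; start < tokens.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, tokens.size());
            // Copy the slice so each chunk is independent of the source list.
            chunks.add(new ArrayList<>(tokens.subList(start, end)));
        }
        return chunks;
    }
}
```

For 9999 tokens and a chunk size of 1000 this yields ten chunks: nine with 1000 tokens and a final one with 999. Each chunk can then be passed to the encoding's decode method to get text back, keeping in mind that a cut at an arbitrary token position may fall in the middle of a word or character.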

entroxu commented 1 year ago

This depends heavily on your use case. For example, you may want to split on paragraph boundaries to retain the meaning of the paragraph. Take a look at the OpenAI Cookbook or the chatgpt-retrieval-plugin for some inspiration on how to chunk:

https://github.com/openai/openai-cookbook/blob/main/examples/Embedding_long_inputs.ipynb

https://github.com/openai/chatgpt-retrieval-plugin/blob/main/services/chunks.py

Thank you. I'm translating this page into Java now: https://github.com/openai/chatgpt-retrieval-plugin/blob/main/services/chunks.py

def get_text_chunks(……
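
For the paragraph-boundary approach suggested earlier in the thread, one possible shape in Java is sketched below. This is not a port of get_text_chunks from chunks.py, only a simplified illustration of the idea. ParagraphChunker, chunkByParagraph and the tokenCounter parameter are hypothetical names; with JTokkit the counter could presumably be something like text -> enc.countTokens(text).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToIntFunction;

public class ParagraphChunker {

    // Groups paragraphs (separated by blank lines) into chunks whose token
    // count, as reported by tokenCounter, stays within maxTokens where possible.
    // A single paragraph larger than maxTokens is emitted as its own oversized
    // chunk; splitting it further is left out of this sketch.
    public static List<String> chunkByParagraph(String text, int maxTokens,
                                                ToIntFunction<String> tokenCounter) {
        List<String> chunks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String paragraph : text.split("\\n\\s*\\n")) {
            String trimmed = paragraph.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            String candidate = current.length() == 0
                    ? trimmed
                    : current + "\n\n" + trimmed;
            if (current.length() > 0 && tokenCounter.applyAsInt(candidate) > maxTokens) {
                // Adding this paragraph would exceed the budget: close the chunk
                // and start a new one with the current paragraph.
                chunks.add(current.toString());
                current.setLength(0);
                current.append(trimmed);
            } else {
                current.setLength(0);
                current.append(candidate);
            }
        }
        if (current.length() > 0) {
            chunks.add(current.toString());
        }
        return chunks;
    }
}
```

Passing the token counter as a function keeps the sketch independent of any particular tokenizer, so it can be tried with a simple word counter before wiring in a real encoding.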

knuddelsgmbh / jtokkit

How to split a long text of 9999 tokens into 1000-token chunks? #1