Closed entroxu closed 1 year ago
This depends heavily on your use case. For example, you may want to split on paragraph boundaries to retain the meaning of the paragraph. Take a look at the OpenAI Cookbook or the chatgpt-retrieval-plugin for some inspiration on how to chunk:
- https://github.com/openai/openai-cookbook/blob/main/examples/Embedding_long_inputs.ipynb
- https://github.com/openai/chatgpt-retrieval-plugin/blob/main/services/chunks.py
A naive first approach could be simply chunking the tokens without regard for text boundaries. But be aware that this may not yield optimal results when used for semantic search or similar.
final var enc = registry.getEncoding(...);
final var tokens = enc.encode(myText);
final var temporaryList = new ArrayList<Integer>();
for (final var token : tokens) {
    if (temporaryList.size() >= 1000) {
        // do something with your 1000 token chunk
        temporaryList.clear();
    }
    temporaryList.add(token);
}
// do something with the remainder of tokens in temporaryList
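If you want to try the paragraph-boundary idea mentioned above, here is a rough sketch of how it could look. It is not taken from the Cookbook or the retrieval plugin. The class name `ParagraphChunker`, the method `chunkByParagraph` and the `countTokens` parameter are made up for illustration; for `countTokens` you could pass something like `text -> enc.encode(text).size()`. It assumes paragraphs are separated by blank lines, and it only approximates the token count, because the separators between paragraphs are not counted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToIntFunction;

// Hypothetical helper: packs whole paragraphs into chunks of at most maxTokens tokens.
class ParagraphChunker {

    static List<String> chunkByParagraph(String text, int maxTokens,
                                         ToIntFunction<String> countTokens) {
        final List<String> chunks = new ArrayList<>();
        final StringBuilder current = new StringBuilder();
        int currentTokens = 0;

        // Assumption: a paragraph boundary is a blank line.
        for (final String paragraph : text.split("\\n\\s*\\n")) {
            final String trimmed = paragraph.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            final int paragraphTokens = countTokens.applyAsInt(trimmed);

            // Start a new chunk if this paragraph would exceed the budget.
            if (currentTokens > 0 && currentTokens + paragraphTokens > maxTokens) {
                chunks.add(current.toString());
                current.setLength(0);
                currentTokens = 0;
            }
            if (current.length() > 0) {
                current.append("\n\n");
            }
            current.append(trimmed);
            currentTokens += paragraphTokens;
        }
        // Keep the last, partially filled chunk.
        if (current.length() > 0) {
            chunks.add(current.toString());
        }
        return chunks;
    }
}
```

Note that a single paragraph longer than `maxTokens` still ends up as one oversized chunk in this sketch; you would have to split such a paragraph further, for example with the token-based loop above.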
Thank you. I'm translating this file into Java code now: https://github.com/openai/chatgpt-retrieval-plugin/blob/main/services/chunks.py
def get_text_chunks(……
How do I split a long text of 9999 tokens into chunks of 1000 tokens each?
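For the 9999-token case specifically, one possible way is to slice the already encoded token list into fixed-size pieces. This is only a sketch: the class `TokenChunker` and the method `chunk` are hypothetical names, and it assumes the tokens are available as a `List<Integer>`. It ignores text boundaries entirely.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: cuts a token list into consecutive chunks of chunkSize tokens.
class TokenChunker {

    static List<List<Integer>> chunk(List<Integer> tokens, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        final List<List<Integer>> chunks = new ArrayList<>();
        for (int start = 0; start < tokens.size(); start += chunkSize) {
            // The last chunk may be shorter than chunkSize.
            final int end = Math.min(start + chunkSize, tokens.size());
            // Copy the view so each chunk is independent of the source list.
            chunks.add(new ArrayList<>(tokens.subList(start, end)));
        }
        return chunks;
    }
}
```

With 9999 tokens and a chunk size of 1000 this gives ten chunks: nine with 1000 tokens and a final one with 999. Each inner list can then be handed to whatever decoding or embedding step you use.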