sgt1796 / GPT_embedding


Handle text that exceeds 8192 tokens in GPT_embedding.py #9

Closed: sgt1796 closed this issue 3 weeks ago

sgt1796 commented 2 months ago

The following error appears when the text input to the embedding model exceeds 8192 tokens:

ERROR:root:Chunk 30, Text Index 31: title: 哲理小故事大全|text: 来源：中国儿童文学网　　作者：佚名 1、一天晚上，一群游牧... | Error: Error code: 400 - {'error': {'message': "This model's maximum context length is 8192 tokens, however you requested 17053 tokens (17053 in your prompt; 0 for the completion). Please reduce your prompt; or completion length.", 'type': 'invalid_request_error', 'param': None, 'code': None}}
ERROR:root:Error while running: cannot unpack non-iterable NoneType object

ronaldlindev commented 2 months ago

Thoughts on truncating long stories at 8192 tokens? Or would it be preferable to just implement some new error handling? (Truncation sketch below.)
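
A minimal sketch of what truncation could look like; the `tiktoken` package and the `cl100k_base` encoding are assumptions, not something the repo currently uses:

```python
# Hypothetical truncation helper -- assumes the `tiktoken` package and the
# cl100k_base encoding used by OpenAI embedding models; not repo code.
import tiktoken

def truncate_to_max_tokens(text: str, max_tokens: int = 8191,
                           encoding: str = "cl100k_base") -> str:
    """Hard-truncate `text` so it encodes to at most `max_tokens` tokens."""
    enc = tiktoken.get_encoding(encoding)
    return enc.decode(enc.encode(text)[:max_tokens])
```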

sgt1796 commented 2 months ago

I found this article: https://cookbook.openai.com/examples/embedding_long_inputs#1-model-context-length

Two common practices are:

  1. Simply truncate everything after the max token limit is reached (this loses accuracy).
  2. Divide the input into smaller chunks, embed each chunk, then combine them into one vector with a weighted average; the weights can be the chunks' token counts (see the sketch after this list).
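
A minimal sketch of option 2, loosely following the cookbook article; the model name, `tiktoken`, `numpy`, and the `openai` v1 client are assumptions here, not repo code:

```python
import numpy as np
import tiktoken
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
MODEL = "text-embedding-3-small"  # assumed model; 8192-token context limit
CHUNK_TOKENS = 8191               # leave one token of headroom

def embed_long_text(text: str) -> list[float]:
    """Embed token-sized chunks, then take the token-count-weighted mean."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    embeddings, weights = [], []
    for i in range(0, len(tokens), CHUNK_TOKENS):
        chunk = tokens[i : i + CHUNK_TOKENS]
        # The embeddings endpoint also accepts pre-tokenized input.
        resp = client.embeddings.create(model=MODEL, input=chunk)
        embeddings.append(resp.data[0].embedding)
        weights.append(len(chunk))
    mean = np.average(embeddings, axis=0, weights=weights)
    return (mean / np.linalg.norm(mean)).tolist()  # re-normalize to unit length
```

Re-normalizing at the end matters because a weighted average of unit vectors is generally shorter than unit length, and downstream cosine-similarity code often assumes normalized embeddings.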
sgt1796 commented 3 weeks ago

Using the Jina Segmenter API to split the text into chunks, then taking the weighted mean of those chunks' embeddings (sketch below).
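
A rough sketch of that resolution; the endpoint URL, request fields, and `JINA_API_KEY` variable are assumptions to verify against Jina's docs:

```python
import os
import requests

def segment_text(text: str) -> list[str]:
    """Split text into chunks via the (assumed) Jina Segmenter endpoint."""
    resp = requests.post(
        "https://segment.jina.ai/",
        headers={"Authorization": f"Bearer {os.environ['JINA_API_KEY']}"},
        json={"content": text, "return_chunks": True},
    )
    resp.raise_for_status()
    return resp.json()["chunks"]
```

Each returned chunk can then be embedded separately and combined with the same token-count-weighted mean shown in the earlier sketch.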