pingcap / autoflow

pingcap/autoflow is a conversational knowledge base tool based on Graph RAG, built with TiDB Serverless Vector Storage. Demo: https://tidb.ai
Apache License 2.0

Build Vector Index is failing: Single text cannot exceed 8194 tokens. 8746 tokens given. #397

Open chethanuk opened 2 days ago

chethanuk commented 2 days ago

With the new Jina embedding model jina-embeddings-v3:


Ingesting a large PDF errors out:

background-1  | [2024-11-23 13:48:08,045: ERROR/ForkPoolWorker-5] app.tasks.build_index.build_index_for_document[22305254-69a8-4ec7-bd97-bad0ce25f604]: Failed to build vector index for document 30001: Traceback (most recent call last):
background-1  |   File "/app/app/tasks/build_index.py", line 60, in build_index_for_document
background-1  |     index_service.build_vector_index_for_document(index_session, db_document)
background-1  |   File "/app/app/rag/build_index.py", line 72, in build_vector_index_for_document
background-1  |     vector_index.insert(document, source_uri=db_document.source_uri)
background-1  |   File "/usr/local/lib/python3.11/site-packages/llama_index/core/indices/base.py", line 215, in insert
background-1  |     self.insert_nodes(nodes, **insert_kwargs)
background-1  |   File "/usr/local/lib/python3.11/site-packages/llama_index/core/indices/vector_store/base.py", line 330, in insert_nodes
background-1  |     self._insert(nodes, **insert_kwargs)
background-1  |   File "/usr/local/lib/python3.11/site-packages/llama_index/core/indices/vector_store/base.py", line 311, in _insert
background-1  |     self._add_nodes_to_index(self._index_struct, nodes, **insert_kwargs)
background-1  |   File "/usr/local/lib/python3.11/site-packages/llama_index/core/indices/vector_store/base.py", line 231, in _add_nodes_to_index
background-1  |     nodes_batch = self._get_node_with_embedding(nodes_batch, show_progress)
background-1  |                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
background-1  |   File "/usr/local/lib/python3.11/site-packages/llama_index/core/indices/vector_store/base.py", line 138, in _get_node_with_embedding
background-1  |     id_to_embed_map = embed_nodes(
background-1  |                       ^^^^^^^^^^^^
background-1  |   File "/usr/local/lib/python3.11/site-packages/llama_index/core/indices/utils.py", line 138, in embed_nodes
background-1  |     new_embeddings = embed_model.get_text_embedding_batch(
background-1  |                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
background-1  |   File "/usr/local/lib/python3.11/site-packages/llama_index/core/instrumentation/dispatcher.py", line 265, in wrapper
background-1  |     result = func(*args, **kwargs)
background-1  |              ^^^^^^^^^^^^^^^^^^^^^
background-1  |   File "/usr/local/lib/python3.11/site-packages/llama_index/core/base/embeddings/base.py", line 335, in get_text_embedding_batch
background-1  |     embeddings = self._get_text_embeddings(cur_batch)
background-1  |                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
background-1  |   File "/usr/local/lib/python3.11/site-packages/llama_index/embeddings/jinaai/base.py", line 202, in _get_text_embeddings
background-1  |     return self._api.get_embeddings(
background-1  |            ^^^^^^^^^^^^^^^^^^^^^^^^^
background-1  |   File "/usr/local/lib/python3.11/site-packages/llama_index/embeddings/jinaai/base.py", line 48, in get_embeddings
background-1  |     raise RuntimeError(resp["detail"])
background-1  | RuntimeError: Single text cannot exceed 8194 tokens. 8746 tokens given.
background-1  | 
background-1  | [2024-11-23 13:48:08,185: INFO/ForkPoolWorker-5] Task app.tasks.build_index.build_index_for_document[22305254-69a8-4ec7-bd97-bad0ce25f604] succeeded in 36.22360512241721s: None

Note: in docker compose, the max value is already set:

EMBEDDING_DIMS=1024
# EMBEDDING_MAX_TOKENS should be equal or smaller than the embedding model's max tokens,
# it indicates the max size of document chunks.
EMBEDDING_MAX_TOKENS=8191
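A likely cause of the mismatch (8746 tokens sent despite `EMBEDDING_MAX_TOKENS=8191`) is that the text splitter counts tokens with a different tokenizer than the Jina API does, so a chunk that measures 8191 tokens locally can exceed 8194 tokens server-side. A minimal sketch of a guard that truncates oversized chunks before embedding; `truncate_to_max_tokens` is a hypothetical helper, not part of autoflow, and the `tokenize`/`join` callables stand in for the embedding model's real tokenizer:

```python
# Hypothetical guard: clamp a chunk to the embedding model's token limit
# before sending it to the API. In practice, tokenize/join should come from
# the same tokenizer the embedding service uses (here a whitespace stand-in).
def truncate_to_max_tokens(text, max_tokens, tokenize=str.split, join=" ".join):
    tokens = tokenize(text)
    if len(tokens) <= max_tokens:
        return text  # chunk already fits, return unchanged
    # Drop trailing tokens so the API never rejects the request.
    return join(tokens[:max_tokens])


# Example: a "chunk" of 10 words clamped to an 8-token budget.
chunk = "one two three four five six seven eight nine ten"
safe_chunk = truncate_to_max_tokens(chunk, 8)
```

A safer long-term fix is re-splitting oversized chunks with the model's own tokenizer rather than truncating, since truncation silently drops text from the index; but a clamp like this at least prevents the whole `build_index_for_document` task from failing on one oversized chunk.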