Closed: heyb130 closed this issue 3 days ago
@heyb130 Hello there! I'm here to assist you with any bugs, questions, or contributions you may have. Let's work together to resolve this issue.
To resolve the issue where the `if` statement checking `dataset.indexing_technique` and `embedding_model_type_instance` results in no response or output when summing tokens from `embedding_model_instance.get_tokens(document.page_content)`, you need to ensure that the `embedding_model_instance` is correctly instantiated and that the `get_tokens` method is properly called. Here is a possible solution:

1. Ensure that the `embedding_model_instance` is correctly instantiated.
2. Verify that the `get_tokens` method is called correctly and returns the expected output.

Here is a code snippet that demonstrates how to achieve this:
```python
if dataset.indexing_technique == 'high_quality':
    try:
        model_manager = ModelManager()
        embedding_model_instance = model_manager.get_model_instance(
            tenant_id=dataset.tenant_id,
            provider=dataset.embedding_model_provider,
            model_type=ModelType.TEXT_EMBEDDING,
            model=dataset.embedding_model
        )
        # Ensure that the embedding_model_instance is not None
        if embedding_model_instance:
            tokens = embedding_model_instance.get_tokens(document.page_content)
            total_tokens = sum(tokens)
            print(f"Total tokens: {total_tokens}")
        else:
            print("Embedding model instance is None.")
    except LLMBadRequestError:
        raise ValueError(
            "No Embedding Model available. Please configure a valid provider "
            "in Settings -> Model Provider."
        )
    except ProviderTokenNotInitError as ex:
        raise ValueError(f"The dataset is unavailable, due to: {ex.description}")
else:
    print("Indexing technique is not 'high_quality'.")
```
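The `None` guard above can be exercised in isolation. The following is a minimal, self-contained sketch: `StubEmbeddingModel` and `count_page_tokens` are hypothetical stand-ins for illustration, not Dify APIs.

```python
class StubEmbeddingModel:
    """Hypothetical stand-in for an embedding model instance."""

    def get_tokens(self, text):
        # Pretend tokenizer: one integer token id per whitespace-separated word.
        return [1] * len(text.split())


def count_page_tokens(embedding_model_instance, page_content):
    # Guard against a missing model instance before calling get_tokens,
    # mirroring the `if embedding_model_instance:` check.
    if not embedding_model_instance:
        return 0
    return sum(embedding_model_instance.get_tokens(page_content))


print(count_page_tokens(StubEmbeddingModel(), "hello token counting world"))  # 4
print(count_page_tokens(None, "anything"))  # 0
```

Without the guard, a `None` instance would raise `AttributeError` instead of producing the "no response" symptom described in this issue.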
The try/except snippet ensures that:

1. The `embedding_model_instance` is correctly instantiated.
2. The `get_tokens` method is called on the `embedding_model_instance`.

Additionally, ensure that the `embedding_model_instance` is correctly handled in the `_process_chunk` method, as shown in the `indexing_runner.py` file:
```python
def _process_chunk(self, flask_app, index_processor, chunk_documents, dataset,
                   dataset_document, embedding_model_instance):
    with flask_app.app_context():
        # check document is paused
        self._check_document_paused_status(dataset_document.id)
        tokens = 0
        # Use `and` here: with `or`, a missing embedding_model_instance would
        # still reach get_text_embedding_num_tokens and raise AttributeError.
        if dataset.indexing_technique == 'high_quality' and embedding_model_instance:
            tokens += sum(
                embedding_model_instance.get_text_embedding_num_tokens(
                    [document.page_content]
                )
                for document in chunk_documents
            )
        # load index
        index_processor.load(dataset, chunk_documents, with_keywords=False)
```
This ensures that the tokens are correctly calculated and processed during the chunk processing phase [1][2].
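The chunk-level token summation can be sketched standalone as well. This is an illustrative sketch under assumed semantics: `Doc`, `StubEmbeddingModel`, and `chunk_token_total` below are hypothetical names, and the stub's word-count tokenization is invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    page_content: str


class StubEmbeddingModel:
    """Hypothetical stand-in for an embedding model instance."""

    def get_text_embedding_num_tokens(self, texts):
        # Pretend token counting: number of whitespace-separated words.
        return sum(len(t.split()) for t in texts)


def chunk_token_total(indexing_technique, embedding_model_instance, chunk_documents):
    tokens = 0
    # The guard should use `and`, so the method is never called on a
    # missing (None) embedding model instance.
    if indexing_technique == 'high_quality' and embedding_model_instance:
        tokens += sum(
            embedding_model_instance.get_text_embedding_num_tokens([d.page_content])
            for d in chunk_documents
        )
    return tokens


docs = [Doc("one two"), Doc("three four five")]
print(chunk_token_total('high_quality', StubEmbeddingModel(), docs))  # 5
print(chunk_token_total('high_quality', None, docs))  # 0
print(chunk_token_total('economy', StubEmbeddingModel(), docs))  # 0
```

Note that with `or` in place of `and`, the second call would crash on `None` rather than returning 0; with `and`, both the missing-model and non-`high_quality` cases safely skip token counting.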
To continue talking to Dosu, mention @dosu.
Self Checks
Dify version
main
Cloud or Self Hosted
Self Hosted (Docker)
Steps to reproduce
✔️ Expected Behavior
Solve the problem.
❌ Actual Behavior
No response