langgenius / dify

Dify is an open-source LLM app development platform. Dify's intuitive interface combines AI workflow, RAG pipeline, agent capabilities, model management, observability features and more, letting you quickly go from prototype to production.
https://dify.ai

code is wrong #5605

Closed heyb130 closed 3 days ago

heyb130 commented 3 days ago


Dify version

main

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

(screenshot attached)

✔️ Expected Behavior

Solve the problem.

❌ Actual Behavior

No response

dosubot[bot] commented 3 days ago

@heyb130 Hello there! I'm here to assist you with any bugs, questions, or contributions you may have. Let's work together to resolve this issue.

To resolve the issue where the if statement checking dataset.indexing_technique and embedding_model_type_instance produces no response or output when summing tokens from embedding_model_instance.get_tokens(document.page_content), you need to make sure that the embedding_model_instance is correctly instantiated and that the token-counting method is called correctly. Note that ModelInstance exposes get_text_embedding_num_tokens rather than a get_tokens method. Here is a possible solution:

  1. Ensure that the embedding_model_instance is correctly instantiated.
  2. Verify that the token-counting method, get_text_embedding_num_tokens, is called correctly and returns the expected output (see the sanity check after the snippet below).

Here is a code snippet that demonstrates how to achieve this:

# Imports assume the layout of Dify's api/ codebase
from core.model_manager import ModelManager
from core.model_runtime.entities.model_entities import ModelType
from core.errors.error import LLMBadRequestError, ProviderTokenNotInitError

if dataset.indexing_technique == 'high_quality':
    try:
        model_manager = ModelManager()
        embedding_model_instance = model_manager.get_model_instance(
            tenant_id=dataset.tenant_id,
            provider=dataset.embedding_model_provider,
            model_type=ModelType.TEXT_EMBEDDING,
            model=dataset.embedding_model
        )

        # Ensure that the embedding_model_instance is not None before use
        if embedding_model_instance:
            # ModelInstance exposes get_text_embedding_num_tokens (not
            # get_tokens); it takes a list of texts and returns a token count
            total_tokens = embedding_model_instance.get_text_embedding_num_tokens(
                [document.page_content]
            )
            print(f"Total tokens: {total_tokens}")
        else:
            print("Embedding model instance is None.")
    except LLMBadRequestError:
        raise ValueError("No Embedding Model available. Please configure a valid provider in Settings -> Model Provider.")
    except ProviderTokenNotInitError as ex:
        raise ValueError(f"The dataset is unavailable, due to: {ex.description}")
else:
    print("Indexing technique is not 'high_quality'.")

This code ensures that the embedding model instance is instantiated and checked for None before the token count is computed, and that provider configuration errors are surfaced as clear ValueError messages.

Additionally, ensure that the embedding_model_instance is correctly handled in the _process_chunk method as shown in the indexing_runner.py file:

def _process_chunk(self, flask_app, index_processor, chunk_documents, dataset, dataset_document, embedding_model_instance):
    with flask_app.app_context():
        # Check whether the document has been paused
        self._check_document_paused_status(dataset_document.id)

        tokens = 0
        # Use 'and' rather than 'or' here: with 'or', a 'high_quality'
        # dataset whose embedding_model_instance is None would raise an
        # AttributeError on the call below
        if dataset.indexing_technique == 'high_quality' and embedding_model_instance:
            tokens += sum(
                embedding_model_instance.get_text_embedding_num_tokens(
                    [document.page_content]
                )
                for document in chunk_documents
            )

        # load index
        index_processor.load(dataset, chunk_documents, with_keywords=False)

This ensures that tokens are only counted when an embedding model instance is actually available, and that they are correctly calculated during the chunk processing phase [1][2].
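For context, here is a minimal sketch of how such a chunk processor might be driven. The process_in_chunks wrapper, the chunk size, and the worker count are illustrative assumptions, not Dify's actual scheduling code:

from concurrent.futures import ThreadPoolExecutor

# Illustrative driver for _process_chunk; names and sizes are assumptions
def process_in_chunks(runner, flask_app, index_processor, documents,
                      dataset, dataset_document, embedding_model_instance,
                      chunk_size=10):
    # Split the full document list into fixed-size chunks
    chunks = [documents[i:i + chunk_size]
              for i in range(0, len(documents), chunk_size)]

    with ThreadPoolExecutor(max_workers=4) as executor:
        futures = [
            executor.submit(
                runner._process_chunk, flask_app, index_processor, chunk,
                dataset, dataset_document, embedding_model_instance
            )
            for chunk in chunks
        ]
        # Surface any exception raised inside a worker thread
        for future in futures:
            future.result()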

To continue talking to Dosu, mention @dosu.