langgenius / dify

Dify is an open-source LLM app development platform. Dify's intuitive interface combines AI workflow, RAG pipeline, agent capabilities, model management, observability features and more, letting you quickly go from prototype to production.
https://dify.ai

code is wrong #5605

Closed heyb130 closed 3 days ago

heyb130 commented 3 days ago


Dify version

main

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

(screenshot attached)

✔️ Expected Behavior

Solve the problem.

❌ Actual Behavior

No response

dosubot[bot] commented 3 days ago

@heyb130 Hello there! I'm here to assist you with any bugs, questions, or contributions you may have. Let's work together to resolve this issue.

To resolve the issue where the if statement checking dataset.indexing_technique and embedding_model_type_instance produces no response or output when summing tokens from embedding_model_instance.get_tokens(document.page_content), you need to make sure that the embedding_model_instance is correctly instantiated and that the token-counting method is called correctly. Note that ModelInstance exposes get_text_embedding_num_tokens rather than a get_tokens method. Here is a possible solution:

  1. Ensure that the embedding_model_instance is correctly instantiated.
  2. Verify that the token-counting method, get_text_embedding_num_tokens, is called correctly and returns the expected output (see the sanity check after the snippet below).

Here is a code snippet that demonstrates how to achieve this:

# Imports assume the layout of Dify's api/ codebase
from core.model_manager import ModelManager
from core.model_runtime.entities.model_entities import ModelType
from core.errors.error import LLMBadRequestError, ProviderTokenNotInitError

if dataset.indexing_technique == 'high_quality':
    try:
        model_manager = ModelManager()
        embedding_model_instance = model_manager.get_model_instance(
            tenant_id=dataset.tenant_id,
            provider=dataset.embedding_model_provider,
            model_type=ModelType.TEXT_EMBEDDING,
            model=dataset.embedding_model
        )

        # Ensure that the embedding_model_instance is not None before use
        if embedding_model_instance:
            # ModelInstance exposes get_text_embedding_num_tokens (not
            # get_tokens); it takes a list of texts and returns a token count
            total_tokens = embedding_model_instance.get_text_embedding_num_tokens(
                [document.page_content]
            )
            print(f"Total tokens: {total_tokens}")
        else:
            print("Embedding model instance is None.")
    except LLMBadRequestError:
        raise ValueError("No Embedding Model available. Please configure a valid provider in Settings -> Model Provider.")
    except ProviderTokenNotInitError as ex:
        raise ValueError(f"The dataset is unavailable, due to: {ex.description}")
else:
    print("Indexing technique is not 'high_quality'.")

This code ensures that the embedding model instance is instantiated and checked for None before the token count is computed, and that provider configuration errors are surfaced as clear ValueError messages.

Additionally, ensure that the embedding_model_instance is correctly handled in the _process_chunk method as shown in the indexing_runner.py file:

def _process_chunk(self, flask_app, index_processor, chunk_documents, dataset, dataset_document, embedding_model_instance):
    with flask_app.app_context():
        # Check whether the document has been paused
        self._check_document_paused_status(dataset_document.id)

        tokens = 0
        # Use 'and' rather than 'or' here: with 'or', a 'high_quality'
        # dataset whose embedding_model_instance is None would raise an
        # AttributeError on the call below
        if dataset.indexing_technique == 'high_quality' and embedding_model_instance:
            tokens += sum(
                embedding_model_instance.get_text_embedding_num_tokens(
                    [document.page_content]
                )
                for document in chunk_documents
            )

        # load index
        index_processor.load(dataset, chunk_documents, with_keywords=False)

This ensures that tokens are only counted when an embedding model instance is actually available, and that they are correctly calculated during the chunk processing phase [1][2].
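For context, here is a minimal sketch of how such a chunk processor might be driven. The process_in_chunks wrapper, the chunk size, and the worker count are illustrative assumptions, not Dify's actual scheduling code:

from concurrent.futures import ThreadPoolExecutor

# Illustrative driver for _process_chunk; names and sizes are assumptions
def process_in_chunks(runner, flask_app, index_processor, documents,
                      dataset, dataset_document, embedding_model_instance,
                      chunk_size=10):
    # Split the full document list into fixed-size chunks
    chunks = [documents[i:i + chunk_size]
              for i in range(0, len(documents), chunk_size)]

    with ThreadPoolExecutor(max_workers=4) as executor:
        futures = [
            executor.submit(
                runner._process_chunk, flask_app, index_processor, chunk,
                dataset, dataset_document, embedding_model_instance
            )
            for chunk in chunks
        ]
        # Surface any exception raised inside a worker thread
        for future in futures:
            future.result()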

To continue talking to Dosu, mention @dosu.