langgenius / dify

Dify is an open-source LLM app development platform. Dify's intuitive interface combines AI workflow, RAG pipeline, agent capabilities, model management, observability features and more, letting you quickly go from prototype to production.
https://dify.ai

RAG stuck in embedding with status 'indexing' when using Segmenting in Q&A Format #4564

Closed jafrank88 closed 1 month ago

jafrank88 commented 3 months ago


Dify version

0.6.8

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

Find 1000 files. Try to embed them using "Segmenting in Question & Answer format". There is no way to tell when the process is complete, but checking the log shows that some files get stuck with the status 'indexing' rather than 'available'. See generally https://discord.com/channels/1082486657678311454/1237858420351041576

✔️ Expected Behavior

All embedded files to end up 'available', or a way to restart the process to fix those stuck in 'indexing' status.

❌ Actual Behavior

Some files remain stuck in 'indexing' status after embedding instead of becoming 'available'.

dosubot[bot] commented 3 months ago

To resolve the issue where some files remain stuck in the 'indexing' status when embedding 1000 files using the "Segmenting in Question & Answer format" in Dify software version 0.6.8, self-hosted with Docker, you should ensure that the Celery worker service is properly configured and running. The Celery worker is responsible for processing the queue, which includes tasks like indexing files.

Here are the steps to check and ensure the Celery worker is properly configured:

  1. Check the Celery Worker Configuration: Ensure that the worker service is correctly defined in your docker-compose.yaml file. The configuration should look like this:

    worker:
      image: langgenius/dify-api:0.6.8
      restart: always
      environment:
        CONSOLE_WEB_URL: ''
        MODE: worker
        LOG_LEVEL: INFO
        SECRET_KEY: sk-9f73s3ljTXVcMT3Blb3ljTqtsKiGHXVcMT3BlbkFJLK7U
        DB_USERNAME: postgres
        DB_PASSWORD: difyai123456
        DB_HOST: db
        DB_PORT: 5432
        DB_DATABASE: dify
        REDIS_HOST: redis
        REDIS_PORT: 6379
        REDIS_USERNAME: ''
        REDIS_PASSWORD: difyai123456
        REDIS_DB: 0
        REDIS_USE_SSL: 'false'
        CELERY_BROKER_URL: redis://:difyai123456@redis:6379/1
        STORAGE_TYPE: local
        STORAGE_LOCAL_PATH: storage
        VECTOR_STORE: weaviate
        WEAVIATE_ENDPOINT: http://weaviate:8080
        WEAVIATE_API_KEY: WVF5YThaHlkYwhGUSmCRgsX3tD5ngdN8pkih
  2. Ensure the Worker Service is Running: Make sure the worker service is up and running. You can check the status of the Docker containers by running:

    docker-compose ps

    If the worker service is not running, start it with:

    docker-compose up -d worker
  3. Check Logs for Errors: Inspect the logs of the worker service to see if there are any errors that might be causing the files to remain stuck in the 'indexing' status. You can view the logs with:

    docker-compose logs worker
  4. Verify API Usage and Quotas: Ensure that the embedding model is properly configured and that you have not exceeded any quotas. You can use the Dify Knowledge API to check the status of document embeddings and manage your datasets. For example, to get the document embedding status, you can use the following API call:

    curl --location --request GET 'https://api.dify.ai/v1/datasets/{dataset_id}/documents/{batch}/indexing-status' \
    --header 'Authorization: Bearer {api_key}'

    Additionally, check for any error messages such as provider_quota_exceeded or provider_not_initialize, which indicate issues with the embedding model configuration or quota limits. A small polling sketch based on this endpoint appears after this list.

  5. Check for Document Limits and Errors in the Indexing Task: The document_indexing_task function in api/tasks/document_indexing_task.py includes checks for document limits and handles errors that might occur during the indexing process. Ensure that your batch upload does not exceed the configured limits and that your subscription allows for the number of documents you are trying to index. If an error occurs, the document's indexing_status will be set to 'error' and the error message will be recorded.
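As a companion to step 4, here is a minimal Python polling sketch (not part of the Dify codebase) that repeatedly calls the indexing-status endpoint and reports documents still stuck in 'indexing'. The base URL, dataset ID, batch ID, and API key are placeholders to fill in, and the response fields used here (indexing_status, completed_segments, total_segments) are assumed from the Knowledge API response format:

import time

import requests

API_BASE = 'https://api.dify.ai/v1'  # or your self-hosted API base URL
API_KEY = '{api_key}'                # placeholder: dataset API key
DATASET_ID = '{dataset_id}'          # placeholder: your dataset ID
BATCH = '{batch}'                    # placeholder: batch id returned when the documents were created

def poll_indexing_status():
    """Poll the Knowledge API until no document reports 'indexing' status."""
    url = f'{API_BASE}/datasets/{DATASET_ID}/documents/{BATCH}/indexing-status'
    headers = {'Authorization': f'Bearer {API_KEY}'}
    while True:
        resp = requests.get(url, headers=headers, timeout=30)
        resp.raise_for_status()
        docs = resp.json().get('data', [])
        stuck = [d for d in docs if d.get('indexing_status') == 'indexing']
        if not stuck:
            print("No documents left in 'indexing' status.")
            return
        for d in stuck:
            print(f"{d['id']}: {d.get('completed_segments')}/{d.get('total_segments')} segments")
        time.sleep(30)  # avoid hammering the API

if __name__ == '__main__':
    poll_indexing_status()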

By ensuring that the Celery worker is properly configured and running, verifying the API usage and quotas, and checking for document limits and errors in the indexing task, you should be able to resolve the issue with files remaining stuck in the 'indexing' status.

If the issue persists, you can manually trigger a retry for documents stuck in the 'indexing' status using the recover_document_indexing_task function. Here is the relevant code snippet that shows how to manually trigger a retry for documents:

import logging
import time

import click
from celery import shared_task
from werkzeug.exceptions import NotFound

from core.indexing_runner import DocumentIsPausedException, IndexingRunner
from extensions.ext_database import db
from models.dataset import Document

@shared_task(queue='dataset')
def recover_document_indexing_task(dataset_id: str, document_id: str):
    """
    Async recover document
    :param dataset_id:
    :param document_id:

    Usage: recover_document_indexing_task.delay(dataset_id, document_id)
    """
    logging.info(click.style('Recover document: {}'.format(document_id), fg='green'))
    start_at = time.perf_counter()

    document = db.session.query(Document).filter(
        Document.id == document_id,
        Document.dataset_id == dataset_id
    ).first()

    if not document:
        raise NotFound('Document not found')

    try:
        indexing_runner = IndexingRunner()
        if document.indexing_status in ["waiting", "parsing", "cleaning"]:
            indexing_runner.run([document])
        elif document.indexing_status == "splitting":
            indexing_runner.run_in_splitting_status(document)
        elif document.indexing_status == "indexing":
            indexing_runner.run_in_indexing_status(document)
        end_at = time.perf_counter()
        logging.info(click.style('Processed document: {} latency: {}'.format(document.id, end_at - start_at), fg='green'))
    except DocumentIsPausedException as ex:
        logging.info(click.style(str(ex), fg='yellow'))
    except Exception:
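        # any other failure is silently swallowed here, so a crashed
        # recovery attempt leaves the document in its previous status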
        pass

To manually trigger a retry for documents stuck in the 'indexing' status, you can use the following command:

recover_document_indexing_task.delay(dataset_id, document_id)

Replace dataset_id with the ID of your dataset and document_id with the ID of the document you want to retry.
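If many documents are stuck, you can dispatch the recovery task for each of them in a loop. This is a minimal sketch, assuming it is run inside the api container (for example via `docker-compose exec api flask shell`) where the Dify models and tasks are importable:

from extensions.ext_database import db
from models.dataset import Document
from tasks.recover_document_indexing_task import recover_document_indexing_task

# Find every document still stuck in 'indexing' and queue a recovery task for it.
stuck_documents = db.session.query(Document).filter(
    Document.indexing_status == 'indexing'
).all()

for document in stuck_documents:
    print(f'Queueing recovery for document {document.id} in dataset {document.dataset_id}')
    recover_document_indexing_task.delay(document.dataset_id, document.id)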

Additionally, the Dify platform provides an API endpoint to retry indexing tasks for documents. You can use the DocumentRetryApi to retry multiple documents by sending a POST request with the document_ids in the request body. Here is the relevant implementation:

class DocumentRetryApi(DocumentResource):
    @setup_required
    @login_required
    @account_initialization_required
    def post(self, dataset_id):
        """retry document."""

        parser = reqparse.RequestParser()
        parser.add_argument('document_ids', type=list, required=True, nullable=False,
                            location='json')
        args = parser.parse_args()
        dataset_id = str(dataset_id)
        dataset = DatasetService.get_dataset(dataset_id)
        retry_documents = []
        if not dataset:
            raise NotFound('Dataset not found.')
        for document_id in args['document_ids']:
            try:
                document_id = str(document_id)

                document = DocumentService.get_document(dataset.id, document_id)

                # 404 if document not found
                if document is None:
                    raise NotFound("Document Not Exists.")

                # 403 if document is archived
                if DocumentService.check_archived(document):
                    raise ArchivedDocumentImmutableError()

                # 400 if document is completed
                if document.indexing_status == 'completed':
                    raise DocumentAlreadyFinishedError()
                retry_documents.append(document)
            except Exception as e:
                logging.error(f"Document {document_id} retry failed: {str(e)}")
                continue
        # retry document
        DocumentService.retry_document(dataset_id, retry_documents)

        return {'result': 'success'}, 204

You can call this API endpoint to retry indexing for multiple documents by sending a POST request to /datasets/<uuid:dataset_id>/retry with the document_ids in the request body.
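For illustration, a call to that endpoint might look like the sketch below. Note that DocumentRetryApi is a console API guarded by @login_required, so it requires console authentication rather than a dataset API key; the base URL and the token handling here are assumptions for a self-hosted setup:

import requests

CONSOLE_API = 'http://localhost/console/api'  # assumption: self-hosted console API base URL
DATASET_ID = 'your-dataset-uuid'              # assumption: replace with your dataset ID
HEADERS = {
    # assumption: a valid console access token obtained after logging in
    'Authorization': 'Bearer {console_access_token}',
    'Content-Type': 'application/json',
}

# Retry indexing for several documents at once via the console retry endpoint.
response = requests.post(
    f'{CONSOLE_API}/datasets/{DATASET_ID}/retry',
    headers=HEADERS,
    json={'document_ids': ['document-id-1', 'document-id-2']},
)
# The handler above returns 204 on success.
print(response.status_code)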

#### Details

**Sources**
- [cant delete file](https://github.com/langgenius/dify/issues/321)
- [docker/docker-compose.yaml](https://github.com/langgenius/dify/blob/main/docker/docker-compose.yaml)
- [en/features/datasets/maintain-dataset-via-api.md](https://github.com/langgenius/dify-docs/blob/main/en/features/datasets/maintain-dataset-via-api.md)
- [api/tasks/document_indexing_task.py](https://github.com/langgenius/dify/blob/main/api/tasks/document_indexing_task.py)
- [api/tasks/retry_document_indexing_task.py](https://github.com/langgenius/dify/blob/main/api/tasks/retry_document_indexing_task.py)
- [api/tasks/recover_document_indexing_task.py](https://github.com/langgenius/dify/blob/main/api/tasks/recover_document_indexing_task.py)
- [api/controllers/console/datasets/datasets_document.py](https://github.com/langgenius/dify/blob/main/api/controllers/console/datasets/datasets_document.py)
- [api/tasks/batch_create_segment_to_index_task.py](https://github.com/langgenius/dify/blob/main/api/tasks/batch_create_segment_to_index_task.py)
- [api/config.py](https://github.com/langgenius/dify/blob/main/api/config.py)
- [en/getting-started/install-self-hosted/environments.md](https://github.com/langgenius/dify-docs/blob/main/en/getting-started/install-self-hosted/environments.md)
- [api/.vscode/launch.json](https://github.com/langgenius/dify/blob/main/api/.vscode/launch.json)
- [en/getting-started/readme/specifications-and-technical-features.md](https://github.com/langgenius/dify-docs/blob/main/en/getting-started/readme/specifications-and-technical-features.md)
- [api/core/indexing_runner.py](https://github.com/langgenius/dify/blob/main/api/core/indexing_runner.py)
- [api/tasks/duplicate_document_indexing_task.py](https://github.com/langgenius/dify/blob/main/api/tasks/duplicate_document_indexing_task.py)
**About Dosu**

This response is meant to be useful and save you time. It is **not meant to be a precise solution**, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

To continue the conversation, mention @dosu.

jafrank88 commented 3 months ago

Here is an example of the problem without Q&A embedding: [screenshot]

GitDakky commented 3 months ago

Hi - I am having a similar problem, not exactly the same as mentioned above, but it occurs when trying to index 499 OCR'd PDF docs. Dify attempts to index them but we get an error. When clicking on any of the docs with the error message, one can see that the process has been attempted, as there are example chunks from the document. It is not clear what the error is. Here are the settings being used: [screenshot]

crazywoola commented 3 months ago

Can you provide the logs of the container?

GitDakky commented 3 months ago

@crazywoola unfortunately not. My IT man is off and I don't hold server credentials. I am 99% sure that all docs are OCR-searchable, but have you run into issues where users have uploaded docs that have no text? How does Dify deal with these? Just skip them?

jafrank88 commented 3 months ago

I posted my logs here and can provide more as needed - https://discord.com/channels/1082486657678311454/1237858420351041576

GitDakky commented 3 months ago

Now it's not liking .docx - restarting the server. [screenshot]

GitDakky commented 3 months ago

Just a theory! OpenAI invalidated the API key we were using, and I set up a new key under a project. The new key uses an updated structure that includes an abbreviation of the project name ("difi") in the key, e.g. sk-difi-XXXXXXXXXXetc...

We updated the key in the only place available, but I do recall needing to provide an API key several times for various OAI models during setup. Is it possible that: a) the new key is not being used by the various OAI models, or b) the format of the new OAI key is conflicting with a validation check in the code?

[screenshot]

GitDakky commented 3 months ago

I ran again using ada-002 and it indexed. Something is up with text-embedding-3-large; my guess is it is related to the new API key.

OK - I am outta here. Standing by for updates

jafrank88 commented 3 months ago

I am running everything locally.

jafrank88 commented 3 months ago

I am not seeing any text chunks after embedding using Q&A segmentation. Is it possible that feature is broken? When I turn it off and embed files, I can see the chunks and can retrieve them. I am not sure whether Q&A itself is the issue, or the way it sends the content to the embedding API.

GitDakky commented 2 months ago

Any update on this?

ouyang-yuxuan commented 2 months ago

> I am not seeing any text chunks after embedding using Q&A segmentation. Is it possible that feature is broken? When I turn it off and embed files, I can see the chunks and can retrieve them. I am not sure whether Q&A is the issue or the way it sends the content to the embedding API is the issue.

I have the same problem: Q&A cannot be used, and the logs show no errors.

supuwoerc commented 1 month ago

@dosu Does 'splitting' indicate that embedding is being performed? My document processing progress has stayed at 0, but indexing_status changed from 'waiting' to 'splitting'. I checked the worker's logs in Docker and there is no error; it shows 'Start process document: d77c40d1-4120-411a-8a4d-6ea8520c0bca'.

supuwoerc commented 1 month ago

> @dosu Does 'splitting' indicate that embedding is being performed? My document processing progress has stayed at 0, but indexing_status changed from 'waiting' to 'splitting'. I checked the worker's logs in Docker and there is no error; it shows 'Start process document: d77c40d1-4120-411a-8a4d-6ea8520c0bca'.

My problem was solved; it turned out that text-embedding-ada-002 processing was just very slow. I was about to go to Discord for help when, after pouring a glass of water, I found the progress had started to change ... 🤣