langgenius / dify

Dify is an open-source LLM app development platform. Dify's intuitive interface combines AI workflow, RAG pipeline, agent capabilities, model management, observability features and more, letting you quickly go from prototype to production.
https://dify.ai

RAG stuck in embedding with status 'indexing' when using Segmenting in Q&A Format #4564

Closed jafrank88 closed 1 month ago

jafrank88 commented 3 months ago


Dify version

0.6.8

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

Find 1000 files. Try to embed them using "Segmenting in Question & Answer format". There is no way to tell when the process is complete, but checking the log shows that some files get stuck with the status 'indexing' rather than 'available'. See generally https://discord.com/channels/1082486657678311454/1237858420351041576

✔️ Expected Behavior

All embedded files to end up 'available', or a way to restart the process to fix those stuck in 'indexing' status.

❌ Actual Behavior

Some files remain stuck in 'indexing' status after embedding instead of becoming 'available'.

dosubot[bot] commented 3 months ago

To resolve the issue where some files remain stuck in the 'indexing' status when embedding 1000 files using the "Segmenting in Question & Answer format" in Dify software version 0.6.8, self-hosted with Docker, you should ensure that the Celery worker service is properly configured and running. The Celery worker is responsible for processing the queue, which includes tasks like indexing files.

Here are the steps to check and ensure the Celery worker is properly configured:

  1. Check the Celery Worker Configuration: Ensure that the worker service is correctly defined in your docker-compose.yaml file. The configuration should look like this:

    worker:
      image: langgenius/dify-api:0.6.8
      restart: always
      environment:
        CONSOLE_WEB_URL: ''
        MODE: worker
        LOG_LEVEL: INFO
        SECRET_KEY: sk-9f73s3ljTXVcMT3Blb3ljTqtsKiGHXVcMT3BlbkFJLK7U
        DB_USERNAME: postgres
        DB_PASSWORD: difyai123456
        DB_HOST: db
        DB_PORT: 5432
        DB_DATABASE: dify
        REDIS_HOST: redis
        REDIS_PORT: 6379
        REDIS_USERNAME: ''
        REDIS_PASSWORD: difyai123456
        REDIS_DB: 0
        REDIS_USE_SSL: 'false'
        CELERY_BROKER_URL: redis://:difyai123456@redis:6379/1
        STORAGE_TYPE: local
        STORAGE_LOCAL_PATH: storage
        VECTOR_STORE: weaviate
        WEAVIATE_ENDPOINT: http://weaviate:8080
        WEAVIATE_API_KEY: WVF5YThaHlkYwhGUSmCRgsX3tD5ngdN8pkih
  2. Ensure the Worker Service is Running: Make sure the worker service is up and running. You can check the status of the Docker containers by running:

    docker-compose ps

    If the worker service is not running, start it with:

    docker-compose up -d worker
  3. Check Logs for Errors: Inspect the logs of the worker service to see if there are any errors that might be causing the files to remain stuck in the 'indexing' status. You can view the logs with:

    docker-compose logs worker
  4. Verify API Usage and Quotas: Ensure that the embedding model is properly configured and that you have not exceeded any quotas. You can use the Dify Knowledge API to check the status of document embeddings and manage your datasets. For example, to get the document embedding status, you can use the following API call:

    curl --location --request GET 'https://api.dify.ai/v1/datasets/{dataset_id}/documents/{batch}/indexing-status' \
    --header 'Authorization: Bearer {api_key}'

    Additionally, check for any error messages such as provider_quota_exceeded or provider_not_initialize, which indicate issues with the embedding model configuration or quota limits. A small polling sketch based on this endpoint appears after this list.

  5. Check for Document Limits and Errors in the Indexing Task: The document_indexing_task function in api/tasks/document_indexing_task.py includes checks for document limits and handles errors that might occur during the indexing process. Ensure that your batch upload does not exceed the configured limits and that your subscription allows for the number of documents you are trying to index. If an error occurs, the document's indexing_status will be set to 'error' and the error message will be recorded.
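As a companion to step 4, here is a minimal Python polling sketch (not part of the Dify codebase) that repeatedly calls the indexing-status endpoint and reports documents still stuck in 'indexing'. The base URL, dataset ID, batch ID, and API key are placeholders to fill in, and the response fields used here (indexing_status, completed_segments, total_segments) are assumed from the Knowledge API response format:

import time

import requests

API_BASE = 'https://api.dify.ai/v1'  # or your self-hosted API base URL
API_KEY = '{api_key}'                # placeholder: dataset API key
DATASET_ID = '{dataset_id}'          # placeholder: your dataset ID
BATCH = '{batch}'                    # placeholder: batch id returned when the documents were created

def poll_indexing_status():
    """Poll the Knowledge API until no document reports 'indexing' status."""
    url = f'{API_BASE}/datasets/{DATASET_ID}/documents/{BATCH}/indexing-status'
    headers = {'Authorization': f'Bearer {API_KEY}'}
    while True:
        resp = requests.get(url, headers=headers, timeout=30)
        resp.raise_for_status()
        docs = resp.json().get('data', [])
        stuck = [d for d in docs if d.get('indexing_status') == 'indexing']
        if not stuck:
            print("No documents left in 'indexing' status.")
            return
        for d in stuck:
            print(f"{d['id']}: {d.get('completed_segments')}/{d.get('total_segments')} segments")
        time.sleep(30)  # avoid hammering the API

if __name__ == '__main__':
    poll_indexing_status()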

By ensuring that the Celery worker is properly configured and running, verifying the API usage and quotas, and checking for document limits and errors in the indexing task, you should be able to resolve the issue with files remaining stuck in the 'indexing' status.

If the issue persists, you can manually trigger a retry for documents stuck in the 'indexing' status using the recover_document_indexing_task function. Here is the relevant code snippet that shows how to manually trigger a retry for documents:

import logging
import time

import click
from celery import shared_task
from werkzeug.exceptions import NotFound

from core.indexing_runner import DocumentIsPausedException, IndexingRunner
from extensions.ext_database import db
from models.dataset import Document

@shared_task(queue='dataset')
def recover_document_indexing_task(dataset_id: str, document_id: str):
    """
    Async recover document
    :param dataset_id:
    :param document_id:

    Usage: recover_document_indexing_task.delay(dataset_id, document_id)
    """
    logging.info(click.style('Recover document: {}'.format(document_id), fg='green'))
    start_at = time.perf_counter()

    document = db.session.query(Document).filter(
        Document.id == document_id,
        Document.dataset_id == dataset_id
    ).first()

    if not document:
        raise NotFound('Document not found')

    try:
        indexing_runner = IndexingRunner()
        if document.indexing_status in ["waiting", "parsing", "cleaning"]:
            indexing_runner.run([document])
        elif document.indexing_status == "splitting":
            indexing_runner.run_in_splitting_status(document)
        elif document.indexing_status == "indexing":
            indexing_runner.run_in_indexing_status(document)
        end_at = time.perf_counter()
        logging.info(click.style('Processed document: {} latency: {}'.format(document.id, end_at - start_at), fg='green'))
    except DocumentIsPausedException as ex:
        logging.info(click.style(str(ex), fg='yellow'))
    except Exception:
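        # any other failure is silently swallowed here, so a crashed
        # recovery attempt leaves the document in its previous status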
        pass

To manually trigger a retry for documents stuck in the 'indexing' status, you can use the following command:

recover_document_indexing_task.delay(dataset_id, document_id)

Replace dataset_id with the ID of your dataset and document_id with the ID of the document you want to retry.
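If many documents are stuck, you can dispatch the recovery task for each of them in a loop. This is a minimal sketch, assuming it is run inside the api container (for example via `docker-compose exec api flask shell`) where the Dify models and tasks are importable:

from extensions.ext_database import db
from models.dataset import Document
from tasks.recover_document_indexing_task import recover_document_indexing_task

# Find every document still stuck in 'indexing' and queue a recovery task for it.
stuck_documents = db.session.query(Document).filter(
    Document.indexing_status == 'indexing'
).all()

for document in stuck_documents:
    print(f'Queueing recovery for document {document.id} in dataset {document.dataset_id}')
    recover_document_indexing_task.delay(document.dataset_id, document.id)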

Additionally, the Dify platform provides an API endpoint to retry indexing tasks for documents. You can use the DocumentRetryApi to retry multiple documents by sending a POST request with the document_ids in the request body. Here is the relevant implementation:

class DocumentRetryApi(DocumentResource):
    @setup_required
    @login_required
    @account_initialization_required
    def post(self, dataset_id):
        """retry document."""

        parser = reqparse.RequestParser()
        parser.add_argument('document_ids', type=list, required=True, nullable=False,
                            location='json')
        args = parser.parse_args()
        dataset_id = str(dataset_id)
        dataset = DatasetService.get_dataset(dataset_id)
        retry_documents = []
        if not dataset:
            raise NotFound('Dataset not found.')
        for document_id in args['document_ids']:
            try:
                document_id = str(document_id)

                document = DocumentService.get_document(dataset.id, document_id)

                # 404 if document not found
                if document is None:
                    raise NotFound("Document Not Exists.")

                # 403 if document is archived
                if DocumentService.check_archived(document):
                    raise ArchivedDocumentImmutableError()

                # 400 if document is completed
                if document.indexing_status == 'completed':
                    raise DocumentAlreadyFinishedError()
                retry_documents.append(document)
            except Exception as e:
                logging.error(f"Document {document_id} retry failed: {str(e)}")
                continue
        # retry document
        DocumentService.retry_document(dataset_id, retry_documents)

        return {'result': 'success'}, 204

You can call this API endpoint to retry indexing for multiple documents by sending a POST request to /datasets/<uuid:dataset_id>/retry with the document_ids in the request body.
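For illustration, a call to that endpoint might look like the sketch below. Note that DocumentRetryApi is a console API guarded by @login_required, so it requires console authentication rather than a dataset API key; the base URL and the token handling here are assumptions for a self-hosted setup:

import requests

CONSOLE_API = 'http://localhost/console/api'  # assumption: self-hosted console API base URL
DATASET_ID = 'your-dataset-uuid'              # assumption: replace with your dataset ID
HEADERS = {
    # assumption: a valid console access token obtained after logging in
    'Authorization': 'Bearer {console_access_token}',
    'Content-Type': 'application/json',
}

# Retry indexing for several documents at once via the console retry endpoint.
response = requests.post(
    f'{CONSOLE_API}/datasets/{DATASET_ID}/retry',
    headers=HEADERS,
    json={'document_ids': ['document-id-1', 'document-id-2']},
)
# The handler above returns 204 on success.
print(response.status_code)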

#### Details

**Sources**
- [cant delete file](https://github.com/langgenius/dify/issues/321)
- [docker/docker-compose.yaml](https://github.com/langgenius/dify/blob/main/docker/docker-compose.yaml)
- [en/features/datasets/maintain-dataset-via-api.md](https://github.com/langgenius/dify-docs/blob/main/en/features/datasets/maintain-dataset-via-api.md)
- [api/tasks/document_indexing_task.py](https://github.com/langgenius/dify/blob/main/api/tasks/document_indexing_task.py)
- [api/tasks/retry_document_indexing_task.py](https://github.com/langgenius/dify/blob/main/api/tasks/retry_document_indexing_task.py)
- [api/tasks/recover_document_indexing_task.py](https://github.com/langgenius/dify/blob/main/api/tasks/recover_document_indexing_task.py)
- [api/controllers/console/datasets/datasets_document.py](https://github.com/langgenius/dify/blob/main/api/controllers/console/datasets/datasets_document.py)
- [api/tasks/batch_create_segment_to_index_task.py](https://github.com/langgenius/dify/blob/main/api/tasks/batch_create_segment_to_index_task.py)
- [api/config.py](https://github.com/langgenius/dify/blob/main/api/config.py)
- [en/getting-started/install-self-hosted/environments.md](https://github.com/langgenius/dify-docs/blob/main/en/getting-started/install-self-hosted/environments.md)
- [api/.vscode/launch.json](https://github.com/langgenius/dify/blob/main/api/.vscode/launch.json)
- [en/getting-started/readme/specifications-and-technical-features.md](https://github.com/langgenius/dify-docs/blob/main/en/getting-started/readme/specifications-and-technical-features.md)
- [api/core/indexing_runner.py](https://github.com/langgenius/dify/blob/main/api/core/indexing_runner.py)
- [api/tasks/duplicate_document_indexing_task.py](https://github.com/langgenius/dify/blob/main/api/tasks/duplicate_document_indexing_task.py)
**About Dosu**

This response is meant to be useful and save you time. It is **not meant to be a precise solution**, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

To continue the conversation, mention @dosu.

jafrank88 commented 3 months ago

Here is an example of the problem without Q&A embedding: [screenshot]

GitDakky commented 3 months ago

Hi - I am having a similar problem, not exactly the same as mentioned above, but it occurs when trying to index 499 OCR'd PDF docs. Dify attempts to index them but we get an error. When clicking on any of the docs with the error message, one can see that the process has been attempted, as there are example chunks from the document. It is not clear what the error is. Here are the settings being used: [screenshot]

crazywoola commented 3 months ago

Can you provide the logs of the container?

GitDakky commented 3 months ago

@crazywoola unfortunately not. My IT man is off and I don't hold server credentials. I am 99% sure that all docs are OCR-searchable, but have you run into issues where users have uploaded docs that have no text? How does Dify deal with these? Just skip them?

jafrank88 commented 3 months ago

I posted my logs here and can provide more as needed - https://discord.com/channels/1082486657678311454/1237858420351041576

GitDakky commented 3 months ago

Now it's not liking .docx - restarting the server. [screenshot]

GitDakky commented 3 months ago

Just a theory! OpenAI invalidated the API key we were using, and I set up a new key under a project. The new key uses an updated structure that includes an abbreviation of the project name ("difi") in the key, e.g. sk-difi-XXXXXXXXXXetc...

We updated the key in the only place available, but I do recall needing to provide an API key several times for various OAI models during setup. Is it possible that: a) the new key is not being used by the various OAI models, or b) the format of the new OAI key is conflicting with a validation check in the code?

[screenshot]

GitDakky commented 3 months ago

I ran again using ada-002 and it indexed. Something is up with text-embedding-3-large; my guess is it is related to the new API key.

OK - I am outta here. Standing by for updates

jafrank88 commented 3 months ago

I am running everything locally.

jafrank88 commented 3 months ago

I am not seeing any text chunks after embedding using Q&A segmentation. Is it possible that feature is broken? When I turn it off and embed files, I can see the chunks and can retrieve them. I am not sure whether Q&A itself is the issue, or the way it sends the content to the embedding API.

GitDakky commented 2 months ago

Any update on this?

ouyang-yuxuan commented 2 months ago

> I am not seeing any text chunks after embedding using Q&A segmentation. Is it possible that feature is broken? When I turn it off and embed files, I can see the chunks and can retrieve them. I am not sure whether Q&A is the issue or the way it sends the content to the embedding API is the issue.

I have the same problem: Q&A cannot be used, and the logs show no errors.

supuwoerc commented 1 month ago

@dosu Does 'splitting' indicate that embedding is being performed? My document processing progress has stayed at 0, but indexing_status changed from 'waiting' to 'splitting'. I checked the worker's logs in Docker and there is no error; it shows 'Start process document: d77c40d1-4120-411a-8a4d-6ea8520c0bca'.

supuwoerc commented 1 month ago

> @dosu Does 'splitting' indicate that embedding is being performed? My document processing progress has stayed at 0, but indexing_status changed from 'waiting' to 'splitting'. I checked the worker's logs in Docker and there is no error; it shows 'Start process document: d77c40d1-4120-411a-8a4d-6ea8520c0bca'.

My problem was solved; it turned out that text-embedding-ada-002 processing was just very slow. I was about to go to Discord for help when, after pouring a glass of water, I found the progress had started to change ... 🤣