Closed — jafrank88 closed this issue 1 month ago
To resolve the issue where some files remain stuck in the 'indexing' status when embedding 1000 files using the "Segmenting in Question & Answer format" in Dify software version 0.6.8, self-hosted with Docker, you should ensure that the Celery worker service is properly configured and running. The Celery worker is responsible for processing the queue, which includes tasks like indexing files.
Here are the steps to check and ensure the Celery worker is properly configured:
Check the Celery Worker Configuration:
Ensure that the `worker` service is correctly defined in your `docker-compose.yaml` file. The configuration should look like this:

```yaml
worker:
  image: langgenius/dify-api:0.6.8
  restart: always
  environment:
    CONSOLE_WEB_URL: ''
    MODE: worker
    LOG_LEVEL: INFO
    SECRET_KEY: sk-9f73s3ljTXVcMT3Blb3ljTqtsKiGHXVcMT3BlbkFJLK7U
    DB_USERNAME: postgres
    DB_PASSWORD: difyai123456
    DB_HOST: db
    DB_PORT: 5432
    DB_DATABASE: dify
    REDIS_HOST: redis
    REDIS_PORT: 6379
    REDIS_USERNAME: ''
    REDIS_PASSWORD: difyai123456
    REDIS_DB: 0
    REDIS_USE_SSL: 'false'
    CELERY_BROKER_URL: redis://:difyai123456@redis:6379/1
    STORAGE_TYPE: local
    STORAGE_LOCAL_PATH: storage
    VECTOR_STORE: weaviate
    WEAVIATE_ENDPOINT: http://weaviate:8080
    WEAVIATE_API_KEY: WVF5YThaHlkYwhGUSmCRgsX3tD5ngdN8pkih
```
Ensure the Worker Service is Running:
Make sure the `worker` service is up and running. You can check the status of the Docker containers by running:

```shell
docker-compose ps
```

If the `worker` service is not running, start it with:

```shell
docker-compose up -d worker
```
Check Logs for Errors:
Inspect the logs of the `worker` service for any errors that might be causing files to remain stuck in the 'indexing' status. You can view the logs with:

```shell
docker-compose logs worker
```
Verify API Usage and Quotas:
Ensure that the embedding model is properly configured and that you have not exceeded any quotas. You can use the Dify Knowledge API to check the status of document embeddings and manage your datasets. For example, to get the document embedding status, you can use the following API call:

```shell
curl --location --request GET 'https://api.dify.ai/v1/datasets/{dataset_id}/documents/{batch}/indexing-status' \
--header 'Authorization: Bearer {api_key}'
```

Additionally, check for any error messages such as `provider_quota_exceeded` or `provider_not_initialize`, which indicate issues with the embedding model configuration or quota limits.
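The response from that endpoint can be summarized programmatically. The sketch below is a hedged example: the field names (`data`, `indexing_status`, `completed_segments`, `total_segments`) follow the Knowledge API's documented shape, but verify them against your Dify version before relying on this.

```python
import json

# Hedged sketch: list stuck documents and compute overall segment progress
# from an indexing-status response. Field names are assumptions based on the
# Dify Knowledge API docs; check them against your deployment's responses.
def summarize_indexing(response_text):
    payload = json.loads(response_text)
    docs = payload.get("data", [])
    stuck = [d["id"] for d in docs if d.get("indexing_status") == "indexing"]
    done = sum(d.get("completed_segments") or 0 for d in docs)
    total = sum(d.get("total_segments") or 0 for d in docs)
    return {"stuck_ids": stuck, "progress": done / total if total else 0.0}

sample = ('{"data": [{"id": "a", "indexing_status": "indexing",'
          ' "completed_segments": 5, "total_segments": 10},'
          ' {"id": "b", "indexing_status": "completed",'
          ' "completed_segments": 10, "total_segments": 10}]}')
print(summarize_indexing(sample))  # → {'stuck_ids': ['a'], 'progress': 0.75}
```

Feeding the curl output through a helper like this makes it easy to spot which documents in a 1000-file batch never finished.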
Check for Document Limits and Errors in the Indexing Task:
The `document_indexing_task` function in `api/tasks/document_indexing_task.py` includes checks for document limits and handles errors that might occur during the indexing process. Ensure that your batch upload does not exceed the configured limits and that your subscription allows for the number of documents you are trying to index. If an error occurs, the document's `indexing_status` will be set to 'error' and the error message will be recorded.
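Since each document's status is recorded in the database, you can inspect a self-hosted deployment directly to find stuck documents. The snippet below is illustrative only: Dify stores documents in Postgres, and the table and column names (`documents`, `dataset_id`, `indexing_status`) mirror the `Document` model but should be checked against your schema; `sqlite3` stands in here so the query runs anywhere.

```python
import sqlite3

# Illustrative only: a stand-in schema mimicking Dify's documents table so the
# query for stuck documents can be demonstrated. In a real deployment, run the
# equivalent SELECT against the Postgres 'dify' database instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id TEXT, dataset_id TEXT, indexing_status TEXT)")
conn.executemany(
    "INSERT INTO documents VALUES (?, ?, ?)",
    [("doc-1", "ds-1", "completed"),
     ("doc-2", "ds-1", "indexing"),
     ("doc-3", "ds-1", "indexing")],
)
stuck = conn.execute(
    "SELECT id FROM documents"
    " WHERE indexing_status = 'indexing' AND dataset_id = ?"
    " ORDER BY id",
    ("ds-1",),
).fetchall()
print([row[0] for row in stuck])  # → ['doc-2', 'doc-3']
```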
By ensuring that the Celery worker is properly configured and running, verifying the API usage and quotas, and checking for document limits and errors in the indexing task, you should be able to resolve the issue with files remaining stuck in the 'indexing' status.
If the issue persists, you can manually trigger a retry for documents stuck in the 'indexing' status using the `recover_document_indexing_task` function. Here is the relevant code snippet that shows how to manually trigger a retry for documents:
```python
import logging
import time

import click
from celery import shared_task
from werkzeug.exceptions import NotFound

from core.indexing_runner import DocumentIsPausedException, IndexingRunner
from extensions.ext_database import db
from models.dataset import Document


@shared_task(queue='dataset')
def recover_document_indexing_task(dataset_id: str, document_id: str):
    """
    Async recover document
    :param dataset_id:
    :param document_id:

    Usage: recover_document_indexing_task.delay(dataset_id, document_id)
    """
    logging.info(click.style('Recover document: {}'.format(document_id), fg='green'))
    start_at = time.perf_counter()

    document = db.session.query(Document).filter(
        Document.id == document_id,
        Document.dataset_id == dataset_id
    ).first()

    if not document:
        raise NotFound('Document not found')

    try:
        indexing_runner = IndexingRunner()
        if document.indexing_status in ["waiting", "parsing", "cleaning"]:
            indexing_runner.run([document])
        elif document.indexing_status == "splitting":
            indexing_runner.run_in_splitting_status(document)
        elif document.indexing_status == "indexing":
            indexing_runner.run_in_indexing_status(document)
        end_at = time.perf_counter()
        logging.info(click.style('Processed document: {} latency: {}'.format(document.id, end_at - start_at), fg='green'))
    except DocumentIsPausedException as ex:
        logging.info(click.style(str(ex), fg='yellow'))
    except Exception:
        pass
```
To manually trigger a retry for documents stuck in the 'indexing' status, you can use the following command:

```python
recover_document_indexing_task.delay(dataset_id, document_id)
```

Replace `dataset_id` with the ID of your dataset and `document_id` with the ID of the document you want to retry.
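When many documents are stuck, enqueueing recoveries one by one is tedious. The helper below is a hedged sketch of a batch retry loop: the function itself is hypothetical (not part of Dify), and in a real deployment you would run it from a `flask shell` session inside the api container, passing `recover_document_indexing_task.delay` as `enqueue`.

```python
# Hypothetical helper (not part of Dify): given (dataset_id, document_id)
# pairs for stuck documents, enqueue one recovery task per pair via the
# provided callable and return how many were enqueued.
def retry_stuck_documents(stuck_docs, enqueue):
    count = 0
    for dataset_id, document_id in stuck_docs:
        enqueue(dataset_id, document_id)
        count += 1
    return count

# Demo with a stub in place of recover_document_indexing_task.delay:
queued = []
retry_stuck_documents([("ds-1", "doc-1"), ("ds-1", "doc-2")],
                      lambda ds, doc: queued.append((ds, doc)))
print(queued)  # → [('ds-1', 'doc-1'), ('ds-1', 'doc-2')]
```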
Additionally, the Dify platform provides an API endpoint to retry indexing tasks for documents. You can use the `DocumentRetryApi` to retry multiple documents by sending a POST request with the `document_ids` in the request body. Here is an example of how to use this API:
```python
class DocumentRetryApi(DocumentResource):
    @setup_required
    @login_required
    @account_initialization_required
    def post(self, dataset_id):
        """retry document."""
        parser = reqparse.RequestParser()
        parser.add_argument('document_ids', type=list, required=True, nullable=False,
                            location='json')
        args = parser.parse_args()
        dataset_id = str(dataset_id)
        dataset = DatasetService.get_dataset(dataset_id)
        retry_documents = []
        if not dataset:
            raise NotFound('Dataset not found.')
        for document_id in args['document_ids']:
            try:
                document_id = str(document_id)
                document = DocumentService.get_document(dataset.id, document_id)

                # 404 if document not found
                if document is None:
                    raise NotFound("Document Not Exists.")

                # 403 if document is archived
                if DocumentService.check_archived(document):
                    raise ArchivedDocumentImmutableError()

                # 400 if document is completed
                if document.indexing_status == 'completed':
                    raise DocumentAlreadyFinishedError()
                retry_documents.append(document)
            except Exception as e:
                logging.error(f"Document {document_id} retry failed: {str(e)}")
                continue
        # retry document
        DocumentService.retry_document(dataset_id, retry_documents)
        return {'result': 'success'}, 204
```
You can call this API endpoint to retry indexing for multiple documents by sending a POST request to `/datasets/<uuid:dataset_id>/retry` with the `document_ids` in the request body.
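A client call against that route can be sketched as follows. This is a hedged example: the path comes from the `DocumentRetryApi` route above, but the base URL, API prefix, and authentication are placeholders that depend on your deployment (the endpoint is login-protected, so a valid session or token is required). The request is only constructed here, not sent.

```python
import json
import urllib.request

# Hedged sketch: build (but do not send) the retry request. The host and
# '/console/api' prefix are placeholder assumptions; adjust both, and add
# your deployment's authentication, before sending.
dataset_id = "your-dataset-uuid"            # placeholder
document_ids = ["doc-id-1", "doc-id-2"]     # placeholders
req = urllib.request.Request(
    url=f"http://localhost/console/api/datasets/{dataset_id}/retry",
    data=json.dumps({"document_ids": document_ids}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(req)  # uncomment to actually send the request
print(req.get_method(), req.full_url)
```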
Here is an example of the problem without Q-and-A embedding:
Hi - I am having a similar problem, not exactly the same as mentioned above, but when trying to index 499 OCR'd PDF docs. Dify attempts to index them but we get an error. When clicking on any of the docs with the error message, one can see that the process has been attempted, as there are example chunks from the document. It is not clear what the error is. Here are the settings being used.
Can you provide the logs of the container?
@crazywoola unfortunately not. My IT man is off and I don't hold server credentials. I am 99% sure that all docs are OCR-searchable, but have you run into issues where users have uploaded docs that have no text? How does Dify deal with these? Just skip them?
I posted my logs here and can provide more as needed - https://discord.com/channels/1082486657678311454/1237858420351041576
Now it's not liking docx - restarting server.
Just a theory! OpenAI invalidated the API key we were using and I set up a new key under a project. The API key uses an updated structure containing an abbreviation of the project name "difi", e.g.: sk-difi-XXXXXXXXXXetc...
We updated the key in the only place available, but I do recall needing to provide an API key several times for various OAI models during the setup. Is it possible that: a) the new key is not being used by the various OAI models, or b) the format of the new OAI key is conflicting with a validation check in the code?
I ran again using ada-002 and it indexed. Something up with text-embeddings-3-large - My guess is it is related to the new API key.
OK - I am outta here. Standing by for updates
I am running everything locally.
I am not seeing any text chunks after embedding using Q&A segmentation. Is it possible that feature is broken? When I turn it off and embed files, I can see the chunks and can retrieve them. I am not sure if Q&A is the issue or the way it sends the content to the embedding API is the issue.
Any update on this?
After embedding with Q&A segmentation, I don't see any text chunks. Is it possible that feature is broken? When I turn it off and embed files, I can see the chunks and can retrieve them. I'm not sure whether Q&A itself is the problem or the way it sends the content to the embedding API is the problem.
I have the same problem: Q&A can't be used, and the logs show no errors either.
@dosu Does splitting indicate that embedding is being performed? My document processing progress has been 0 but index_status changed from waiting to splitting, I checked the logs of the worker in docker and there is no error and it shows Start process document: d77c40d1-4120-411a-8a4d- 6ea8520c0bca
My problem was solved. It turned out that text-embedding-ada-002 processing was just too slow. I was about to go to Discord for help when, after pouring a glass of water, I found that the progress had started to change....🤣
Self Checks
Dify version
0.6.8
Cloud or Self Hosted
Self Hosted (Docker)
Steps to reproduce
Take 1000 files and try to embed them using "Segmenting in Question & Answer format". There is no way to tell when the process is complete, but when checking the log, it appears that some files get stuck with the status of 'indexing' rather than 'available'. See generally, https://discord.com/channels/1082486657678311454/1237858420351041576
✔️ Expected Behavior
All embedded files to be 'available', or a way to restart the process to fix those stuck in the 'indexing' status.
❌ Actual Behavior
Some files remain stuck in the 'indexing' status after embedding, rather than all files becoming 'available'.