langgenius / dify

Dify is an open-source LLM app development platform. Dify's intuitive interface combines AI workflow, RAG pipeline, agent capabilities, model management, observability features and more, letting you quickly go from prototype to production.
https://dify.ai

Segments added through API get stuck in 'indexing' state. #6504

Closed racerxdl closed 1 week ago

racerxdl commented 1 month ago

Self Checks

Dify version

0.6.12

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

  1. Create a document with any text (via API)
  2. Add segments to it (via API)
  3. Wait for the document to change from Queued/Indexing -> Available
  4. Try to enable/disable an added segment, or check the API for the segment status
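Step 3 above can be sketched as a small polling helper. `fetch_statuses` stands in for a call to the segment-listing endpoint (the path would be inferred from the curl example later in this thread); everything here is illustrative, not Dify's actual client code:

```python
import time

def wait_until_indexed(fetch_statuses, timeout=120.0, interval=5.0):
    """Poll a status source until no segment reports 'indexing'.

    fetch_statuses is any callable returning the list of segment statuses,
    e.g. a wrapper around GET /v1/datasets/{id}/documents/{id}/segments.
    With this bug, the call times out because every status stays 'indexing'.
    """
    deadline = time.monotonic() + timeout
    while True:
        statuses = fetch_statuses()
        if statuses and "indexing" not in statuses:
            return statuses
        if time.monotonic() >= deadline:
            raise TimeoutError("segments still stuck at 'indexing'")
        time.sleep(interval)
```

Injecting the fetch callable keeps the retry logic separate from the HTTP details, so the same helper works against a stub when testing.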

✔️ Expected Behavior

Segments become indexed/ready after some time, so that a segment can be enabled/disabled.

❌ Actual Behavior

All segments stay stuck in the "indexing" state.


This doesn't happen if I submit the entire text for processing when creating the document, but then the chunker mishandles it and doesn't create the segments properly. So I split the segments for Q&A manually and added them via the API.
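The manual Q&A splitting can be sketched as a helper that cuts the source text at each top-level heading and builds the request body for the segments endpoint. The "Tell me about …" phrasing mirrors the curl example later in the thread; all names here are illustrative:

```python
import re

def split_on_headings(text: str) -> list[tuple[str, str]]:
    """Split markdown into (heading, body) pairs at each top-level '# ' heading."""
    pairs = []
    for block in re.split(r"(?m)^(?=# )", text):
        block = block.strip()
        if not block:
            continue
        heading, _, body = block.partition("\n")
        pairs.append((heading.lstrip("# ").strip(), body.strip()))
    return pairs

def qa_payload(pairs, keywords=()):
    """Build the JSON body for POST .../segments from (title, body) pairs."""
    return {
        "segments": [
            {
                "content": f"Tell me about {title}",
                "answer": f"# {title}\n{body}",
                "keywords": list(keywords),
            }
            for title, body in pairs
        ]
    }
```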

dosubot[bot] commented 1 month ago

Hey @racerxdl! 👋 I'm here to help you with bugs, questions, and becoming a contributor. Let's tackle this issue together!

To resolve the issue where segments added via API get stuck in the "indexing" state, you need to ensure that the indexing process completes successfully and updates the segment status to "completed". Here is a possible solution:

  1. Ensure that the _load method is called after the segments are loaded.
  2. Update the segment status to "completed" after the _load method completes.

Here is the relevant code snippet with the necessary changes:

import datetime
import time

def _load(self, index_processor: BaseIndexProcessor, dataset: Dataset,
          dataset_document: DatasetDocument, documents: list[Document]) -> None:
    """
    Insert the index and update document/segment status to 'completed'.
    """
    embedding_model_instance = None
    if dataset.indexing_technique == 'high_quality':
        embedding_model_instance = self.model_manager.get_model_instance(
            tenant_id=dataset.tenant_id,
            provider=dataset.embedding_model_provider,
            model_type=ModelType.TEXT_EMBEDDING,
            model=dataset.embedding_model
        )

    # The indexing itself happens here; text_docs, process_rule and
    # doc_language are produced by the earlier extract/cleaning steps
    # of the indexing runner and are assumed to be in scope.
    documents = index_processor.transform(text_docs, embedding_model_instance=embedding_model_instance,
                                          process_rule=process_rule, tenant_id=dataset.tenant_id,
                                          doc_language=doc_language)

    # Update segment status to completed
    self._update_segments_by_document(
        dataset_document_id=dataset_document.id,
        update_params={
            DocumentSegment.status: "completed",
            DocumentSegment.indexing_at: datetime.datetime.now(datetime.timezone.utc).replace(tzinfo=None)
        }
    )

    # Update document status to completed
    self._update_document_index_status(
        document_id=dataset_document.id,
        after_indexing_status="completed",
        extra_update_params={
            DatasetDocument.indexing_completed_at: datetime.datetime.now(datetime.timezone.utc).replace(tzinfo=None)
        }
    )

Additionally, you can use the duplicate_document_indexing_task function to handle retries and ensure that the indexing process completes successfully:

from duplicate_document_indexing_task import duplicate_document_indexing_task

# Retry indexing for a specific document ID
duplicate_document_indexing_task.delay(dataset_id, [document_id])

Make sure to call this _load method after the segments are loaded and ensure that the status is updated to "completed" once the indexing process is done. This should resolve the issue of segments getting stuck in the "indexing" state [1][2][3][4].

To continue talking to Dosu, mention @dosu.

JohnJyong commented 1 month ago

Please share the API curl command so we can check the real problem. @racerxdl

racerxdl commented 1 month ago

Sorry for the delay @JohnJyong, here is a small example:

curl -X POST https://dify/v1/datasets/b384e948-2a1a-4c6a-9379-e05094d54ef5/documents/26a7498f-9539-4de4-a894-7a96139dafc4/segments \
    -H "Authorization: Bearer dataset-XXXXXXXXXX" \
    -H "Content-Type: application/json" \
    -d '{
    "segments": [
        {
            "content": "Tell me about Spell - Precision Shot",
            "answer": "# Spell - Precision Shot\nChannels for 2.5 seconds to shoot a powerful arrow. Deals 300% Weapon Power weapon damage to the target. Casting time is reduced by 0.5 seconds per 10 Aether consumed.\n\n* Agressive: True\n* Archetype: Archery\n* Channeling: Yes\n* Cooldown: 30.0 seconds\n* Cost: 5% of mana\n* Needs Target: Yes\n* Skill Points: 3\n* Range: 6 tiles\n* Aether: -50\n* Self Target: No",
            "keywords": [
                "ravendawn",
                "skill",
                "spell",
                "precision shot"
            ]
        }
    ]
}'

crazywoola commented 2 weeks ago

It should be fixed in the latest version.

racerxdl commented 1 week ago

Hey @crazywoola, in which version has it been fixed? I'm testing on 0.7.3 and it still happens.


crazywoola commented 1 week ago

Hello, it seems to be another problem, I will reopen this issue.

JohnJyong commented 1 week ago

There is a conflict between segment creation during document generation and segment creation via the API. New segment creation will be prohibited while a document is being processed.
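A minimal sketch of that restriction, assuming the check is against the parent document's indexing status before any segments are inserted (the real guard lives in Dify's segment service, and these names are hypothetical):

```python
class SegmentCreationNotAllowedError(Exception):
    """Raised when segments are added while the document is still processing."""

def ensure_segments_can_be_added(document_indexing_status: str) -> None:
    """Reject new segments unless the parent document has finished indexing.

    A sketch of the restriction described above; the status names are
    assumptions based on the states seen in this thread.
    """
    if document_indexing_status != "completed":
        raise SegmentCreationNotAllowedError(
            f"document is '{document_indexing_status}'; wait until indexing completes"
        )
```

Checking this before inserting segments avoids the race between the document's own indexing pipeline and API-created segments that this issue describes.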

racerxdl commented 1 week ago

Just to be sure, @crazywoola: is the patch for this issue applied in 0.7.3? (Not the issue @JohnJyong mentioned.) Because when I check the API, it's still stuck at indexing.

That's content I added after upgrading to 0.7.3.

racerxdl commented 1 day ago

Seems to be working on 0.8.0.