Closed · racerxdl closed this 1 week ago
Hey @racerxdl! 👋 I'm here to help you with bugs, questions, and becoming a contributor. Let's tackle this issue together!
To resolve the issue where segments added via API get stuck in the "indexing" state, you need to ensure that the indexing process completes successfully and updates the segment status to "completed". Here is a possible solution:
1. Ensure the `_load` method is called after the segments are loaded.
2. Verify that the segment status is updated to "completed" once the `_load` method completes.

Here is the relevant code snippet with the necessary changes:
```python
def _load(self, index_processor: BaseIndexProcessor, dataset: Dataset,
          dataset_document: DatasetDocument, documents: list[Document]) -> None:
    """
    Insert index and update document/segment status to completed
    """
    embedding_model_instance = None
    if dataset.indexing_technique == 'high_quality':
        embedding_model_instance = self.model_manager.get_model_instance(
            tenant_id=dataset.tenant_id,
            provider=dataset.embedding_model_provider,
            model_type=ModelType.TEXT_EMBEDDING,
            model=dataset.embedding_model
        )

    # chunk nodes by chunk size
    indexing_start_at = time.perf_counter()
    tokens = 0
    chunk_size = 10

    # Assuming the indexing process happens here
    # (text_docs, process_rule and doc_language are prepared earlier in the run)
    documents = index_processor.transform(text_docs, embedding_model_instance=embedding_model_instance,
                                          process_rule=process_rule, tenant_id=dataset.tenant_id,
                                          doc_language=doc_language)

    # Update segment status to completed
    self._update_segments_by_document(
        dataset_document_id=dataset_document.id,
        update_params={
            DocumentSegment.status: "completed",
            DocumentSegment.indexing_at: datetime.datetime.now(datetime.timezone.utc).replace(tzinfo=None)
        }
    )

    # Update document status to completed
    self._update_document_index_status(
        document_id=dataset_document.id,
        after_indexing_status="completed",
        extra_update_params={
            DatasetDocument.indexing_completed_at: datetime.datetime.now(datetime.timezone.utc).replace(tzinfo=None)
        }
    )
```
Additionally, you can use the `duplicate_document_indexing_task` function to handle retries and ensure that the indexing process completes successfully:
```python
from duplicate_document_indexing_task import duplicate_document_indexing_task

# Retry indexing for a specific document ID
duplicate_document_indexing_task.delay(dataset_id, [document_id])
```
Make sure to call this `_load` method after the segments are loaded and ensure that the status is updated to "completed" once the indexing process is done. This should resolve the issue of segments getting stuck in the "indexing" state [1][2][3][4].
Please share the API curl command so we can check the real problem. @racerxdl
Sorry for the delay @JohnJyong, here is a small example:
```shell
curl -X POST https://dify/v1/datasets/b384e948-2a1a-4c6a-9379-e05094d54ef5/documents/26a7498f-9539-4de4-a894-7a96139dafc4/segments \
  -H "Authorization: Bearer dataset-XXXXXXXXXX" \
  -H "Content-Type: application/json" \
  -d '{
    "segments": [
      {
        "content": "Tell me about Spell - Precision Shot",
        "answer": "# Spell - Precision Shot\nChannels for 2.5 seconds to shoot a powerful arrow. Deals 300% Weapon Power weapon damage to the target. Casting time is reduced by 0.5 seconds per 10 Aether consumed.\n\n* Agressive: True\n* Archetype: Archery\n* Channeling: Yes\n* Cooldown: 30.0 seconds\n* Cost: 5% of mana\n* Needs Target: Yes\n* Skill Points: 3\n* Range: 6 tiles\n* Aether: -50\n* Self Target: No",
        "keywords": [
          "ravendawn",
          "skill",
          "spell",
          "precision shot"
        ]
      }
    ]
  }'
```
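For debugging this kind of issue, a small polling helper makes it easy to detect segments that never leave "indexing". This is a sketch; the `fetch_segments` callable is an assumption on my part — in practice it would wrap a GET request for the same dataset/document as the curl call above:

```python
import time

def wait_for_segments(fetch_segments, timeout=60.0, interval=2.0):
    """Poll fetch_segments() until no segment reports status 'indexing'.

    fetch_segments is any callable returning a list of dicts with a
    'status' key (e.g. a wrapper around a segments GET request).
    Raises TimeoutError if segments are still indexing after `timeout`.
    """
    deadline = time.monotonic() + timeout
    while True:
        segments = fetch_segments()
        if all(s.get("status") != "indexing" for s in segments):
            return segments
        if time.monotonic() >= deadline:
            raise TimeoutError("segments still in 'indexing' after %.0fs" % timeout)
        time.sleep(interval)
```

With a healthy server the call returns within a couple of polls; with the bug described in this issue it times out, which is a cheap way to reproduce the report.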
It should be fixed in the latest version.
Hey @crazywoola, which version was it fixed in? I'm testing on 0.7.3 and it still happens.
Hello, this seems to be a different problem; I will reopen this issue.
There is a conflict between segment creation during document generation and segment creation via the API. New segment creation will be prohibited while the document is being processed.
Just to be sure @crazywoola, is the patch for this issue applied in 0.7.3? (Not the issue @JohnJyong mentioned.) Because when I check via the API, it's still stuck at indexing:
That's content I added after upgrading to 0.7.3.
Seems to be working on 0.8.0.
Self Checks
Dify version
0.6.12
Cloud or Self Hosted
Self Hosted (Docker)
Steps to reproduce
✔️ Expected Behavior
Segments to be indexed/ready after some time, being able to enable/disable a segment.
❌ Actual Behavior
All segments stuck at "indexing" state.
This doesn't happen if I add the entire text to be processed when creating the document. But then the chunker mangles the text and doesn't create the segments properly, so I manually separated the segments for Q&A and added them via the API.
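As a sketch of that workaround (the helper name `build_segments_payload` is mine, not part of Dify), the manually separated Q&A pairs can be turned into the request body used by the curl example above:

```python
def build_segments_payload(qa_pairs):
    """Turn (question, answer, keywords) tuples into a segments request body
    shaped like the one in the curl example above."""
    return {
        "segments": [
            {"content": question, "answer": answer, "keywords": list(keywords)}
            for question, answer, keywords in qa_pairs
        ]
    }
```

The resulting dict can then be serialized with `json.dumps` and sent as the POST body for each document.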