kenneth-bro opened 2 weeks ago
To add segments to a document more quickly in Dify version 0.7.2 when self-hosted with Docker, you can use the dataset API's endpoint for adding segments. Here is the relevant API call:
curl --location 'https://api.dify.ai/v1/datasets/{dataset_id}/documents/{document_id}/segments' \
--header 'Authorization: Bearer {api_key}' \
--header 'Content-Type: application/json' \
--data '{"segments": [{"content": "1","answer": "1","keywords": ["a"]}]}'
Replace {dataset_id} with your dataset ID, {document_id} with your document ID, and {api_key} with your API key. Since you are self-hosting, also replace https://api.dify.ai with your own instance's API base URL. The segments array can contain multiple segment objects, so you can add them in a batch, which can help speed up the process.
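A minimal Python sketch of this batching approach, using only the standard library. The base URL, API key, and batch size are placeholders you would adapt; the chunked helper is illustrative, not part of Dify's API:

```python
import json
import urllib.request

API_BASE = "http://localhost/v1"  # assumption: your self-hosted API base URL
API_KEY = "dataset-xxxxxxxx"      # placeholder dataset API key

def chunked(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def add_segments(dataset_id, document_id, segments, batch_size=50):
    """POST segments in batches instead of one request per segment."""
    url = f"{API_BASE}/datasets/{dataset_id}/documents/{document_id}/segments"
    for batch in chunked(segments, batch_size):
        req = urllib.request.Request(
            url,
            data=json.dumps({"segments": batch}).encode(),
            headers={
                "Authorization": f"Bearer {API_KEY}",
                "Content-Type": "application/json",
            },
            method="POST",
        )
        # urlopen raises HTTPError on non-2xx responses
        with urllib.request.urlopen(req) as resp:
            resp.read()
```

Sending 50 segments per request instead of one cuts the number of round trips by the same factor, which is usually where the time goes.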
Additionally, Dify provides a batch import API for adding segments from a CSV file, which can further streamline the process:
curl --location --request POST 'https://api.dify.ai/v1/datasets/{dataset_id}/documents/{document_id}/segments/batch_import' \
--header 'Authorization: Bearer {api_key}' \
--form 'file=@/path/to/your/file.csv'
Ensure your CSV file is formatted correctly, with each row representing a segment. This method allows you to upload multiple segments at once, significantly speeding up the process.
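If your segments live in code rather than a spreadsheet, you can generate the CSV programmatically. Note that the exact column layout is an assumption here, and (as a comment below points out) the batch_import endpoint may not exist in every Dify version, so check your version's docs first:

```python
import csv
import io

def segments_to_csv(segments):
    """Serialize segments to CSV, one row per segment.

    Assumed layout: a header row followed by content,answer columns --
    verify against your Dify version's documentation before uploading.
    """
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["content", "answer"])  # assumed header row
    for seg in segments:
        writer.writerow([seg["content"], seg.get("answer", "")])
    return buf.getvalue()
```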
For more advanced usage, such as creating documents by text or file, and managing document indexing, you can refer to the Dify API documentation. Here are some additional useful API calls:
Create a document by text:
curl --location --request POST 'https://api.dify.ai/v1/datasets/{dataset_id}/document/create_by_text' \
--header 'Authorization: Bearer {api_key}' \
--header 'Content-Type: application/json' \
--data-raw '{
    "name": "Dify",
    "text": "Dify means Do it for you...",
    "indexing_technique": "high_quality",
    "process_rule": {
        "rules": {
            "pre_processing_rules": [{
                "id": "remove_extra_spaces",
                "enabled": true
            }, {
                "id": "remove_urls_emails",
                "enabled": true
            }],
            "segmentation": {
                "separator": "###",
                "max_tokens": 500
            }
        },
        "mode": "custom"
    }
}'
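If you drive this endpoint from Python, the process_rule payload can be built with a small helper. This just mirrors the JSON in the curl example above; the function name and defaults are illustrative:

```python
def make_process_rule(separator="###", max_tokens=500):
    """Build a custom process_rule payload matching the curl example."""
    return {
        "rules": {
            "pre_processing_rules": [
                {"id": "remove_extra_spaces", "enabled": True},
                {"id": "remove_urls_emails", "enabled": True},
            ],
            "segmentation": {
                "separator": separator,
                "max_tokens": max_tokens,
            },
        },
        "mode": "custom",
    }
```

Centralizing the rule in one function keeps the segmentation settings consistent across every document you create.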
Create a document by file:
curl --location --request POST 'https://api.dify.ai/v1/datasets/{dataset_id}/document/create_by_file' \
--header 'Authorization: Bearer {api_key}' \
--form 'data="{
    "name": "Dify",
    "indexing_technique": "high_quality",
    "process_rule": {
        "rules": {
            "pre_processing_rules": [{
                "id": "remove_extra_spaces",
                "enabled": true
            }, {
                "id": "remove_urls_emails",
                "enabled": true
            }],
            "segmentation": {
                "separator": "###",
                "max_tokens": 500
            }
        },
        "mode": "custom"
    }
}"; type=text/plain' \
--form 'file=@"/path/to/file"'
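Embedding the JSON data part next to the file in one multipart request is the fiddly bit if you are not using curl. A stdlib-only sketch of assembling such a body (the helper name and part layout are assumptions modeled on the curl example, not a Dify API):

```python
import json
import uuid

def build_multipart(data_json, file_name, file_bytes):
    """Build a multipart/form-data body with a 'data' JSON part and a 'file' part.

    Returns (body_bytes, content_type) suitable for an HTTP POST.
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="data"\r\n'
        "Content-Type: text/plain\r\n\r\n"
        f"{json.dumps(data_json)}\r\n"
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{file_name}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    )
    # file contents go in raw, followed by the closing boundary
    body = head.encode() + file_bytes + f"\r\n--{boundary}--\r\n".encode()
    content_type = f"multipart/form-data; boundary={boundary}"
    return body, content_type
```

In practice a library such as requests handles this for you via its files= parameter; the sketch is only to show what the request looks like on the wire.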
These additional methods can help you manage your documents and segments more efficiently.
Moreover, Dify's DocumentService and SegmentService classes provide additional functionality for handling documents and segments programmatically. For instance, the multi_create_segment method in SegmentService creates multiple segments in a single operation, which is particularly useful for batch processing:
@classmethod
def multi_create_segment(cls, segments: list, document: Document, dataset: Dataset):
    lock_name = 'multi_add_segment_lock_document_id_{}'.format(document.id)
    with redis_client.lock(lock_name, timeout=600):
        embedding_model = None
        if dataset.indexing_technique == 'high_quality':
            model_manager = ModelManager()
            embedding_model = model_manager.get_model_instance(
                tenant_id=current_user.current_tenant_id,
                provider=dataset.embedding_model_provider,
                model_type=ModelType.TEXT_EMBEDDING,
                model=dataset.embedding_model
            )
        max_position = db.session.query(func.max(DocumentSegment.position)).filter(
            DocumentSegment.document_id == document.id
        ).scalar()
        segment_data_list = []
        for segment_item in segments:
            content = segment_item['content']
            doc_id = str(uuid.uuid4())
            segment_hash = helper.generate_text_hash(content)
            tokens = 0
            if dataset.indexing_technique == 'high_quality' and embedding_model:
                tokens = embedding_model.get_text_embedding_num_tokens(
                    texts=[content]
                )
            segment_document = DocumentSegment(
                tenant_id=current_user.current_tenant_id,
                dataset_id=document.dataset_id,
                document_id=document.id,
                index_node_id=doc_id,
                index_node_hash=segment_hash,
                position=max_position + 1 if max_position else 1,
                content=content,
                word_count=len(content),
                tokens=tokens,
                status='completed',
                indexing_at=datetime.datetime.now(datetime.timezone.utc).replace(tzinfo=None),
                completed_at=datetime.datetime.now(datetime.timezone.utc).replace(tzinfo=None),
                created_by=current_user.id
            )
            if document.doc_form == 'qa_model':
                segment_document.answer = segment_item['answer']
            db.session.add(segment_document)
            segment_data_list.append(segment_document)
        db.session.commit()
        return segment_data_list
Using these methods, you can efficiently manage and add segments to your documents in Dify [1][2][3][4].
What LLM model and Embedding Model are you using?
bge-large-zh-v1.5, deployed with Xinference; the rest of the system has no problems.
In Dify version 0.7.3 the following API call returns a 404; indeed, there is no batch_import endpoint in the documentation.
curl --location --request POST 'https://api.dify.ai/v1/datasets/{dataset_id}/documents/{document_id}/segments/batch_import' \
--header 'Authorization: Bearer {api_key}' \
--form 'file=@/path/to/your/file.csv'
Self Checks
Dify version
0.7.2
Cloud or Self Hosted
Self Hosted (Docker)
Steps to reproduce
I use the dataset API to maintain the knowledge base from code, but uploading is far too slow: it takes a long while to upload just 20 texts, which makes the API unusable for maintaining business data in datasets. The dify-api and dify-worker services report no errors.
✔️ Expected Behavior
Segments are added to the document quickly and correctly.
❌ Actual Behavior
No response