Closed tigflanker closed 1 month ago
To find the document_id and segment_id in the Web UI or command line for updating a document within a knowledge base, you can use the following API calls:
List Documents in a Dataset:
Use this call to find the document_id for a specific document within a dataset.
curl --location --request GET 'https://api.dify.ai/v1/datasets/{dataset_id}/documents' \
--header 'Authorization: Bearer {api_key}'
{
"data": [
{
"id": "document_id_here",
"position": 1,
"data_source_type": "file_upload",
"data_source_info": null,
"dataset_process_rule_id": null,
"name": "dify",
"created_from": "",
"created_by": "",
"created_at": 1681623639,
"tokens": 0,
"indexing_status": "waiting",
"error": null,
"enabled": true,
"disabled_at": null,
"disabled_by": null,
"archived": false
}
],
"has_more": false,
"limit": 20,
"total": 9,
"page": 1
}
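As a minimal Python sketch of this lookup, the helpers below walk the paginated listing and collect the id of every document whose name matches. They assume only the response shape shown above (data, has_more). The fetch_page callable and both helper names are hypothetical, and the idea that the endpoint accepts a page number is an assumption to check against the API documentation.

```python
from typing import Callable, Dict, Iterator, List


def iter_documents(fetch_page: Callable[[int], Dict]) -> Iterator[Dict]:
    """Yield every document, following has_more across pages.

    fetch_page is a hypothetical callable: given a 1-based page number,
    it returns the parsed JSON body of the list-documents call.
    """
    page = 1
    while True:
        body = fetch_page(page)
        yield from body.get("data", [])
        if not body.get("has_more"):
            break
        page += 1


def find_document_ids(fetch_page: Callable[[int], Dict], name: str) -> List[str]:
    """Return the ids of all documents whose name equals `name`."""
    return [
        doc["id"]
        for doc in iter_documents(fetch_page)
        if doc.get("name") == name
    ]
```

In real use, fetch_page would issue the GET request above with the Authorization header and return the decoded JSON. Keeping the HTTP call outside the helpers makes them easy to try against a saved response.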
List Segments of a Document:
Use this call to find the segment_id for a specific segment within a document.
curl --location --request GET 'https://api.dify.ai/v1/datasets/{dataset_id}/documents/{document_id}/segments' \
--header 'Authorization: Bearer {api_key}' \
--header 'Content-Type: application/json'
{
"data": [{
"id": "segment_id_here",
"position": 1,
"document_id": "document_id_here",
"content": "1",
"answer": "1",
"word_count": 25,
"tokens": 0,
"keywords": [
"a"
],
"index_node_id": "",
"index_node_hash": "",
"hit_count": 0,
"enabled": true,
"disabled_at": null,
"disabled_by": null,
"status": "completed",
"created_by": "",
"created_at": 1695312007,
"indexing_at": 1695312007,
"completed_at": 1695312007,
"error": null,
"stopped_at": null
}],
"doc_form": "text_model"
}
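A short Python sketch of the same step: build the segments URL from the two ids, then pick out the ids of the segments whose content contains a given piece of text. The base URL and response fields come from the example above. The function names are hypothetical, and matching on content is just one possible way to identify the segment you want.

```python
from typing import Dict, List

API_BASE = "https://api.dify.ai/v1"  # base URL taken from the curl examples


def segments_url(dataset_id: str, document_id: str) -> str:
    """Build the list-segments URL for one document."""
    return f"{API_BASE}/datasets/{dataset_id}/documents/{document_id}/segments"


def find_segment_ids(body: Dict, text: str) -> List[str]:
    """Return ids of segments whose content contains `text`, ordered by position.

    `body` is the parsed JSON response of the list-segments call.
    """
    matches = [seg for seg in body.get("data", []) if text in seg.get("content", "")]
    matches.sort(key=lambda seg: seg.get("position", 0))
    return [seg["id"] for seg in matches]
```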
Additionally, you can navigate to the knowledge base page in the Web UI and switch to the API ACCESS page from the left navigation. On this page, you can view the dataset API documentation provided by Dify and manage the credentials for accessing the dataset API in API Keys [1][2].
For more detailed information on the API endpoints and their usage, you can refer to the Dify Knowledge API documentation. Here are some relevant endpoints:
Create a document from text:
curl --location --request POST 'https://api.dify.ai/v1/datasets/{dataset_id}/document/create_by_text' \
--header 'Authorization: Bearer {api_key}' \
--header 'Content-Type: application/json' \
--data-raw '{
"name": "text",
"text": "text",
"indexing_technique": "high_quality",
"process_rule": {
"mode": "automatic"
}
}'
Update a document from text:
curl --location --request POST 'https://api.dify.ai/v1/datasets/{dataset_id}/documents/{document_id}/update_by_text' \
--header 'Authorization: Bearer {api_key}' \
--header 'Content-Type: application/json' \
--data-raw '{
"name": "name",
"text": "text"
}'
Update a segment:
curl --location --request POST 'https://api.dify.ai/v1/datasets/{dataset_id}/documents/{document_id}/segments/{segment_id}' \
--header 'Authorization: Bearer {api_key}' \
--header 'Content-Type: application/json' \
--data-raw '{
"segment": {
"content": "1",
"answer": "1",
"keywords": ["a"],
"enabled": false
}
}'
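Once both ids are known, the segment update can also be prepared from Python. The sketch below uses only the standard library and mirrors the curl call above. The function name is hypothetical, and whether the API accepts a segment object that leaves out answer, keywords, or enabled is an assumption.

```python
import json
import urllib.request
from typing import List, Optional

API_BASE = "https://api.dify.ai/v1"  # base URL taken from the curl examples


def build_segment_update(
    api_key: str,
    dataset_id: str,
    document_id: str,
    segment_id: str,
    content: str,
    answer: Optional[str] = None,
    keywords: Optional[List[str]] = None,
    enabled: Optional[bool] = None,
) -> urllib.request.Request:
    """Prepare (but do not send) the POST request that updates one segment."""
    segment = {"content": content}
    # Only include optional fields that the caller actually set.
    if answer is not None:
        segment["answer"] = answer
    if keywords is not None:
        segment["keywords"] = keywords
    if enabled is not None:
        segment["enabled"] = enabled

    url = (
        f"{API_BASE}/datasets/{dataset_id}"
        f"/documents/{document_id}/segments/{segment_id}"
    )
    return urllib.request.Request(
        url,
        data=json.dumps({"segment": segment}).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
```

Passing the returned object to urllib.request.urlopen sends the request. Building it separately lets you inspect the URL and body first.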
To continue talking to Dosu, mention @dosu.
Great !!
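@dosu Dosu, could you help me find a more complete parameter list? I am using the feature for uploading to and updating a knowledge base through the API, as described in the API documentation, for example: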
--header 'Authorization: Bearer {api_key}' \
--form 'data="{"name":"Dify","indexing_technique":"high_quality","process_rule":{"rules":{"pre_processing_rules":[{"id":"remove_extra_spaces","enabled":true},{"id":"remove_urls_emails","enabled":true}],"segmentation":{"separator":"###","max_tokens":500}},"mode":"custom"}}";type=text/plain' \
--form 'file=@"/path/to/file"'
I am hoping to find more complete parameter documentation for the form 'data="{"name":"Dify","indexing_technique":"high_quality","process_rule":{"rules":{"pre_processing_rules":[{"id":"remove_extra_spaces","enabled":true},{"id":"remove_urls_emails","enabled":true}],"segmentation":{"separator":"###","max_tokens":500}},"mode":"custom"}}";type=text/plain' part. Thanks.
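The data field in an API call to the update_by_file endpoint can contain the following parameters:
- original_document_id (optional): Source document ID. Used to re-upload a document or to change its cleaning and segmentation configuration. Missing information is copied from the source document. The source document cannot be an archived document.
  - When original_document_id is passed in, the call performs an update on the document. process_rule is optional; if it is omitted, the segmentation method of the source document is used by default.
  - When original_document_id is not passed in, the call performs a create operation for the document, and process_rule is required.
- indexing_technique: Index mode.
  - high_quality: High quality. Embeds the text with an embedding model and builds a vector database index.
  - economy: Economy. Builds an inverted index based on a keyword table index.
- process_rule: Processing rules.
  - mode (string): Cleaning and segmentation mode, automatic / custom.
  - rules (object): Custom rules (in automatic mode, this field is empty).
    - pre_processing_rules (array[object]): Preprocessing rules.
      - id (string): Unique identifier of the preprocessing rule.
        - remove_extra_spaces: Replace consecutive spaces, newlines, and tabs.
        - remove_urls_emails: Delete URLs and email addresses.
      - enabled (bool): Whether this rule is selected. If no document ID is passed in, it represents the default value.
    - segmentation (object): Segmentation rules.
      - separator: Custom segment identifier. Currently only one separator can be set. The default is \n.
      - max_tokens: Maximum length (in tokens). The default is 1000.
- name (optional): Name of the document.
Example cURL request: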
curl --location --request POST 'https://api.dify.ai/v1/datasets/{dataset_id}/documents/{document_id}/update_by_file' \
--header 'Authorization: Bearer {api_key}' \
--form 'data="{\"name\":\"Dify\",\"indexing_technique\":\"high_quality\",\"process_rule\":{\"rules\":{\"pre_processing_rules\":[{\"id\":\"remove_extra_spaces\",\"enabled\":true},{\"id\":\"remove_urls_emails\",\"enabled\":true}],\"segmentation\":{\"separator\":\"###\",\"max_tokens\":500}},\"mode\":\"custom\"}}";type=text/plain' \
--form 'file=@"/path/to/file"'
Example response:
{
"document": {
"id": "",
"position": 1,
"data_source_type": "upload_file",
"data_source_info": {
"upload_file_id": ""
},
"dataset_process_rule_id": "",
"name": "Dify.txt",
"created_from": "api",
"created_by": "",
"created_at": 1695308667,
"tokens": 0,
"indexing_status": "waiting",
"error": null,
"enabled": true,
"disabled_at": null,
"disabled_by": null,
"archived": false,
"display_status": "queuing",
"word_count": 0,
"hit_count": 0,
"doc_form": "text_model"
},
"batch": "20230921150427533684"
}
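Hand-escaping the JSON inside the data form field is error-prone, so it can help to generate the string instead. The Python sketch below builds a custom process_rule and serializes the data field with json.dumps. The defaults for separator and max_tokens follow the parameter descriptions in this reply. Both function names are hypothetical, and the sketch does not check which fields the server requires.

```python
import json
from typing import Any, Dict, Optional


def custom_process_rule(
    separator: str = "\n",
    max_tokens: int = 1000,
    remove_extra_spaces: bool = True,
    remove_urls_emails: bool = True,
) -> Dict[str, Any]:
    """Build a process_rule object in custom mode."""
    return {
        "mode": "custom",
        "rules": {
            "pre_processing_rules": [
                {"id": "remove_extra_spaces", "enabled": remove_extra_spaces},
                {"id": "remove_urls_emails", "enabled": remove_urls_emails},
            ],
            "segmentation": {"separator": separator, "max_tokens": max_tokens},
        },
    }


def build_data_field(
    name: Optional[str] = None,
    indexing_technique: Optional[str] = None,
    process_rule: Optional[Dict[str, Any]] = None,
    original_document_id: Optional[str] = None,
) -> str:
    """Serialize the value of the `data` form field, skipping unset parameters."""
    fields = {
        "original_document_id": original_document_id,
        "name": name,
        "indexing_technique": indexing_technique,
        "process_rule": process_rule,
    }
    return json.dumps(
        {key: value for key, value in fields.items() if value is not None},
        ensure_ascii=False,
    )
```

For example, build_data_field(name="Dify", indexing_technique="high_quality", process_rule=custom_process_rule("###", 500)) produces the same JSON object as the data value in the cURL request above.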
In addition, other optional parameters such as indexing_technique, process_rule, and name can be included. For example:
import json  # needed for json.dumps below

def update_document_by_file(self, document_id, file_path, extra_params: dict = None):
    r"""
    Update a document by file.
    :param document_id: ID of the document
    :param file_path: Path to the file
    :param extra_params: extra parameters passed to the API, such as indexing_technique, process_rule, name. (optional)
        e.g.
        {
            'name': 'Dify',
            'indexing_technique': 'high_quality',
            'process_rule': {
                'rules': {
                    'pre_processing_rules': [
                        {'id': 'remove_extra_spaces', 'enabled': True},
                        {'id': 'remove_urls_emails', 'enabled': True}
                    ],
                    'segmentation': {
                        'separator': '\n',
                        'max_tokens': 1000
                    }
                },
                'mode': 'custom'
            }
        }
    :return:
    """
    data = {}
    if extra_params is not None and isinstance(extra_params, dict):
        data.update(extra_params)
    url = f"/datasets/{self._get_dataset_id()}/documents/{document_id}/update_by_file"
    # Open the file in a with-block so the handle is closed once the request returns.
    with open(file_path, "rb") as file_obj:
        files = {"file": file_obj}
        return self._send_request_with_files("POST", url, {"data": json.dumps(data)}, files)
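In this example, extra_params can contain indexing_technique, process_rule, and name as optional parameters. process_rule can further contain mode and rules, with pre_processing_rules and segmentation nested under rules; segmentation holds the separator and max_tokens properties [1][2][3][4][5].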
@dosu Very slick!
Self Checks
Provide a description of requested docs changes
Hi, I am using the API functionality of the knowledge base and plan to update the content of a document within a knowledge base.
However, I don't know how to locate the document_id (and furthermore, how to locate the segment_id).
/datasets/{dataset_id}/documents/{document_id}/segments/{segment_id}
I know how to locate the dataset_id, as there is a query command that can be requested with
So, is there a similar command, or more fine-grained query parameters, that can be used to get the document_id?