How to find document_id and segment_id in Web UI or command line?

tigflanker commented 1 month ago

Self Checks

[X] I have searched for existing issues, including closed ones.
[X] I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
[X] [FOR CHINESE USERS] Please be sure to submit issues in English, otherwise they will be closed. Thank you! :)
[X] Please do not modify this template :) and fill in all the required fields.

Provide a description of requested docs changes

Hi, I am using the API functionality of the knowledge base and plan to update the content of a document within a knowledge base.

However, I don't know how to locate the document_id (or, beyond that, the segment_id) needed by paths such as /datasets/{dataset_id}/documents/{document_id}/segments/{segment_id}.

I know how to locate the dataset_id, as there is a query command that can be requested with

curl --location --request GET 'http://xxx/v1/datasets?page=1&limit=20'

So, is there a similar command, or a more refined set of parameters, that can be used to get the document_id?

dosubot[bot] commented 1 month ago

To find the document_id and segment_id in the Web UI or command line for updating a document within a knowledge base, you can use the following API calls:

List Documents in a Dataset:

This will help you find the document_id for a specific document within a dataset.

Command:

curl --location --request GET 'https://api.dify.ai/v1/datasets/{dataset_id}/documents' \
--header 'Authorization: Bearer {api_key}'

Example Output:

{
"data": [
 {
   "id": "document_id_here",
   "position": 1,
   "data_source_type": "file_upload",
   "data_source_info": null,
   "dataset_process_rule_id": null,
   "name": "dify",
   "created_from": "",
   "created_by": "",
   "created_at": 1681623639,
   "tokens": 0,
   "indexing_status": "waiting",
   "error": null,
   "enabled": true,
   "disabled_at": null,
   "disabled_by": null,
   "archived": false
 }
],
"has_more": false,
"limit": 20,
"total": 9,
"page": 1
}
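
As a rough sketch of how the response above could be used from a script, the Python below extracts a document's id by its name and keeps requesting pages while has_more is true. It assumes the documents endpoint accepts the same page and limit query parameters as the datasets call in the question. The helper names (fetch_documents_page, iter_documents, find_document_id) are made up for illustration and are not part of any Dify SDK.

```python
import json
import urllib.request
from typing import Callable, Iterable, Iterator, Optional


def fetch_documents_page(base_url: str, api_key: str, dataset_id: str,
                         page: int, limit: int = 20) -> dict:
    """GET one page of documents (assumes `page`/`limit` query params are accepted)."""
    url = f"{base_url}/datasets/{dataset_id}/documents?page={page}&limit={limit}"
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {api_key}"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def iter_documents(fetch_page: Callable[[int], dict]) -> Iterator[dict]:
    """Yield every document, following pages while `has_more` is true."""
    page = 1
    while True:
        payload = fetch_page(page)
        yield from payload.get("data", [])
        if not payload.get("has_more"):
            break
        page += 1


def find_document_id(documents: Iterable[dict], name: str) -> Optional[str]:
    """Return the `id` of the first document whose `name` matches, else None."""
    for doc in documents:
        if doc.get("name") == name:
            return doc["id"]
    return None


# Hypothetical usage (placeholders, not real credentials):
# fetch = lambda p: fetch_documents_page("https://api.dify.ai/v1", "{api_key}", "{dataset_id}", p)
# document_id = find_document_id(iter_documents(fetch), "dify")
```

Passing the page fetcher in as a callable keeps the paging and lookup logic separate from the HTTP call, so it can be tried against canned responses first.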

List Segments of a Document:

This will help you find the segment_id for a specific segment within a document.

Command:

curl --location --request GET 'https://api.dify.ai/v1/datasets/{dataset_id}/documents/{document_id}/segments' \
--header 'Authorization: Bearer {api_key}' \
--header 'Content-Type: application/json'

Example Output:

{
"data": [{
 "id": "segment_id_here",
 "position": 1,
 "document_id": "document_id_here",
 "content": "1",
 "answer": "1",
 "word_count": 25,
 "tokens": 0,
 "keywords": [
     "a"
 ],
 "index_node_id": "",
 "index_node_hash": "",
 "hit_count": 0,
 "enabled": true,
 "disabled_at": null,
 "disabled_by": null,
 "status": "completed",
 "created_by": "",
 "created_at": 1695312007,
 "indexing_at": 1695312007,
 "completed_at": 1695312007,
 "error": null,
 "stopped_at": null
}],
"doc_form": "text_model"
}
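
If the goal is to pick out one segment to edit, a small filter over the parsed response may be enough. This is only a sketch: find_segment_ids is a hypothetical helper, and it relies only on the "data", "id" and "content" fields shown in the example output above.

```python
from typing import List


def find_segment_ids(payload: dict, needle: str) -> List[str]:
    """Return the ids of segments whose `content` contains `needle`."""
    return [
        seg["id"]
        for seg in payload.get("data", [])
        # `content` may be missing or null, so fall back to an empty string.
        if needle in (seg.get("content") or "")
    ]
```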

Additionally, you can navigate to the knowledge base page in the Web UI and switch to the API ACCESS page from the left navigation. On this page, you can view the dataset API documentation provided by Dify and manage the credentials for accessing the dataset API in API Keys ^[1]^[2].

For more detailed information on the API endpoints and their usage, you can refer to the Dify Knowledge API documentation. Here are some relevant endpoints:

Create a document from text:

curl --location --request POST 'https://api.dify.ai/v1/datasets/{dataset_id}/document/create_by_text' \
--header 'Authorization: Bearer {api_key}' \
--header 'Content-Type: application/json' \
--data-raw '{
  "name": "text",
  "text": "text",
  "indexing_technique": "high_quality",
  "process_rule": {
      "mode": "automatic"
  }
}'

Update a document from text:

curl --location --request POST 'https://api.dify.ai/v1/datasets/{dataset_id}/documents/{document_id}/update_by_text' \
--header 'Authorization: Bearer {api_key}' \
--header 'Content-Type: application/json' \
--data-raw '{
  "name": "name",
  "text": "text"
}'

Update a segment:

curl --location --request POST 'https://api.dify.ai/v1/datasets/{dataset_id}/documents/{document_id}/segments/{segment_id}' \
--header 'Authorization: Bearer {api_key}' \
--header 'Content-Type: application/json' \
--data-raw '{
"segment": {
    "content": "1",
    "answer": "1",
    "keywords": ["a"],
    "enabled": false
}
}'
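
For scripting the segment update rather than running curl by hand, the same request could be assembled with Python's standard library, roughly as below. The function name build_segment_update_request and its parameters are assumptions for illustration. The URL shape, headers and body mirror the curl command above, and the request is only built here, not sent.

```python
import json
import urllib.request
from typing import List, Optional


def build_segment_update_request(base_url: str, api_key: str, dataset_id: str,
                                 document_id: str, segment_id: str, content: str,
                                 answer: Optional[str] = None,
                                 keywords: Optional[List[str]] = None,
                                 enabled: bool = True) -> urllib.request.Request:
    """Build (but do not send) the POST request that updates one segment."""
    segment = {"content": content, "enabled": enabled}
    if answer is not None:
        segment["answer"] = answer
    if keywords is not None:
        segment["keywords"] = keywords
    url = (f"{base_url}/datasets/{dataset_id}/documents/"
           f"{document_id}/segments/{segment_id}")
    body = json.dumps({"segment": segment}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


# To actually send it: urllib.request.urlopen(req)
```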

To continue talking to Dosu, mention @dosu.

tigflanker commented 1 month ago

Great !!

tigflanker commented 1 month ago

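@dosu Dosu, could you help me find a more complete parameter list? I am using the feature for uploading files to update the knowledge base through the API documentation, for example: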

--header 'Authorization: Bearer {api_key}' \
--form 'data="{"name":"Dify","indexing_technique":"high_quality","process_rule":{"rules":{"pre_processing_rules":[{"id":"remove_extra_spaces","enabled":true},{"id":"remove_urls_emails","enabled":true}],"segmentation":{"separator":"###","max_tokens":500}},"mode":"custom"}}";type=text/plain' \
--form 'file=@"/path/to/file"'

I would like to find more complete parameter documentation for the form 'data="{"name":"Dify","indexing_technique":"high_quality","process_rule":{"rules":{"pre_processing_rules":[{"id":"remove_extra_spaces","enabled":true},{"id":"remove_urls_emails","enabled":true}],"segmentation":{"separator":"###","max_tokens":500}},"mode":"custom"}}";type=text/plain' part. Thanks.

dosubot[bot] commented 1 month ago

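The data field in a call to the update_by_file endpoint can contain the following parameters:

original_document_id (optional): Source document ID. Used to re-upload a document or to change its cleaning and segmentation configuration. Missing information is copied from the source document. The source document cannot be an archived document. When original_document_id is passed, the request is treated as an update of that document; process_rule is optional and, if omitted, the source document's segmentation method is used by default. When original_document_id is not passed, the request is treated as a new operation and process_rule is required.
indexing_technique: Indexing mode.
- high_quality: High quality. Embeds with an embedding model and builds a vector database index.
- economy: Economy. Builds an inverted index using a keyword table index.
process_rule: Processing rules.
- mode (string): Cleaning and segmentation mode, automatic or custom.
- rules (object): Custom rules (empty in automatic mode).
  - pre_processing_rules (array[object]): Preprocessing rules.
    - id (string): Unique identifier of the preprocessing rule. One of:
      - remove_extra_spaces: Replace consecutive spaces, newlines, and tabs.
      - remove_urls_emails: Delete URLs and email addresses.
    - enabled (bool): Whether this rule is selected. If no document ID is passed in, it represents the default value.
  - segmentation (object): Segmentation rules.
    - separator: Custom segment identifier; currently only one separator can be set. Default is \n.
    - max_tokens: Maximum length in tokens. Default is 1000.
name (optional): Name of the document.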
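
To make the nesting of these parameters concrete, here is a minimal Python sketch that assembles the JSON string for the data form field. The function build_data_field and its keyword arguments are hypothetical. The field names, the two indexing_technique values, the automatic/custom modes and the defaults for separator and max_tokens are taken from the list above.

```python
import json
from typing import Optional


def build_data_field(name: Optional[str] = None,
                     indexing_technique: str = "high_quality",
                     mode: str = "automatic",
                     separator: str = "\n",
                     max_tokens: int = 1000,
                     remove_extra_spaces: bool = True,
                     remove_urls_emails: bool = True) -> str:
    """Return the JSON string to send as the `data` form field."""
    if indexing_technique not in ("high_quality", "economy"):
        raise ValueError(f"unknown indexing_technique: {indexing_technique!r}")
    if mode not in ("automatic", "custom"):
        raise ValueError(f"unknown mode: {mode!r}")

    process_rule = {"mode": mode}
    if mode == "custom":
        # `rules` is only filled in for custom mode; it is empty in automatic mode.
        process_rule["rules"] = {
            "pre_processing_rules": [
                {"id": "remove_extra_spaces", "enabled": remove_extra_spaces},
                {"id": "remove_urls_emails", "enabled": remove_urls_emails},
            ],
            "segmentation": {"separator": separator, "max_tokens": max_tokens},
        }

    data = {"indexing_technique": indexing_technique, "process_rule": process_rule}
    if name is not None:
        data["name"] = name
    return json.dumps(data)
```

Building the payload with json.dumps rather than writing the string by hand avoids the quote-escaping problems that come up when the JSON is embedded in a shell command.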

Example cURL request:

curl --location --request POST 'https://api.dify.ai/v1/datasets/{dataset_id}/documents/{document_id}/update_by_file' \
--header 'Authorization: Bearer {api_key}' \
--form 'data="{\"name\":\"Dify\",\"indexing_technique\":\"high_quality\",\"process_rule\":{\"rules\":{\"pre_processing_rules\":[{\"id\":\"remove_extra_spaces\",\"enabled\":true},{\"id\":\"remove_urls_emails\",\"enabled\":true}],\"segmentation\":{\"separator\":\"###\",\"max_tokens\":500}},\"mode\":\"custom\"}}";type=text/plain' \
--form 'file=@"/path/to/file"'

Example response:

{
  "document": {
    "id": "",
    "position": 1,
    "data_source_type": "upload_file",
    "data_source_info": {
      "upload_file_id": ""
    },
    "dataset_process_rule_id": "",
    "name": "Dify.txt",
    "created_from": "api",
    "created_by": "",
    "created_at": 1695308667,
    "tokens": 0,
    "indexing_status": "waiting",
    "error": null,
    "enabled": true,
    "disabled_at": null,
    "disabled_by": null,
    "archived": false,
    "display_status": "queuing",
    "word_count": 0,
    "hit_count": 0,
    "doc_form": "text_model"
  },
  "batch": "20230921150427533684"
}

In addition, other optional parameters such as indexing_technique, process_rule, and name can be included. For example:

def update_document_by_file(self, document_id, file_path, extra_params: dict = None):
    """
    Update a document by file.
    :param document_id: ID of the document
    :param file_path: Path to the file
    :param extra_params: extra parameters pass to the API, such as indexing_technique, process_rule, name. (optional)
        e.g.
        {
            'name': 'Dify',
            'indexing_technique': 'high_quality',
            'process_rule': {
                'rules': {
                    'pre_processing_rules': [
                        {'id': 'remove_extra_spaces', 'enabled': True},
                        {'id': 'remove_urls_emails', 'enabled': True}
                    ],
                    'segmentation': {
                        'separator': '\n',
                        'max_tokens': 1000
                    }
                },
                'mode': 'custom'
            }
        }
    :return:
    """
    data = {}
    if extra_params is not None and isinstance(extra_params, dict):
        data.update(extra_params)
    url = f"/datasets/{self._get_dataset_id()}/documents/{document_id}/update_by_file"
    # Open the file in a context manager so the handle is closed after the request.
    # Note: this method needs `import json` at module level.
    with open(file_path, "rb") as f:
        files = {"file": f}
        return self._send_request_with_files("POST", url, {"data": json.dumps(data)}, files)

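In this example, extra_params can contain indexing_technique, process_rule, and name as optional parameters. process_rule can in turn contain mode and rules; rules contains pre_processing_rules and segmentation, and segmentation has the separator and max_tokens properties ^[1]^[2]^[3]^[4]^[5].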

tigflanker commented 1 month ago

@dosu Awesome!

langgenius / dify

How to find document_id and segment_id in Web UI or command line? #8835
