langgenius / dify

Dify is an open-source LLM app development platform. Dify's intuitive interface combines AI workflow, RAG pipeline, agent capabilities, model management, observability features and more, letting you quickly go from prototype to production.
https://dify.ai

How to find document_id and segment_id in Web UI or command line? #8835

Closed: tigflanker closed this issue 1 month ago

tigflanker commented 1 month ago

Provide a description of requested docs changes

Hi, I am using the API functionality of the knowledge base and plan to update the content of a document within a knowledge base.

However, I don't know how to find the document_id (or, beyond that, the segment_id) required by endpoints such as /datasets/{dataset_id}/documents/{document_id}/segments/{segment_id}.

I know how to find the dataset_id, since it can be queried with:

curl --location --request GET 'http://xxx/v1/datasets?page=1&limit=20' 

Is there a similar command, or additional query parameters, that can be used to get the document_id?

dosubot[bot] commented 1 month ago

To find the document_id and segment_id in the Web UI or command line for updating a document within a knowledge base, you can use the following API calls:

  1. List Documents in a Dataset:

    • This will help you find the document_id for a specific document within a dataset.
    • Command:
      curl --location --request GET 'https://api.dify.ai/v1/datasets/{dataset_id}/documents' \
      --header 'Authorization: Bearer {api_key}'
    • Example Output:
      {
        "data": [
          {
            "id": "document_id_here",
            "position": 1,
            "data_source_type": "file_upload",
            "data_source_info": null,
            "dataset_process_rule_id": null,
            "name": "dify",
            "created_from": "",
            "created_by": "",
            "created_at": 1681623639,
            "tokens": 0,
            "indexing_status": "waiting",
            "error": null,
            "enabled": true,
            "disabled_at": null,
            "disabled_by": null,
            "archived": false
          }
        ],
        "has_more": false,
        "limit": 20,
        "total": 9,
        "page": 1
      }
  2. List Segments of a Document:

    • This will help you find the segment_id for a specific segment within a document.
    • Command:
      curl --location --request GET 'https://api.dify.ai/v1/datasets/{dataset_id}/documents/{document_id}/segments' \
      --header 'Authorization: Bearer {api_key}' \
      --header 'Content-Type: application/json'
    • Example Output:
      {
        "data": [
          {
            "id": "segment_id_here",
            "position": 1,
            "document_id": "document_id_here",
            "content": "1",
            "answer": "1",
            "word_count": 25,
            "tokens": 0,
            "keywords": [
              "a"
            ],
            "index_node_id": "",
            "index_node_hash": "",
            "hit_count": 0,
            "enabled": true,
            "disabled_at": null,
            "disabled_by": null,
            "status": "completed",
            "created_by": "",
            "created_at": 1695312007,
            "indexing_at": 1695312007,
            "completed_at": 1695312007,
            "error": null,
            "stopped_at": null
          }
        ],
        "doc_form": "text_model"
      }
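
The two listing calls above can be chained from a script. The following is a minimal Python sketch, not an official client: the helper names (`http_get_json`, `iter_documents`, `find_document_id`, `list_segment_ids`) are hypothetical, and it assumes that the documents endpoint accepts the same `page` and `limit` query parameters as the datasets listing and that `has_more` signals further pages, as the example output suggests.

```python
import json
import urllib.request
from typing import Callable, Iterator, List, Optional

# Base URL taken from the examples in this thread; a self-hosted instance will differ.
BASE_URL = "https://api.dify.ai/v1"


def http_get_json(url: str, api_key: str) -> dict:
    """Send an authenticated GET request and decode the JSON response body."""
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {api_key}"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def iter_documents(dataset_id: str, api_key: str,
                   get: Callable[[str, str], dict] = http_get_json,
                   limit: int = 20) -> Iterator[dict]:
    """Yield every document in a dataset, following pages while has_more is true.

    Assumption: page/limit are accepted here as they are on /datasets.
    """
    page = 1
    while True:
        url = f"{BASE_URL}/datasets/{dataset_id}/documents?page={page}&limit={limit}"
        body = get(url, api_key)
        yield from body.get("data", [])
        if not body.get("has_more"):
            break
        page += 1


def find_document_id(dataset_id: str, api_key: str, name: str,
                     get: Callable[[str, str], dict] = http_get_json) -> Optional[str]:
    """Return the id of the first document whose name matches, or None."""
    for doc in iter_documents(dataset_id, api_key, get):
        if doc.get("name") == name:
            return doc["id"]
    return None


def list_segment_ids(dataset_id: str, document_id: str, api_key: str,
                     get: Callable[[str, str], dict] = http_get_json) -> List[str]:
    """Return the ids of the segments listed for one document."""
    url = f"{BASE_URL}/datasets/{dataset_id}/documents/{document_id}/segments"
    body = get(url, api_key)
    return [seg["id"] for seg in body.get("data", [])]
```

The `get` parameter exists only so the HTTP call can be swapped out, for example in tests. For a self-hosted instance, point `BASE_URL` at your own host, in the `http://xxx/v1` form used earlier in the thread.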

Additionally, you can navigate to the knowledge base page in the Web UI and switch to the API ACCESS page from the left navigation. On this page, you can view the dataset API documentation provided by Dify and manage the credentials for accessing the dataset API in API Keys [1][2].

For more detailed information on the API endpoints and their usage, refer to the Dify Knowledge API documentation.

To continue talking to Dosu, mention @dosu.

tigflanker commented 1 month ago

Great !!

tigflanker commented 1 month ago

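@dosu Dosu, could you help me find a more complete parameter list? I am using the feature for uploading to and updating a knowledge base described in the API documentation, for example: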

--header 'Authorization: Bearer {api_key}' \
--form 'data="{"name":"Dify","indexing_technique":"high_quality","process_rule":{"rules":{"pre_processing_rules":[{"id":"remove_extra_spaces","enabled":true},{"id":"remove_urls_emails","enabled":true}],"segmentation":{"separator":"###","max_tokens":500}},"mode":"custom"}}";type=text/plain' \
--form 'file=@"/path/to/file"'

I'd like to find more complete parameter documentation for the form 'data="{"name":"Dify","indexing_technique":"high_quality","process_rule":{"rules":{"pre_processing_rules":[{"id":"remove_extra_spaces","enabled":true},{"id":"remove_urls_emails","enabled":true}],"segmentation":{"separator":"###","max_tokens":500}},"mode":"custom"}}";type=text/plain' part. Thanks.

dosubot[bot] commented 1 month ago

The data field in an API call to the update_by_file endpoint can contain the following parameters:

Example cURL request:

curl --location --request POST 'https://api.dify.ai/v1/datasets/{dataset_id}/documents/{document_id}/update_by_file' \
--header 'Authorization: Bearer {api_key}' \
--form 'data="{\"name\":\"Dify\",\"indexing_technique\":\"high_quality\",\"process_rule\":{\"rules\":{\"pre_processing_rules\":[{\"id\":\"remove_extra_spaces\",\"enabled\":true},{\"id\":\"remove_urls_emails\",\"enabled\":true}],\"segmentation\":{\"separator\":\"###\",\"max_tokens\":500}},\"mode\":\"custom\"}}";type=text/plain' \
--form 'file=@"/path/to/file"'

Example response:

{
  "document": {
    "id": "",
    "position": 1,
    "data_source_type": "upload_file",
    "data_source_info": {
      "upload_file_id": ""
    },
    "dataset_process_rule_id": "",
    "name": "Dify.txt",
    "created_from": "api",
    "created_by": "",
    "created_at": 1695308667,
    "tokens": 0,
    "indexing_status": "waiting",
    "error": null,
    "enabled": true,
    "disabled_at": null,
    "disabled_by": null,
    "archived": false,
    "display_status": "queuing",
    "word_count": 0,
    "hit_count": 0,
    "doc_form": "text_model"
  },
  "batch": "20230921150427533684"
}

In addition, other optional parameters can be included, such as indexing_technique, process_rule, and name. For example:

# Method excerpt from a knowledge-base client class; the module needs `import json`.
def update_document_by_file(self, document_id, file_path, extra_params: dict = None):
    """
    Update a document by file.

    :param document_id: ID of the document
    :param file_path: Path to the file
    :param extra_params: Extra parameters passed to the API, such as
        indexing_technique, process_rule, name (optional), e.g.
        {
            'name': 'Dify',
            'indexing_technique': 'high_quality',
            'process_rule': {
                'rules': {
                    'pre_processing_rules': [
                        {'id': 'remove_extra_spaces', 'enabled': True},
                        {'id': 'remove_urls_emails', 'enabled': True}
                    ],
                    'segmentation': {
                        'separator': '\n',
                        'max_tokens': 1000
                    }
                },
                'mode': 'custom'
            }
        }
    :return: The result of _send_request_with_files
    """
    data = {}
    if isinstance(extra_params, dict):
        data.update(extra_params)
    url = f"/datasets/{self._get_dataset_id()}/documents/{document_id}/update_by_file"
    # Open the file in a context manager so the handle is closed after the request.
    with open(file_path, "rb") as f:
        files = {"file": f}
        return self._send_request_with_files("POST", url, {"data": json.dumps(data)}, files)

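In this example, extra_params can contain indexing_technique, process_rule, and name as optional parameters. process_rule can further contain mode and rules; rules contains pre_processing_rules and segmentation, and segmentation contains the separator and max_tokens properties [1][2][3][4][5].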
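
The nesting described above can be built programmatically, which avoids hand-escaping quotes in the `data` form field. The following is a minimal sketch, not part of the Dify SDK: `build_update_data` is a hypothetical helper, and its defaults simply mirror the values used in the cURL example in this thread.

```python
import json
from typing import Optional


def build_update_data(name: Optional[str] = None,
                      indexing_technique: Optional[str] = None,
                      separator: str = "###",
                      max_tokens: int = 500) -> str:
    """Serialize the JSON string sent as the `data` form field (hypothetical helper)."""
    process_rule = {
        "mode": "custom",
        "rules": {
            # Rule ids copied from the example request in this thread.
            "pre_processing_rules": [
                {"id": "remove_extra_spaces", "enabled": True},
                {"id": "remove_urls_emails", "enabled": True},
            ],
            "segmentation": {"separator": separator, "max_tokens": max_tokens},
        },
    }
    data = {"process_rule": process_rule}
    # name and indexing_technique are optional, so include them only when given.
    if name is not None:
        data["name"] = name
    if indexing_technique is not None:
        data["indexing_technique"] = indexing_technique
    return json.dumps(data)


if __name__ == "__main__":
    print(build_update_data(name="Dify", indexing_technique="high_quality"))
```

The returned string is what a client would pass as the value of the `data` form field, alongside the `file` part.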


tigflanker commented 1 month ago

@dosu Awesome!