langgenius / dify

Dify is an open-source LLM app development platform. Dify's intuitive interface combines AI workflow, RAG pipeline, agent capabilities, model management, observability features and more, letting you quickly go from prototype to production.
https://dify.ai

Using the dataset API to add segments is too slow #7717

Open kenneth-bro opened 2 weeks ago

kenneth-bro commented 2 weeks ago


Dify version

0.7.2

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

Using the dataset API to add segments to a knowledge base is very slow: uploading just 20 texts takes a long time, so the API cannot be used to maintain dataset content for business data. The dify-api and dify-worker containers show no errors.

✔️ Expected Behavior

Segments should be added to the document quickly and correctly.

❌ Actual Behavior

No response

dosubot[bot] commented 2 weeks ago

To add segments to a document more quickly in Dify version 0.7.2 when self-hosted with Docker, you can use the dataset API's endpoint for adding segments. Here is the relevant API call:

curl --location 'https://api.dify.ai/v1/datasets/{dataset_id}/documents/{document_id}/segments' \
--header 'Authorization: Bearer {api_key}' \
--header 'Content-Type: application/json' \
--data '{"segments": [{"content": "1","answer": "1","keywords": ["a"]}]}'

Replace {dataset_id} with your dataset ID, {document_id} with your document ID, and {api_key} with your API key. The segments array can contain multiple segment objects to add them in a batch, which can help speed up the process.
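If you are currently issuing one request per segment, batching is the main lever. Here is a minimal Python sketch of the same endpoint using only the standard library; the `chunk` helper and `batch_size` value are our own conveniences, not part of Dify's API, and the base URL should point at your self-hosted instance:

```python
import json
import urllib.request

def chunk(items, size):
    """Split a list into batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def add_segments(base_url, api_key, dataset_id, document_id, segments, batch_size=50):
    """POST segments in batches instead of one request per segment."""
    url = f"{base_url}/v1/datasets/{dataset_id}/documents/{document_id}/segments"
    for batch in chunk(segments, batch_size):
        req = urllib.request.Request(
            url,
            data=json.dumps({"segments": batch}).encode("utf-8"),
            headers={
                "Authorization": f"Bearer {api_key}",
                "Content-Type": "application/json",
            },
            method="POST",
        )
        with urllib.request.urlopen(req) as resp:
            resp.read()  # raise on HTTP errors via urlopen; body not needed here
```

For example, 120 segments with `batch_size=50` become three requests of 50, 50, and 20 segments instead of 120 round trips.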

Additionally, Dify provides a batch import API for adding segments from a CSV file, which can further streamline the process:

curl --location --request POST 'https://api.dify.ai/v1/datasets/{dataset_id}/documents/{document_id}/segments/batch_import' \
--header 'Authorization: Bearer {api_key}' \
--form 'file=@/path/to/your/file.csv'

Ensure your CSV file is formatted correctly, with each row representing a segment. This method allows you to upload multiple segments at once, significantly speeding up the process.
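A CSV for batch import can be generated with Python's csv module. Note that the column names below (`content`, `answer`) are an assumption; verify them against the CSV template your Dify version provides, and note a later comment in this thread reports the batch_import endpoint missing in some versions:

```python
import csv

def write_segments_csv(path, segments):
    """Write one segment per row. Header names are assumed --
    check the template your Dify version expects."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["content", "answer"])
        for seg in segments:
            writer.writerow([seg["content"], seg.get("answer", "")])

write_segments_csv("segments.csv", [
    {"content": "What is Dify?", "answer": "An LLM app platform."},
    {"content": "Plain segment without an answer."},
])
```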

For more advanced usage, such as creating documents by text or file, and managing document indexing, you can refer to the Dify API documentation. Here are some additional useful API calls:

Create a document by text:

curl --location --request POST 'https://api.dify.ai/v1/datasets/{dataset_id}/document/create_by_text' \
--header 'Authorization: Bearer {api_key}' \
--header 'Content-Type: application/json' \
--data-raw '{
    "name": "Dify",
    "text": "Dify means Do it for you...",
    "indexing_technique": "high_quality",
    "process_rule": {
        "rules": {
            "pre_processing_rules": [{
                "id": "remove_extra_spaces",
                "enabled": true
            }, {
                "id": "remove_urls_emails",
                "enabled": true
            }],
            "segmentation": {
                "separator": "###",
                "max_tokens": 500
            }
        },
        "mode": "custom"
    }
}'
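The same request body can be assembled in Python before sending. The builder below is a local convenience mirroring the curl payload above, not a Dify API; its parameters are hypothetical knobs for the fields most likely to vary:

```python
def build_create_by_text_payload(name, text, separator="###", max_tokens=500):
    """Assemble the create_by_text request body shown in the curl example."""
    return {
        "name": name,
        "text": text,
        "indexing_technique": "high_quality",
        "process_rule": {
            "mode": "custom",
            "rules": {
                "pre_processing_rules": [
                    {"id": "remove_extra_spaces", "enabled": True},
                    {"id": "remove_urls_emails", "enabled": True},
                ],
                "segmentation": {"separator": separator, "max_tokens": max_tokens},
            },
        },
    }
```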

Create a document by file:

curl --location POST 'https://api.dify.ai/v1/datasets/{dataset_id}/document/create_by_file' \
--header 'Authorization: Bearer {api_key}' \
--form 'data="{
    "name": "Dify",
    "indexing_technique": "high_quality",
    "process_rule": {
        "rules": {
            "pre_processing_rules": [{
                "id": "remove_extra_spaces",
                "enabled": true
            }, {
                "id": "remove_urls_emails",
                "enabled": true
            }],
            "segmentation": {
                "separator": "###",
                "max_tokens": 500
            }
        },
        "mode": "custom"
    }
}"; type=text/plain' \
--form 'file=@"/path/to/file"'

These additional methods can help you manage your documents and segments more efficiently.

Moreover, Dify's DocumentService and SegmentService classes provide additional functionalities for handling documents and segments programmatically. For instance, the multi_create_segment method in SegmentService allows for the creation of multiple segments in a single operation, which can be particularly useful for batch processing:

@classmethod
def multi_create_segment(cls, segments: list, document: Document, dataset: Dataset):
    lock_name = 'multi_add_segment_lock_document_id_{}'.format(document.id)
    with redis_client.lock(lock_name, timeout=600):
        embedding_model = None
        if dataset.indexing_technique == 'high_quality':
            model_manager = ModelManager()
            embedding_model = model_manager.get_model_instance(
                tenant_id=current_user.current_tenant_id,
                provider=dataset.embedding_model_provider,
                model_type=ModelType.TEXT_EMBEDDING,
                model=dataset.embedding_model
            )
        max_position = db.session.query(func.max(DocumentSegment.position)).filter(
            DocumentSegment.document_id == document.id
        ).scalar()
        segment_data_list = []
        for segment_item in segments:
            content = segment_item['content']
            doc_id = str(uuid.uuid4())
            segment_hash = helper.generate_text_hash(content)
            tokens = 0
            if dataset.indexing_technique == 'high_quality' and embedding_model:
                tokens = embedding_model.get_text_embedding_num_tokens(
                    texts=[content]
                )
            segment_document = DocumentSegment(
                tenant_id=current_user.current_tenant_id,
                dataset_id=document.dataset_id,
                document_id=document.id,
                index_node_id=doc_id,
                index_node_hash=segment_hash,
                position=max_position + 1 if max_position else 1,
                content=content,
                word_count=len(content),
                tokens=tokens,
                status='completed',
                indexing_at=datetime.datetime.now(datetime.timezone.utc).replace(tzinfo=None),
                completed_at=datetime.datetime.now(datetime.timezone.utc).replace(tzinfo=None),
                created_by=current_user.id
            )
            if document.doc_form == 'qa_model':
                segment_document.answer = segment_item['answer']
            db.session.add(segment_document)
            segment_data_list.append(segment_document)
        db.session.commit()
        return segment_data_list

Using these methods, you can efficiently manage and add segments to your documents in Dify [1][2][3][4].

To continue talking to Dosu, mention @dosu.

crazywoola commented 2 weeks ago

What LLM model and Embedding Model are you using?

kenneth-bro commented 2 weeks ago

What LLM model and Embedding Model are you using?

bge-large-zh-v1.5, served via Xinference; other systems have no problem with it.

fedecompa commented 1 week ago

In Dify version 0.7.3 the following API returns a 404; indeed, the batch_import endpoint does not appear in the documentation.

curl --location --request POST 'https://api.dify.ai/v1/datasets/{dataset_id}/documents/{document_id}/segments/batch_import' \
--header 'Authorization: Bearer {api_key}' \
--form 'file=@/path/to/your/file.csv'