Update segments via api

jos-bek commented 1 month ago

Self Checks

[X] I have searched for existing issues search for existing issues, including closed ones.
[X] I confirm that I am using English to submit report (我已阅读并同意 Language Policy).
[X] 请务必使用英文提交 Issue，否则会被关闭。谢谢！:）
[X] Please do not modify this template :) and fill in all the required fields.

Provide a description of requested docs changes

Hi, I'm currently trying to update segments via the API. I have noticed a few things that do not quite match the documentation described. The request body described is not quite correct:

<Property name='segment' type='object list' key='segment'> It should be: <Property name='segment' type='object' key='segment'>

I also noticed that if keywords is not defined in the request body, the segment is disabled and the error keywords is set to the segment object. Keywords are described as optional in the documentation. I would also expect the API to return an Error if it is required.

Enabling and disabling segments also does not behave as expected. I thought I could simply set enabled to true or false. Unfortunately, sometimes it has no effect and I can't really understand why.

dosubot[bot] commented 1 month ago

Hello @jos-bek, I'm here to help you with any bugs, questions, or contributions you may have. Let me know how I can assist you!

To clarify the behavior for updating segments via the API, here are the key points regarding the request body format, the optional nature of keywords, and the behavior of the enabled property:

Request Body Format:
- The request body should be a dictionary containing the fields to be updated.
- Required fields:
  - content: The content of the segment.
- Optional fields:
  - answer: Required if the document form is 'qa_model'.
  - keywords: Optional, can be provided to update the segment's keywords.
  - enabled: Optional, can be provided to enable or disable the segment.
Optional Nature of keywords:
- The keywords field is optional. If provided, it will update the segment's keywords.
- If keywords are provided, the segment index will be updated accordingly.
Behavior of the enabled Property:
- The enabled property can be used to enable or disable a segment.
- If the segment is currently enabled and the enabled property is set to False, the segment will be disabled, and the disabled_at and disabled_by fields will be updated.
- If the segment is currently disabled, it cannot be updated unless the enabled property is set to True.
- If the segment is disabled and the enabled property is not provided or set to False, an error will be raised.

Here is the relevant code snippet for updating a segment:

@classmethod
def update_segment(cls, args: dict, segment: DocumentSegment, document: Document, dataset: Dataset):
    indexing_cache_key = 'segment_{}_indexing'.format(segment.id)
    cache_result = redis_client.get(indexing_cache_key)
    if cache_result is not None:
        raise ValueError("Segment is indexing, please try again later")
    if 'enabled' in args and args['enabled'] is not None:
        action = args['enabled']
        if segment.enabled != action:
            if not action:
                segment.enabled = action
                segment.disabled_at = datetime.datetime.now(datetime.timezone.utc).replace(tzinfo=None)
                segment.disabled_by = current_user.id
                db.session.add(segment)
                db.session.commit()
                # Set cache to prevent indexing the same segment multiple times
                redis_client.setex(indexing_cache_key, 600, 1)
                disable_segment_from_index_task.delay(segment.id)
                return segment
    if not segment.enabled:
        if 'enabled' in args and args['enabled'] is not None:
            if not args['enabled']:
                raise ValueError("Can't update disabled segment")
        else:
            raise ValueError("Can't update disabled segment")
    try:
        content = args['content']
        if segment.content == content:
            if document.doc_form == 'qa_model':
                segment.answer = args['answer']
            if args.get('keywords'):
                segment.keywords = args['keywords']
            segment.enabled = True
            segment.disabled_at = None
            segment.disabled_by = None
            db.session.add(segment)
            db.session.commit()
            # update segment index task
            if args['keywords']:
                keyword = Keyword(dataset)
                keyword.delete_by_ids([segment.index_node_id])
                document = RAGDocument(
                    page_content=segment.content,
                    metadata={
                        "doc_id": segment.index_node_id,
                        "doc_hash": segment.index_node_hash,
                        "document_id": segment.document_id,
                        "dataset_id": segment.dataset_id,
                    }
                )
                keyword.add_texts([document], keywords_list=[args['keywords']])
        else:
            segment_hash = helper.generate_text_hash(content)
            tokens = 0
            if dataset.indexing_technique == 'high_quality':
                model_manager = ModelManager()
                embedding_model = model_manager.get_model_instance(
                    tenant_id=current_user.current_tenant_id,
                    provider=dataset.embedding_model_provider,
                    model_type=ModelType.TEXT_EMBEDDING,
                    model=dataset.embedding_model
                )

Request Body

{
  "segment": {
    "content": "1",
    "answer": "1",
    "keywords": ["a"],
    "enabled": false
  }
}

Response

{
  "data": [{
    "id": "",
    "position": 1,
    "document_id": "",
    "content": "1",
    "answer": "1",
    "word_count": 25,
    "tokens": 0,
    "keywords": [
        "a"
    ],
    "index_node_id": "",
    "index_node_hash": "",
    "hit_count": 0,
    "enabled": true,
    "disabled_at": null,
    "disabled_by": null,
    "status": "completed",
    "created_by": "",
    "created_at": 1695312007,
    "indexing_at": 1695312007,
    "completed_at": 1695312007,
    "error": null,
    "stopped_at": null
  }],
  "doc_form": "text_model"
}

This should provide the necessary clarification and update the documentation to reflect the correct behavior for updating segments via the API ^[1]^[2]^[3].

To continue talking to Dosu, mention @dosu.

JoJoJotarou commented 2 weeks ago

Yes, I have same question. keywords must exist. The same is true on the add segment api.

jos-bek commented 6 days ago

I also wonder whether it makes sense to separate these two things. So one endpoint to disable and enable segments and another to update keywords and content. Because in the code the segment is also returned if it is to be enabed or disabled without updating the content and keywords. And the issue with the keywords is this check if args['keywords']:. To make sure that the keywords key exists better be checked like before args.get('keywords'), otherwise it will throw the current error if keywords is not defined.

jos-bek commented 5 days ago

@dosu you could edit the label to bug, due to the error caused by the check if args['keywords']: in the function update_segment, which throws an error if keywords is undefined.

dosubot[bot] commented 5 days ago

To address the issues and suggestions raised:

Handling keywords Field:
- The check if args['keywords']: should be replaced with if 'keywords' in args: to ensure that the keywords key exists in the request body, preventing errors when keywords is not defined.
Separating Endpoints:
- It makes sense to separate the functionality into two endpoints: one for enabling/disabling segments and another for updating keywords and content. This will simplify the API and make it more intuitive.

Updated `update_segment` Function

Here is the updated update_segment function:

@classmethod
def update_segment(cls, args: dict, segment: DocumentSegment, document: Document, dataset: Dataset):
    indexing_cache_key = 'segment_{}_indexing'.format(segment.id)
    cache_result = redis_client.get(indexing_cache_key)
    if cache_result is not None:
        raise ValueError("Segment is indexing, please try again later")

    if 'enabled' in args and args['enabled'] is not None:
        action = args['enabled']
        if segment.enabled != action:
            if not action:
                segment.enabled = action
                segment.disabled_at = datetime.datetime.now(datetime.timezone.utc).replace(tzinfo=None)
                segment.disabled_by = current_user.id
                db.session.add(segment)
                db.session.commit()
                # Set cache to prevent indexing the same segment multiple times
                redis_client.setex(indexing_cache_key, 600, 1)
                disable_segment_from_index_task.delay(segment.id)
                return segment

    if not segment.enabled:
        if 'enabled' in args and args['enabled'] is not None:
            if not args['enabled']:
                raise ValueError("Can't update disabled segment")
        else:
            raise ValueError("Can't update disabled segment")

    try:
        content = args['content']
        if segment.content == content:
            if document.doc_form == 'qa_model':
                segment.answer = args['answer']
            if 'keywords' in args:
                segment.keywords = args['keywords']
            segment.enabled = True
            segment.disabled_at = None
            segment.disabled_by = None
            db.session.add(segment)
            db.session.commit()
            # update segment index task
            if 'keywords' in args:
                keyword = Keyword(dataset)
                keyword.delete_by_ids([segment.index_node_id])
                document = RAGDocument(
                    page_content=segment.content,
                    metadata={
                        "doc_id": segment.index_node_id,
                        "doc_hash": segment.index_node_hash,
                        "document_id": segment.document_id,
                        "dataset_id": segment.dataset_id,
                    }
                )
                keyword.add_texts([document], keywords_list=[args['keywords']])
        else:
            segment_hash = helper.generate_text_hash(content)
            tokens = 0
            if dataset.indexing_technique == 'high_quality':
                model_manager = ModelManager()
                embedding_model = model_manager.get_model_instance(
                    tenant_id=current_user.current_tenant_id,
                    provider=dataset.embedding_model_provider,
                    model_type=ModelType.TEXT_EMBEDDING,
                    model=dataset.embedding_model
                )
                # calc embedding use tokens
                tokens = embedding_model.get_text_embedding_num_tokens(
                    texts=[content]
                )
            segment.content = content
            segment.index_node_hash = segment_hash
            segment.word_count = len(content)
            segment.tokens = tokens
            segment.status = 'completed'
            segment.indexing_at = datetime.datetime.now(datetime.timezone.utc).replace(tzinfo=None)
            segment.completed_at = datetime.datetime.now(datetime.timezone.utc).replace(tzinfo=None)
            segment.updated_by = current_user.id
            segment.updated_at = datetime.datetime.now(datetime.timezone.utc).replace(tzinfo=None)
            segment.enabled = True
            segment.disabled_at = None
            segment.disabled_by = None
            if document.doc_form == 'qa_model':
                segment.answer = args['answer']
            db.session.add(segment)
            db.session.commit()
            # update segment vector index
            if 'keywords' in args:
                VectorService.update_segment_vector(args['keywords'], segment, dataset)

    except Exception as e:
        logging.exception("update segment index failed")
        segment.enabled = False
        segment.disabled_at = datetime.datetime.now(datetime.timezone.utc).replace(tzinfo=None)
        segment.status = 'error'
        segment.error = str(e)
        db.session.commit()
    segment = db.session.query(DocumentSegment).filter(DocumentSegment.id == segment.id).first()
    return segment

Separate Endpoints

Endpoint for Enabling/Disabling Segments:

@app.route('/datasets/<uuid:dataset_id>/segments/<uuid:segment_id>/<string:action>', methods=['PATCH'])
def enable_disable_segment(dataset_id, segment_id, action):
    dataset = DatasetService.get_dataset(str(dataset_id))
    if not dataset:
        return jsonify({"error": "Dataset not found"}), 404
    segment = DocumentSegment.query.filter(
        DocumentSegment.id == str(segment_id),
        DocumentSegment.tenant_id == current_user.current_tenant_id
    ).first()
    if not segment:
        return jsonify({"error": "Segment not found"}), 404
    if action not in ["enable", "disable"]:
        return jsonify({"error": "Invalid action"}), 400
    updated_segment = SegmentService.update_segment({"enabled": action == "enable"}, segment, None, dataset)
    return jsonify(updated_segment.to_dict())

Endpoint for Updating Keywords and Content:

@app.route('/datasets/<uuid:dataset_id>/documents/<uuid:document_id>/segments/<uuid:segment_id>', methods=['PATCH'])
def update_segment(dataset_id, document_id, segment_id):
    args = request.json
    dataset = DatasetService.get_dataset(str(dataset_id))
    if not dataset:
        return jsonify({"error": "Dataset not found"}), 404
    document = DocumentService.get_document(str(dataset_id), str(document_id))
    if not document:
        return jsonify({"error": "Document not found"}), 404
    segment = DocumentSegment.query.filter(
        DocumentSegment.id == str(segment_id),
        DocumentSegment.tenant_id == current_user.current_tenant_id
    ).first()
    if not segment:
        return jsonify({"error": "Segment not found"}), 404
    updated_segment = SegmentService.update_segment(args, segment, document, dataset)
    return jsonify(updated_segment.to_dict())

These changes should address the issues with the keywords field and separate the functionality into two distinct endpoints ^[1]^[2].

To continue talking to Dosu, mention @dosu.

langgenius / dify