langgenius / dify

Dify is an open-source LLM app development platform. Dify's intuitive interface combines AI workflow, RAG pipeline, agent capabilities, model management, observability features and more, letting you quickly go from prototype to production.
https://dify.ai

Image OCR index using vision model #5899

Closed KwanFaiLau closed 1 month ago

KwanFaiLau commented 3 months ago

Self Checks

Dify version

0.6.12fix1

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

PDF documents may have images embedded inside them; Dify can't parse those images with a vision model to extract their text and index it into the vector database.
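A minimal sketch (not Dify code) of the kind of pipeline this request implies: pull the embedded images out of a PDF, have a vision model transcribe them, and collect the text for indexing. vision_model_ocr here is a hypothetical stand-in for whatever vision-capable LLM call the platform would make.

import fitz  # PyMuPDF

def extract_image_text(pdf_path: str, vision_model_ocr) -> list[str]:
    """Collect vision-model transcriptions of every image embedded in a PDF."""
    texts = []
    doc = fitz.open(pdf_path)
    for page in doc:
        for img in page.get_images(full=True):
            xref = img[0]  # cross-reference id of the embedded image
            image_bytes = doc.extract_image(xref)["image"]
            texts.append(vision_model_ocr(image_bytes))  # vision model -> text
    return texts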

✔️ Expected Behavior

No response

❌ Actual Behavior

No response

dosubot[bot] commented 3 months ago

@KwanFaiLau Hello there! I'm here to help you with bugs, questions, and becoming a contributor. Let me know how I can assist you while you wait for a human maintainer.

To enable the LLM to recognize images embedded in PDF documents using the vision model in Dify version 0.6.12fix1, self-hosted with Docker, you need to ensure that the vision property of the LLMNodeData class has enabled set to true. Here is the relevant code snippet from the api/core/workflow/nodes/llm/entities.py file:

from typing import Any, Literal, Optional, Union

from pydantic import BaseModel

from core.prompt.entities.advanced_prompt_entities import ChatModelMessage, CompletionModelPromptTemplate, MemoryConfig
from core.workflow.entities.base_node_data_entities import BaseNodeData
from core.workflow.entities.variable_entities import VariableSelector

class ModelConfig(BaseModel):
    """
    Model Config.
    """
    provider: str
    name: str
    mode: str
    completion_params: dict[str, Any] = {}

class ContextConfig(BaseModel):
    """
    Context Config.
    """
    enabled: bool
    variable_selector: Optional[list[str]] = None

class VisionConfig(BaseModel):
    """
    Vision Config.
    """
    class Configs(BaseModel):
        """
        Configs.
        """
        detail: Literal['low', 'high']

    enabled: bool
    configs: Optional[Configs] = None

class PromptConfig(BaseModel):
    """
    Prompt Config.
    """
    jinja2_variables: Optional[list[VariableSelector]] = None

class LLMNodeChatModelMessage(ChatModelMessage):
    """
    LLM Node Chat Model Message.
    """
    jinja2_text: Optional[str] = None

class LLMNodeCompletionModelPromptTemplate(CompletionModelPromptTemplate):
    """
    LLM Node Chat Model Prompt Template.
    """
    jinja2_text: Optional[str] = None

class LLMNodeData(BaseNodeData):
    """
    LLM Node Data.
    """
    model: ModelConfig
    prompt_template: Union[list[LLMNodeChatModelMessage], LLMNodeCompletionModelPromptTemplate]
    prompt_config: Optional[PromptConfig] = None
    memory: Optional[MemoryConfig] = None
    context: ContextConfig
    vision: VisionConfig

Make sure that the vision object has enabled set to true:

vision: {
  enabled: true,
  configs: {
    detail: 'high'  // or 'low', depending on your requirement
  }
}

This configuration will enable the LLM to recognize images generated within the workflow [1].
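For orientation, here is a minimal sketch (assuming nothing beyond the pydantic models quoted above) of how that vision block is parsed into the typed config when the node data is loaded:

from typing import Literal, Optional

from pydantic import BaseModel

class VisionConfig(BaseModel):
    class Configs(BaseModel):
        detail: Literal['low', 'high']

    enabled: bool
    configs: Optional[Configs] = None

# Node data arrives from the workflow editor as plain JSON/dict values;
# pydantic coerces the nested dict into the Configs model.
vision = VisionConfig(enabled=True, configs={'detail': 'high'})
assert vision.enabled and vision.configs.detail == 'high'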

To continue talking to Dosu, mention @dosu.

KwanFaiLau commented 3 months ago

Where is the vision object? Can you tell me the file path?

dosubot[bot] commented 3 months ago

The vision object is defined on the LLMNodeData class in the api/core/workflow/nodes/llm/entities.py file. The relevant file path is:

api/core/workflow/nodes/llm/entities.py

Look for the LLMNodeData class in this file, and you will see the vision property defined there.

To continue talking to Dosu, mention @dosu.

KwanFaiLau commented 3 months ago

When I modify

class VisionConfig(BaseModel):
    """
    Vision Config.
    """
    class Configs(BaseModel):
        """
        Configs.
        """
        detail: Literal['low', 'high']

    enabled: bool
    configs: Optional[Configs] = None

to

class VisionConfig(BaseModel):
    """
    Vision Config.
    """
    class Configs(BaseModel):
        """
        Configs.
        """
        detail: Literal['low', 'high']

    enabled: true
    configs: Optional[Configs] = None

the docker "docker-api-1" can not start .
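A note on why that edit breaks startup: true is not a name Python knows (the boolean literal is True), so importing entities.py raises a NameError and the API process exits before it can serve, which is why docker-api-1 never comes up. If the goal were to default vision to on at the model level, a sketch of the change would keep the bool type annotation and change only the default value (illustrative only; the supported route is enabling vision in the node's configuration rather than editing the class):

from typing import Literal, Optional

from pydantic import BaseModel

class VisionConfig(BaseModel):
    """
    Vision Config.
    """
    class Configs(BaseModel):
        """
        Configs.
        """
        detail: Literal['low', 'high']

    enabled: bool = True  # keep the bool annotation; only the default value changes
    configs: Optional[Configs] = None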