langgenius / dify

Dify is an open-source LLM app development platform. Dify's intuitive interface combines AI workflow, RAG pipeline, agent capabilities, model management, observability features and more, letting you quickly go from prototype to production.
https://dify.ai

How to Recognize Images Generated Within the Workflow Using V&L Model? #4937

Closed: XiaoLey closed this issue 3 days ago

XiaoLey commented 1 month ago

Dify version

0.6.9

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

In a workflow, a vision-capable LLM can recognize images uploaded by users. But what should I do when an image is generated within the workflow itself and needs to be recognized? For example, I used the SearXNG tool to fetch an image from the internet and then wanted the LLM to recognize it, but the LLM could not access this image no matter what I tried.

✔️ Expected Behavior

No response

❌ Actual Behavior

No response

dosubot[bot] commented 1 month ago

To enable the LLM to recognize images generated within the workflow in Dify 0.6.9 (self-hosted with Docker), make sure the vision property of the LLMNodeData class has enabled set to true. Here is the relevant snippet from api/core/workflow/nodes/llm/entities.py:

from typing import Any, Literal, Optional, Union

from pydantic import BaseModel

from core.prompt.entities.advanced_prompt_entities import ChatModelMessage, CompletionModelPromptTemplate, MemoryConfig
from core.workflow.entities.base_node_data_entities import BaseNodeData
from core.workflow.entities.variable_entities import VariableSelector

class ModelConfig(BaseModel):
    """
    Model Config.
    """
    provider: str
    name: str
    mode: str
    completion_params: dict[str, Any] = {}

class ContextConfig(BaseModel):
    """
    Context Config.
    """
    enabled: bool
    variable_selector: Optional[list[str]] = None

class VisionConfig(BaseModel):
    """
    Vision Config.
    """
    class Configs(BaseModel):
        """
        Configs.
        """
        detail: Literal['low', 'high']

    enabled: bool
    configs: Optional[Configs] = None

class PromptConfig(BaseModel):
    """
    Prompt Config.
    """
    jinja2_variables: Optional[list[VariableSelector]] = None

class LLMNodeChatModelMessage(ChatModelMessage):
    """
    LLM Node Chat Model Message.
    """
    jinja2_text: Optional[str] = None

class LLMNodeCompletionModelPromptTemplate(CompletionModelPromptTemplate):
    """
    LLM Node Completion Model Prompt Template.
    """
    jinja2_text: Optional[str] = None

class LLMNodeData(BaseNodeData):
    """
    LLM Node Data.
    """
    model: ModelConfig
    prompt_template: Union[list[LLMNodeChatModelMessage], LLMNodeCompletionModelPromptTemplate]
    prompt_config: Optional[PromptConfig] = None
    memory: Optional[MemoryConfig] = None
    context: ContextConfig
    vision: VisionConfig

Make sure that the vision object has enabled set to true:

vision:
  enabled: true
  configs:
    detail: high  # or 'low', depending on your requirement
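
For illustration, here is how that fragment maps onto the VisionConfig model shown above. This is a minimal standalone sketch: the VisionConfig definition is copied from the snippet, and the sample values are assumptions for demonstration only.

from typing import Literal, Optional

from pydantic import BaseModel

class VisionConfig(BaseModel):
    """Copied from the entities.py snippet above."""
    class Configs(BaseModel):
        detail: Literal['low', 'high']

    enabled: bool
    configs: Optional[Configs] = None

# Nested dicts are coerced into the model when the node data is parsed,
# so a config like the YAML fragment above validates directly:
vision = VisionConfig(enabled=True, configs={'detail': 'high'})
print(vision.enabled)         # True
print(vision.configs.detail)  # high

# enabled is required; configs may be omitted when vision is disabled.
VisionConfig(enabled=False)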

This configuration will enable the LLM to recognize images generated within the workflow [1].

To continue talking to Dosu, mention @dosu.