langgenius / dify

Dify is an open-source LLM app development platform. Dify's intuitive interface combines AI workflow, RAG pipeline, agent capabilities, model management, observability features and more, letting you quickly go from prototype to production.
https://dify.ai

How to Recognize Images Generated Within the Workflow Using V&L Model? #4937

Closed: XiaoLey closed this issue 3 days ago

XiaoLey commented 1 month ago

Dify version

0.6.9

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

In a workflow, a vision-capable LLM can recognize images uploaded by users. But what should I do when an image is generated within the workflow itself and needs to be recognized? For example, I used the SearXNG tool to fetch an image from the internet and then wanted the LLM to recognize it, but the LLM could not access this image no matter what I tried.

✔️ Expected Behavior

No response

❌ Actual Behavior

No response

dosubot[bot] commented 1 month ago

To enable the LLM to recognize images generated within the workflow in Dify 0.6.9 (self-hosted with Docker), make sure the vision property of the LLMNodeData class has enabled set to true. Here is the relevant snippet from api/core/workflow/nodes/llm/entities.py:

from typing import Any, Literal, Optional, Union

from pydantic import BaseModel

from core.prompt.entities.advanced_prompt_entities import ChatModelMessage, CompletionModelPromptTemplate, MemoryConfig
from core.workflow.entities.base_node_data_entities import BaseNodeData
from core.workflow.entities.variable_entities import VariableSelector

class ModelConfig(BaseModel):
    """
    Model Config.
    """
    provider: str
    name: str
    mode: str
    completion_params: dict[str, Any] = {}

class ContextConfig(BaseModel):
    """
    Context Config.
    """
    enabled: bool
    variable_selector: Optional[list[str]] = None

class VisionConfig(BaseModel):
    """
    Vision Config.
    """
    class Configs(BaseModel):
        """
        Configs.
        """
        detail: Literal['low', 'high']

    enabled: bool
    configs: Optional[Configs] = None

class PromptConfig(BaseModel):
    """
    Prompt Config.
    """
    jinja2_variables: Optional[list[VariableSelector]] = None

class LLMNodeChatModelMessage(ChatModelMessage):
    """
    LLM Node Chat Model Message.
    """
    jinja2_text: Optional[str] = None

class LLMNodeCompletionModelPromptTemplate(CompletionModelPromptTemplate):
    """
    LLM Node Completion Model Prompt Template.
    """
    jinja2_text: Optional[str] = None

class LLMNodeData(BaseNodeData):
    """
    LLM Node Data.
    """
    model: ModelConfig
    prompt_template: Union[list[LLMNodeChatModelMessage], LLMNodeCompletionModelPromptTemplate]
    prompt_config: Optional[PromptConfig] = None
    memory: Optional[MemoryConfig] = None
    context: ContextConfig
    vision: VisionConfig

Make sure that the vision object has enabled set to true:

vision:
  enabled: true
  configs:
    detail: high  # or 'low', depending on your requirement
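
For illustration, here is how that fragment maps onto the VisionConfig model shown above. This is a minimal standalone sketch: the VisionConfig definition is copied from the snippet, and the sample values are assumptions for demonstration only.

from typing import Literal, Optional

from pydantic import BaseModel

class VisionConfig(BaseModel):
    """Copied from the entities.py snippet above."""
    class Configs(BaseModel):
        detail: Literal['low', 'high']

    enabled: bool
    configs: Optional[Configs] = None

# Nested dicts are coerced into the model when the node data is parsed,
# so a config like the YAML fragment above validates directly:
vision = VisionConfig(enabled=True, configs={'detail': 'high'})
print(vision.enabled)         # True
print(vision.configs.detail)  # high

# enabled is required; configs may be omitted when vision is disabled.
VisionConfig(enabled=False)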

This configuration will enable the LLM to recognize images generated within the workflow [1].

To continue talking to Dosu, mention @dosu.