langgenius / dify

Dify is an open-source LLM app development platform. Dify's intuitive interface combines AI workflow, RAG pipeline, agent capabilities, model management, observability features and more, letting you quickly go from prototype to production.
https://dify.ai

Using Vision model "minicpm-v:8b-2.6-q4_K_M" with ollama, chat message raised error message #8777

Open HorseLuke opened 2 hours ago

HorseLuke commented 2 hours ago

Self Checks

Dify version

0.8.3

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

  1. Deploy ollama 0.3.12
  2. Pull Vision model "minicpm-v:8b-2.6-q4_K_M"
    ollama run minicpm-v:8b-2.6-q4_K_M
  3. In Dify 0.8.3, add a new provider and register this vision model using the Ollama definition.
  4. Create a blank app using this vision model.
  5. On the "Studio" -> "Orchestrate" page, set Vision -> Resolution to Low.
  6. Upload an image, then chat in "Debug and Preview".

✔️ Expected Behavior

The chat output streams normally from start to finish.

❌ Actual Behavior

The chat output is normal at first, but then it stops and the output text is cleared, and an error is raised at the top right of the page, saying:

Error: Input [TextPromptMessageContent(type=<PromptMessageContentType.TEXT: 'text'>, data='图片表述什么内容?'), ImagePromptMessageContent(type=<PromptMessageContentType.IMAGE: 'image'>, data='data:image/jpeg;base64,/9j/4AAQSkZJRgA...[ignore]...', detail=<DETAIL.LOW: 'low'>)] is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers.

Attached are the docker logs captured after setting DEBUG=true:

dify-ollama-error-minicpm-v.zip
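For triage context, my read (unconfirmed) is that this error wording looks like a generic input-type check on a text-only code path, i.e. the mixed text/image content list is reaching code that only accepts plain strings or token lists. A minimal sketch of that kind of check, with hypothetical names (not Dify code):

```python
# Purely illustrative: the kind of input-type validation that produces
# the reported error message. Hypothetical names, not actual Dify code.
def _is_valid_text_input(t):
    if isinstance(t, str):
        return True
    if isinstance(t, (list, tuple)):
        # a list/tuple must be all strings or all integers (token ids)
        return all(isinstance(x, str) for x in t) or all(isinstance(x, int) for x in t)
    return False

def count_tokens(t):
    if not _is_valid_text_input(t):
        raise ValueError(
            f"Input {t!r} is not valid. Should be a string, "
            "a list/tuple of strings or a list/tuple of integers."
        )
    return len(t.split()) if isinstance(t, str) else len(t)
```

Passing a list of message-content objects (rather than a flattened string) to a check like this would raise exactly the error shown above.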

dosubot[bot] commented 2 hours ago

Hey @HorseLuke! 👋 I'm here to help you with bug fixes, answer your questions, and guide you to become a contributor. Let's squash those bugs together!

I found two similar discussions that might be relevant to your issue:

  1. Adding a local ollama model error which was last updated on 2024-09-12.
  2. Adding a local ollama model error which was last updated on 2024-09-12.

To resolve the error "Input [TextPromptMessageContent(type=<PromptMessageContentType.TEXT: 'text'>, data='图片表述什么内容?'), ImagePromptMessageContent(type=<PromptMessageContentType.IMAGE: 'image'>, data='data:image/jpeg;base64,/9j/4AAQSkZJRgA...[ignore]...', detail=<DETAIL.LOW: 'low'>)] is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers" when using the Vision model "minicpm-v:8b-2.6-q4_K_M" with ollama in dify 0.8.3, you need to ensure that the prompt_messages are correctly formatted.

The _generate method in the OllamaLargeLanguageModel class expects the prompt_messages to be converted into a dictionary format that the API can understand. Specifically, the content of UserPromptMessage should be a string or a list/tuple of strings or integers.

Here is the relevant part of the code that handles this conversion:

# Excerpted from Dify's Ollama model runtime; entity-class imports omitted.
import re
from typing import cast

def _convert_prompt_message_to_dict(self, message: PromptMessage) -> dict:
    """
    Convert PromptMessage to dict for Ollama API
    """
    if isinstance(message, UserPromptMessage):
        message = cast(UserPromptMessage, message)
        if isinstance(message.content, str):
            message_dict = {"role": "user", "content": message.content}
        else:
            text = ""
            images = []
            for message_content in message.content:
                if message_content.type == PromptMessageContentType.TEXT:
                    message_content = cast(TextPromptMessageContent, message_content)
                    text = message_content.data
                elif message_content.type == PromptMessageContentType.IMAGE:
                    message_content = cast(ImagePromptMessageContent, message_content)
                    image_data = re.sub(r"^data:image\/[a-zA-Z]+;base64,", "", message_content.data)
                    images.append(image_data)
            message_dict = {"role": "user", "content": text, "images": images}
    elif isinstance(message, AssistantPromptMessage):
        message = cast(AssistantPromptMessage, message)
        message_dict = {"role": "assistant", "content": message.content}
    elif isinstance(message, SystemPromptMessage):
        message = cast(SystemPromptMessage, message)
        message_dict = {"role": "system", "content": message.content}
    else:
        raise ValueError(f"Got unknown type {message}")

    return message_dict

Ensure that your prompt_messages are structured correctly before invoking the model. For example:

prompt_messages = [
    UserPromptMessage(content=[
        TextPromptMessageContent(type=PromptMessageContentType.TEXT, data='图片表述什么内容?'),
        ImagePromptMessageContent(type=PromptMessageContentType.IMAGE, data='data:image/jpeg;base64,/9j/4AAQSkZJRgA...')
    ])
]

This will be converted to:

{
    "role": "user",
    "content": "图片表述什么内容?",
    "images": ["base64_encoded_image_data"]
}

Make sure the content field is a string or a list/tuple of strings or integers, and that the images field contains the base64-encoded image data [1][2][3].
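If you are building prompt_messages yourself, a defensive flattening step like the following (a sketch with hypothetical names, not part of Dify) keeps the raw content list from leaking downstream in the wrong shape:

```python
import re

def flatten_user_content(content):
    """Flatten mixed text/image content into the (text, images) shape
    the Ollama chat API expects. Hypothetical helper, not Dify code."""
    if isinstance(content, str):
        return content, []
    text_parts, images = [], []
    for part in content:
        kind = getattr(part, "type", None)
        # `type` may be an enum (with .value) or a plain string
        if getattr(kind, "value", kind) == "image":
            # strip the data-URI prefix, keeping only the base64 payload
            images.append(re.sub(r"^data:image\/[a-zA-Z]+;base64,", "", part.data))
        else:
            text_parts.append(part.data)
    return " ".join(text_parts), images
```

The returned text string goes into "content" and the list of base64 payloads into "images", matching the dict shown above.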

To continue talking to Dosu, mention @dosu.

HorseLuke commented 2 hours ago

@dosu Based on your context, I cannot find anything on the "Orchestrate" page that changes the content field.

However, if I add a sentence in the "Instructions" textarea, the error disappears. The "Instructions" textarea is described with this text:

Prompts guide AI responses with instructions and constraints. Insert variables like {{input}}. This prompt won't be visible to users.

How is the "Instructions" textarea used in the content field? Is this a bug that occurs when the "Instructions" textarea is empty?
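To illustrate my guess (purely hypothetical Python, not Dify's actual logic): if flattening to a plain string only happens on the branch that prepends a non-empty instruction, an empty instruction would pass the raw content list through unchanged:

```python
# Hypothetical illustration of the suspected failure mode -- NOT Dify code.
def build_prompt(instructions, user_parts):
    if instructions:
        # non-empty instruction: everything is joined into one string
        return instructions + "\n" + " ".join(user_parts)
    # empty instruction: content passed through unchanged (still a list),
    # which would later trip a string-only type check downstream
    return user_parts
```

That would explain why typing any sentence into "Instructions" makes the error disappear.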

dosubot[bot] commented 2 hours ago

The "Instructions" textarea in the content field is used to input and display the opening statement. It is not a bug when the "Instructions" textarea is empty. The code handles both cases where the textarea has content and where it is empty, providing appropriate placeholders and UI elements for each scenario.

Here is the relevant code snippet that demonstrates this:

const value = openingStatement?.opening_statement || '';
const hasValue = !!(value || '').trim();
...
<div className='text-gray-700 text-sm'>
  {(hasValue || (!hasValue && isFocus)) ? (
    <>
      {isFocus
        ? (
          <div>
            <textarea
              ref={inputRef}
              value={tempValue}
              rows={3}
              onChange={e => setTempValue(e.target.value)}
              className="w-full px-0 text-sm  border-0 bg-transparent focus:outline-none "
              placeholder={t('appDebug.openingStatement.placeholder') as string}
            >
            </textarea>
          </div>
        )
        : (
          <div dangerouslySetInnerHTML={{
            __html: coloredContent,
          }}></div>
        )}
      {renderQuestions()}
    </>) : (
    <div className='pt-2 pb-1 text-xs text-gray-500'>{t('appDebug.openingStatement.noDataPlaceHolder')}</div>
  )}
</div>

The code checks if there is a value in the opening_statement and handles the UI accordingly, showing a placeholder when the textarea is empty [1].

Additionally, in the SegmentDetailComponent from the Dify platform, the AutoHeightTextarea component is used to handle text input fields, including scenarios where the content might be empty. This further confirms that the application is designed to handle empty textareas gracefully [2].

To continue talking to Dosu, mention @dosu.