langchain-ai / langchainjs

🦜🔗 Build context-aware reasoning applications 🦜🔗
https://js.langchain.com/docs/
MIT License

ConversationChain does not support the GPT-4 Turbo with Vision model (gpt-4-turbo-2024-04-09) with image inputs #5326

Open WangBiaoxuan opened 2 months ago

WangBiaoxuan commented 2 months ago

Checked other resources

Example Code

import { ConversationChain } from 'langchain/chains';
import { HumanMessage } from '@langchain/core/messages';

const chain = new ConversationChain({
    // this.model is a GPT-4 Turbo with Vision model (gpt-4-turbo-2024-04-09), so I want to pass an image to it
    llm: this.model,
    memory: this.memory,
});

const input = new HumanMessage({
    content: [
        {
            type: 'text',
            text: 'Describe this image',
        },
        {
            type: 'image_url',
            image_url: {
                url: 'https://www.freecodecamp.org/news/content/images/2023/05/Screenshot-2023-05-29-at-5.40.38-PM.png',
                detail: 'low',
            },
        },
    ],
});

await chain.call({ input });

Error Message and Stack Trace (if applicable)

No response

Description


When executing the example code above, an error is reported.

Looking at the langchainjs source, I found that the chain's input only supports a string, but the input in the code above is an array of content parts. The default prompt is:

export const DEFAULT_TEMPLATE = `The following is a friendly conversation between a human and an AI. The AI is talkative and provides lots of specific details from its context. If the AI does not know the answer to a question, it truthfully says it does not know.

Current conversation:
{history}
Human: {input}
AI:`;
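
For contrast, here is a minimal sketch of the string-only usage this default prompt is designed around (assuming a ChatOpenAI model and BufferMemory; these names are illustrative, not part of the original report):

import { ChatOpenAI } from "@langchain/openai";
import { ConversationChain } from "langchain/chains";
import { BufferMemory } from "langchain/memory";

// The default template interpolates {input} as plain text, so a string works.
const model = new ChatOpenAI({ modelName: "gpt-4-turbo" });
const chain = new ConversationChain({ llm: model, memory: new BufferMemory() });

const res = await chain.call({ input: "Hello, how are you?" });
console.log(res.response);
// Passing a HumanMessage whose content is an array of text and image_url parts
// cannot be interpolated into the {input} placeholder the same way.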

System Info

Platform: macOS
Node: v20.11.1

dosubot[bot] commented 2 months ago

To address the issue of passing an image along with text to the GPT-4 Turbo with Vision model using LangChain.js, you'll need to ensure your input is correctly formatted. The model expects a MessageContentComplex array that includes both text and image data. Here's a streamlined approach to achieve this:

  1. Prepare the Image and Text Data: If you're using a base64-encoded image, ensure it's correctly prefixed (e.g., data:image/png;base64, for PNG images). For online images, directly use the URL.

  2. Construct the Input Array: Combine your text and image data into a single array. For a base64-encoded image, the structure would look like this:

    const messageContent = [
      {
        type: "text",
        text: "What is in this image?",
      },
      {
        type: "image_url",
        image_url: `data:image/png;base64,${your_base64_encoded_image_data}`,
      },
    ];
  3. Invoke the Model: Use the prepared input to call the model. Ensure your model instance is set up for vision capabilities.

    import { ChatOpenAI } from "@langchain/openai";
    import { HumanMessage } from "@langchain/core/messages";

    const model = new ChatOpenAI({
      modelName: "gpt-4-turbo",
    });

    const response = await model.invoke([
      new HumanMessage({ content: messageContent }),
    ]);
    console.log("Response:", response);

For converting images to base64 in Node.js, you can use the fs module:

import fs from "node:fs/promises";

// Read the image, encode it as base64, and add the data URL prefix from step 1.
const image = await fs.readFile("path_to_your_image.jpg");
const base64Image = image.toString("base64");
const imageDataUrl = `data:image/jpeg;base64,${base64Image}`;

This method allows you to leverage the multimodal capabilities of the GPT-4 Turbo with Vision model within LangChain.js effectively. For further details, refer to the LangChain.js documentation and examples.

#### Details

**Sources**
- [Support for multiple inputs in memory](https://github.com/langchain-ai/langchainjs/issues/391)
- [examples/src/models/chat/integration_openai_vision.ts](https://github.com/langchain-ai/langchainjs/blob/main/examples/src/models/chat/integration_openai_vision.ts)
- [cookbook/openai_vision_multimodal.ipynb](https://github.com/langchain-ai/langchainjs/blob/main/cookbook/openai_vision_multimodal.ipynb)
- [libs/langchain-google-gauth/src/tests/llms.int.test.ts](https://github.com/langchain-ai/langchainjs/blob/main/libs/langchain-google-gauth/src/tests/llms.int.test.ts)
- [libs/langchain-google-webauth/src/tests/llms.int.test.ts](https://github.com/langchain-ai/langchainjs/blob/main/libs/langchain-google-webauth/src/tests/llms.int.test.ts)
- [examples/src/prompts/multi_modal_inline.ts](https://github.com/langchain-ai/langchainjs/blob/main/examples/src/prompts/multi_modal_inline.ts)
- [libs/langchain-openai/src/tests/chat_models-vision.int.test.ts](https://github.com/langchain-ai/langchainjs/blob/main/libs/langchain-openai/src/tests/chat_models-vision.int.test.ts)
- [libs/langchain-azure-openai/src/tests/chat_models-vision.int.test.ts](https://github.com/langchain-ai/langchainjs/blob/main/libs/langchain-azure-openai/src/tests/chat_models-vision.int.test.ts)
**About Dosu**
This response is meant to be useful and save you time. It is **not meant to be a precise solution**, but rather a starting point for your own research. Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

To continue the conversation, mention @dosu.

WangBiaoxuan commented 1 month ago

@dosu your example does not use ConversationChain; I want to use ConversationChain.

dosubot[bot] commented 1 month ago

To effectively use ConversationChain with GPT-4 Turbo with Vision model in LangChain.js for handling both text and image inputs, it's important to understand that ConversationChain is primarily designed for text interactions and does not natively support complex objects or arrays, including those containing image data.

Since ConversationChain expects string inputs, a workaround is to process the image and text inputs separately and then integrate the responses into your conversation flow. Here's a conceptual approach to achieve this:

  1. Process Image and Text Separately: Use a model instance that supports vision capabilities to process the image. This step involves sending the image to the model and receiving a text description or relevant output.

  2. Integrate Model Responses into ConversationChain: Once you have the text output from the vision model, you can incorporate this text into your ConversationChain as part of the conversation. This means you'll manually handle the image processing part outside of ConversationChain and then use the resulting text for the conversation.

  3. Example Workflow:

    • First, use a vision-capable model to process the image and obtain a text description.
    • Next, incorporate this text description into your conversation flow within ConversationChain by treating it as a regular text input.

This approach lets you leverage GPT-4 Turbo with Vision by manually bridging the gap between vision and text processing: handle the image with a vision-capable model first, then feed the resulting text into the ConversationChain, which is designed for text-based interactions. A sketch of this workflow is shown below.
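
A minimal sketch of this workaround, assuming an OpenAI GPT-4 Turbo vision model with BufferMemory; the prompt wording and variable names are illustrative, not a prescribed LangChain.js pattern:

import { ChatOpenAI } from "@langchain/openai";
import { HumanMessage } from "@langchain/core/messages";
import { ConversationChain } from "langchain/chains";
import { BufferMemory } from "langchain/memory";

// Step 1: describe the image with a vision-capable model, outside the chain.
const visionModel = new ChatOpenAI({ modelName: "gpt-4-turbo" });
const visionResponse = await visionModel.invoke([
  new HumanMessage({
    content: [
      { type: "text", text: "Describe this image" },
      {
        type: "image_url",
        image_url: {
          url: "https://www.freecodecamp.org/news/content/images/2023/05/Screenshot-2023-05-29-at-5.40.38-PM.png",
          detail: "low",
        },
      },
    ],
  }),
]);
// For OpenAI chat models this is a plain string description of the image.
const imageDescription = visionResponse.content;

// Step 2: feed the resulting text into ConversationChain as a regular string input.
const chain = new ConversationChain({
  llm: visionModel,
  memory: new BufferMemory(),
});
const result = await chain.call({
  input: `Here is a description of an image: ${imageDescription}. Summarize it in one sentence.`,
});
console.log(result.response);

Because the memory only ever sees plain strings, the chain's default prompt template keeps working, while the image handling stays outside the chain.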

#### Details

**About Dosu**
This response is meant to be useful and save you time. It is **not meant to be a precise solution**, but rather a starting point for your own research. Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

To continue the conversation, mention @dosu.