langflow-ai / langflow

Langflow is a low-code app builder for RAG and multi-agent AI applications. It’s Python-based and agnostic to any model, API, or database.
http://www.langflow.org
MIT License
28.95k stars 3.7k forks source link

multimodality #3530

Open zhengqianhe0 opened 3 weeks ago

zhengqianhe0 commented 3 weeks ago

Bug Description

I found that all input and output before langflow are text. I would like to know if there is currently a function to upload and parse images? This refers to multimodality, which means directly understanding the content of images or audio/video, rather than extracting text from images and processing it with large models. Does the current version 1.0.16 have this feature? Do you have any plans to integrate multimodal large models?

Reproduction

I can't find how to use components with multimodal capabilities。How can I use this

Expected behavior

Upload JPG image and parse it

Who can help?

No response

Operating System

windows 10

Langflow Version

1.0.16

Python Version

None

Screenshot

No response

Flow File

No response

dosubot[bot] commented 3 weeks ago

Yes, the current version 1.0.16 of Langflow supports uploading and parsing images for multimodal capabilities. Additionally, there are plans to add more types of Input and Output like Image and Audio, and to integrate more Components to help build more complex projects.

Supporting evidence:

For handling image uploads and parsing, the specific components or modules in Langflow are found in the src/backend/base/langflow/schema/image.py file. These include:

  1. Functions:

    • is_image_file(file_path): Verifies if a given file is an image.
    • get_file_paths(files: list[str]): Retrieves the full paths of the files from the storage service.
    • get_files(file_paths: list[str], convert_to_base64: bool = False): Fetches the files from the storage service and optionally converts them to base64.
  2. Class:

    • Image: A Pydantic model that includes methods for converting an image to base64 (to_base64), creating a content dictionary (to_content_dict), and generating a URL for the image (get_url). [2]

I found 2 similar open issues:

  1. Developer's suggestion which is open and was last updated on August 23, 2024. The issue has no comments or solution.
  2. [Feature Request] Multi-media I/O (Image, Sound, Video) which is closed and was last updated on March 20, 2024. The suggested solution is: "I am using langchain's documentloader to use gpt4 vision. https://python.langchain.com/docs/integrations/document_loaders/image" [3][4]

To continue talking to Dosu, mention @dosu.