[Feature Request]: GLM-4V

Feature Name

GLM-4V

Feature Description

GLM-4V-9B is an open source multimodal version of the latest generation of pre-trained models in the GLM-4 series launched by Zhipu AI. GLM-4V-9B has the ability to conduct multi-round conversations in Chinese and English at a high resolution of 1120 * 1120. In multimodal evaluations of comprehensive Chinese and English abilities, perceptual reasoning, text recognition, and chart understanding, GLM-4V-9B has shown superior performance over GPT-4-turbo-2024-04-09, Gemini 1.0 Pro, Qwen-VL-Max, and Claude 3 Opus.

Motivation

Multimodal Capabilities: GLM-4V-9B can process and understand both text and images, making it suitable for tasks like image captioning, visual question answering, and more.
High Resolution: Supports high-resolution inputs up to 1120x1120 pixels.
Multilingual Support: Capable of conducting multi-round conversations in both Chinese and English.
Performance: Demonstrates superior performance in various benchmarks, outperforming models like GPT-4-turbo, Gemini 1.0 Pro, and Claude 3 Opus.

Potential Solutions

https://open.bigmodel.cn/dev/api/normal-model/glm-4

Additional Context (optional)

No response

Affected Areas

None

Priority

Low

Required Files

[X] Test File
[X] Component File

swarmauri / swarmauri-sdk