modelscope / ms-swift

Use PEFT or Full-parameter to finetune 400+ LLMs or 100+ MLLMs. (LLM: Qwen2.5, Llama3.2, GLM4, Internlm2.5, Yi1.5, Mistral, Baichuan2, DeepSeek, Gemma2, ...; MLLM: Qwen2-VL, Qwen2-Audio, Llama3.2-Vision, Llava, InternVL2, MiniCPM-V-2.6, GLM4v, Xcomposer2.5, Yi-VL, DeepSeek-VL, Phi3.5-Vision, ...)
https://swift.readthedocs.io/zh-cn/latest/Instruction/index.html
Apache License 2.0

Add Multimodal Input Support (Image, Audio, Video) to App-UI in MS-Swift Library #2469

Open SushantGautam opened 1 week ago

SushantGautam commented 1 week ago

The MS-Swift library currently supports models capable of processing multimodal input (image, audio, video) via the web-UI; however, this functionality is not available in the app-UI. We request that multimodal input support be added to the app-UI so that models with multimodal capabilities can be used there as seamlessly as in the web-UI, keeping the two interfaces feature-aligned.

Adding this feature would improve the MS-Swift library's usability in mobile and desktop application development by ensuring consistent multimodal support across platforms. This could involve exposing upload and processing paths for the different data modalities in the app-UI and providing developers with examples or templates for implementation (a rough sketch is given below). Such an update would broaden the library's applicability to real-world scenarios such as multimedia content analysis, accessibility tools, and creative applications.
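
As an illustration only (not ms-swift code): since the ms-swift UIs are built on Gradio, the requested app-UI behavior could look roughly like the sketch below, where a text prompt plus optional image/audio/video upload components feed a single inference callback. The function `run_multimodal_inference` is a hypothetical placeholder that would need to be wired to an actual MLLM served through ms-swift.

```python
# Minimal sketch of a multimodal app-UI, assuming a Gradio front end.
# `run_multimodal_inference` is a hypothetical placeholder, not an ms-swift API.
import gradio as gr


def run_multimodal_inference(prompt, image, audio, video):
    """Hypothetical callback: replace with a call into an MLLM served by ms-swift."""
    attached = [name for name, f in (("image", image), ("audio", audio), ("video", video)) if f]
    return f"[demo] prompt={prompt!r}, attachments={attached or 'none'}"


with gr.Blocks(title="Multimodal chat (sketch)") as demo:
    prompt = gr.Textbox(label="Prompt")
    # Optional attachments; each component returns a local file path.
    image = gr.Image(label="Image", type="filepath")
    audio = gr.Audio(label="Audio", type="filepath")
    video = gr.Video(label="Video")
    output = gr.Textbox(label="Model response")
    gr.Button("Submit").click(
        run_multimodal_inference,
        inputs=[prompt, image, audio, video],
        outputs=output,
    )

if __name__ == "__main__":
    demo.launch()
```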