modelscope / ms-swift

Use PEFT or Full-parameter to finetune 400+ LLMs or 100+ MLLMs. (LLM: Qwen2.5, Llama3.2, GLM4, Internlm2.5, Yi1.5, Mistral, Baichuan2, DeepSeek, Gemma2, ...; MLLM: Qwen2-VL, Qwen2-Audio, Llama3.2-Vision, Llava, InternVL2, MiniCPM-V-2.6, GLM4v, Xcomposer2.5, Yi-VL, DeepSeek-VL, Phi3.5-Vision, ...)
https://swift.readthedocs.io/zh-cn/latest/Instruction/index.html
Apache License 2.0

Add Multimodal Input Support (Image, Audio, Video) to App-UI in MS-Swift Library #2469

Open SushantGautam opened 1 week ago

SushantGautam commented 1 week ago

The MS-Swift library currently supports models capable of processing multimodal input (image, audio, video) via the web-UI; however, this functionality is not available in the app-UI. We request that multimodal input support be added to the app-UI so that models with multimodal capabilities can be used there as seamlessly as in the web-UI, keeping the two interfaces feature-aligned.

Adding this feature would improve the MS-Swift library's usability in mobile and desktop application development by ensuring consistent multimodal support across platforms. This could involve exposing upload and processing paths for the different data modalities in the app-UI and providing developers with examples or templates for implementation (a rough sketch is given below). Such an update would broaden the library's applicability to real-world scenarios such as multimedia content analysis, accessibility tools, and creative applications.
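
As an illustration only (not ms-swift code): since the ms-swift UIs are built on Gradio, the requested app-UI behavior could look roughly like the sketch below, where a text prompt plus optional image/audio/video upload components feed a single inference callback. The function `run_multimodal_inference` is a hypothetical placeholder that would need to be wired to an actual MLLM served through ms-swift.

```python
# Minimal sketch of a multimodal app-UI, assuming a Gradio front end.
# `run_multimodal_inference` is a hypothetical placeholder, not an ms-swift API.
import gradio as gr


def run_multimodal_inference(prompt, image, audio, video):
    """Hypothetical callback: replace with a call into an MLLM served by ms-swift."""
    attached = [name for name, f in (("image", image), ("audio", audio), ("video", video)) if f]
    return f"[demo] prompt={prompt!r}, attachments={attached or 'none'}"


with gr.Blocks(title="Multimodal chat (sketch)") as demo:
    prompt = gr.Textbox(label="Prompt")
    # Optional attachments; each component returns a local file path.
    image = gr.Image(label="Image", type="filepath")
    audio = gr.Audio(label="Audio", type="filepath")
    video = gr.Video(label="Video")
    output = gr.Textbox(label="Model response")
    gr.Button("Submit").click(
        run_multimodal_inference,
        inputs=[prompt, image, audio, video],
        outputs=output,
    )

if __name__ == "__main__":
    demo.launch()
```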