Open g4challenge opened 8 months ago
Yes, having audio and video transcription would be a very useful feature.
This is one of the main feature which i was looking for when i installed openwebui..
In scientific research,it will be a very good feature to be able to record a meeting and then summarize it, and keep it in the workspace. Perhaps it should be compatible with milvus to store the audio and the notes?
I have used https://github.com/JigsawStack/insanely-fast-whisper-api and https://github.com/Vaibhavs10/insanely-fast-whisper
To add my 2 cents: Since openwebui has an integrated whisper running (and api possibility), it really feels like a wasted opportunity to not be able to use it directly. Same goes for TTS. I imagine a lot of the code is already there since they both are used behind the scenes.
But I have to acknowledge that openwebui is supposed to be a t2t UI, and starting to do stt and tts may be out of scope and increase complexity. In a perfect word, I host my own openai api whisper docker, connect it to openwebui docker, and for direct whisper usage I use another docker with a proper openai api compatible tts webui.
Still, would be very cool to have basic stt and tts using microphone and files in openwebui.
It would be great to support some common video formats as well, thanks! ☺️
Is your feature request related to a problem? Please describe. I find it challenging when I need to manually transcribe audio content. Whether it’s interviews, meetings, or recorded conversations, having an automated audio transcription feature would significantly improve my workflow.
Describe the solution you’d like I would like OpenWebUI to include an audio transcription feature. Ideally, it should accept audio files (such as MP3, WAV, or other common formats) and convert them into accurate text transcripts. The transcripts should be time-stamped and easily accessible within the interface.
Describe alternatives you’ve considered As an alternative, I’ve explored third-party transcription services based on Whisper with UI (https://github.com/chidiwilliams/buzz , or https://github.com/jhj0517/Whisper-WebUI) but they often come with limitations in installation, sharing, privacy concerns, and additional costs and effort. Having an integrated solution within OpenWebUI would streamline the process and enhance the overall user experience.
Additional context Sometimes, I participate in remote interviews or attend virtual meetings where audio recordings are essential. Having an in-built transcription feature would save time and effort, allowing me to focus on the content rather than manual transcription tasks. When finished I would love to have the ability, to input to a LLM with predefined prompts: eg. "use the following transcript to create a short precise summary in bullet point".