gitmylo / audio-webui

A webui for different audio related Neural Networks
MIT License

[FEATURE REQUEST] Whisper QOL Enhancements #161

Open revolvedai opened 9 months ago

revolvedai commented 9 months ago

Is your feature request related to a problem? Please describe. Audio Webui is something I have always wanted, and I'm so hyped to see it open source! I have some improvements related to Whisper, which mostly come from my daydreaming about how I would build it (and some half-hearted attempts to build it with GPT-4 via batch file).

There are some usability and QoL improvements that could be made to the Whisper tab. For context, I'm with RunDiffusion, but this feedback comes mostly from my own use.

Describe the solution you'd like At RunDiffusion we prefer to specify paths rather than use drop boxes. The "upload" functionality actually works great on RD, which is a massive improvement over most gradio platforms with drop boxes! However, it would be ideal to simply specify a path to be processed as a batch, especially since paths on our system mostly look like /mnt/private/audiobatchexample/ etc. Because uploads go through the browser, I'm hesitant to upload more than 10 audio files at a time, as I'm concerned about browser issues. Specifying a path would be the easiest way to handle large numbers of files. Drop boxes are good for local use and when you need to do one file at a time, but not great for remote servers and large numbers of files.
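To make the path-based request more concrete, here is a minimal sketch of how a batch could be gathered from a server-side directory instead of a browser upload. The function name `collect_audio_files` and the extension list are my own assumptions for illustration, not anything that exists in audio-webui today.

```python
from pathlib import Path

# Hypothetical set of extensions to treat as audio; adjust as needed.
AUDIO_EXTENSIONS = {".wav", ".mp3", ".flac", ".m4a", ".ogg"}


def collect_audio_files(directory):
    """Return the audio files directly inside `directory`, sorted by name."""
    root = Path(directory)
    if not root.is_dir():
        raise NotADirectoryError(f"Not a directory: {root}")
    return sorted(
        p for p in root.iterdir()
        if p.is_file() and p.suffix.lower() in AUDIO_EXTENSIONS
    )
```

With something like this, the Whisper tab would only need a text box for the path (for example /mnt/private/audiobatchexample/), and the resulting list could be handed to whatever transcription loop already handles uploaded files.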

A "Download All" option for batch Whisper transcription outputs would be amazingly useful. Downloading them one at a time is not great. It would also be nice if they were automatically written to an output directory, ideally one that could be specified via settings or at startup.
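One way a "Download All" could work is to bundle every transcript in the output directory into a single zip archive and offer that one file for download. This is only a sketch using Python's standard `zipfile` module; the function name `zip_transcripts` and the assumption that transcripts are `.txt` files in one flat directory are mine.

```python
import zipfile
from pathlib import Path


def zip_transcripts(output_dir, archive_path):
    """Bundle every .txt file in `output_dir` into one zip archive."""
    output_dir = Path(output_dir)
    archive_path = Path(archive_path)
    transcripts = sorted(output_dir.glob("*.txt"))
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for transcript in transcripts:
            # Store only the file name so the archive has no server paths in it.
            archive.write(transcript, arcname=transcript.name)
    return archive_path
```

The returned path could then be passed to a single download component, so a batch of any size is one click.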

An underscore between the original file name and the generated suffix would make the outputs sortable. For example, I had a large number of files, and they were not sorted into the correct order when downloaded, because a file like 1249.wav would come out as 12493eiffj3jmcfmfmw32.txt. With the underscore, it would sort properly alphabetically: 1249.wav would be directly above 1249_3eiffj3jmcfmfmw32.txt.
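The naming change is small enough to sketch. The helper name `transcript_name` is hypothetical, and I'm assuming the random-looking part of the output name is a suffix that the webui appends to the original file stem.

```python
from pathlib import Path


def transcript_name(audio_path, suffix):
    """Build an output name as <original stem>_<suffix>.txt."""
    return f"{Path(audio_path).stem}_{suffix}.txt"


# Example using the file from the description above.
name = transcript_name("1249.wav", "3eiffj3jmcfmfmw32")
print(name)
print(sorted([name, "1249.wav"]))
```

The underscore keeps the original stem visible as its own token, so 1249 and a longer stem such as 12493 no longer blur together in the output names.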

Describe alternatives you've considered I'm interested in the language capabilities of Whisper as well, but I'm not sure how that fits into the webui you have created for Whisper.

Additional context Overall I'm super happy that you have included Whisper and am enthusiastic about all the tools.