candideu closed this issue 1 year ago
You're looking for something that is locally installed, like Descript, but open source. Am I understanding correctly?
Yes, something like that! Descript definitely has a lot more cool features than what I'm proposing here, but it fits. I'd also like the option of adding your own language models (I'm only familiar with VOSK's in that regard).
Is this project still active?
@lauramzarescu It hasn't actually started. I just pitched the idea here.
UPDATE: I found this project on GitHub called Vosk Browser: https://ccoreilly.github.io/vosk-browser/. It uses the VOSK model and allows people to record audio or upload a file and it transcribes it. Pretty cool!
It could be a great starting point for this project. Would just need a way to edit the text, click on the text to jump to the segment in the audio, and an export function.
In the interest of self-promotion, I'd like to mention that I plan on integrating transcription-based audio editing into my browser-based audio editor Ennuizel. Ennuizel itself expects transcription to come from elsewhere, however—in my use case, vosk-browser and vosk, but done at a separate time.
@Yahweasel Great! Is there a demo that people can try, or a video walkthrough to preview the app?
Ennuizel is usable at https://ennuizel.github.io, but like I said, it expects transcription info from elsewhere. The transcription-based editing is currently just a plan, but I have all the timed-caption infrastructure in place, so it's really just a matter of doing it.
The only open datasets that exist, https://commonvoice.mozilla.org/en, don't even seem to be discussed in here, but this is step #1.
The Common Voice site led me to DeepSpeech. I'm not sure if they are the same, but Mozilla DeepSpeech appears to have been abandoned by Mozilla. I've seen several posts encouraging people to use coqui.ai instead.
https://discourse.mozilla.org/t/why-you-should-move-from-deepspeech-to-coqui-ai/82798
https://news.ycombinator.com/item?id=26813616
How does Common Voice/DeepSpeech compare to VOSK?
As I said, Common Voice is the dataset...
DeepSpeech was Mozilla's AI that used it, and yes, the people who want to progress with it have moved to coqui.ai. I don't see anywhere that VOSK supplies datasets, so how does it compare? Well, one exists... the other doesn't?
The datasets are several hundred GB in total for all languages, and an estimated less than 1% of dialects are covered overall. The new compressed English ones are 65 GB.
Accent
23% United States English
8% England English
7% India and South Asia (India, Pakistan, Sri Lanka)
3% Australian English
3% Canadian English
2% Scottish English
1% Irish English
1% Southern African (South Africa, Zimbabwe, Namibia)
1% New Zealand English
This is part of the reason why even giant corporations still fail at voice-to-text: we're lacking massive amounts of data. The USA sees it as much less of an issue, but anyone else with an English accent knows how terrible voice-to-text is. Too bad it isn't a global collaborative effort.
@gullabi has built a fork of oTranscribe which merges with Vosk Browser:
https://github.com/oTranscribe/oTranscribe/issues/107#issuecomment-1289004631
Concerning this issue, and as @candideu suggested, we have developed our own fork to implement this functionality, and it can be found here.
In a nutshell, we have integrated the vosk-browser functionality into oTranscribe and added an automated timestamp feature. It can transcribe any file loaded from the file system, inserting a timestamp for each minute. Since vosk-browser works offline, no file is sent anywhere and everything is done with the resources of the local machine.
We keep our repository as a fork since our intention is to introduce the changes to the original repo, if the maintainer is interested.
Finally, we will provide a publicly available deployed version of the app and a desktop version soon. In the meantime, we appreciate any help, QA or suggestions from the community.
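For anyone curious what "a timestamp for each minute" could look like in practice, here is a rough sketch. It is not the fork's actual code: the function name, the `[MM:SS]` marker style, and the `(start_seconds, text)` input shape are all assumptions made for illustration.

```python
def with_minute_markers(words, interval=60.0):
    """Build transcript text, starting a new line with a [MM:SS] marker every `interval` seconds.

    `words` is a list of (start_seconds, text) pairs sorted by start time.
    This input shape is an assumption for the sketch, not the fork's data model.
    """
    lines = []
    next_mark = 0.0
    for start, text in words:
        if start >= next_mark or not lines:
            # Snap the marker to the interval boundary that contains this word.
            mark = int(start // interval) * interval
            minutes, seconds = divmod(int(mark), 60)
            lines.append([f"[{minutes:02d}:{seconds:02d}]"])
            next_mark = mark + interval
        lines[-1].append(text)
    return "\n".join(" ".join(line) for line in lines)


example = with_minute_markers(
    [(1.0, "good"), (2.0, "morning"), (61.0, "next"), (185.0, "later")]
)
```

Minutes with no speech simply get no marker, which keeps long silences from producing empty lines.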
Update: it looks like Subtitle Edit is the answer to my request.
It's FOSS and recently added a speech-to-text function using both Vosk and OpenAI's Whisper.
Here's a tutorial demo: https://youtu.be/InsNe0KjFhg
The downside: it's only available for Windows.
Here's another update: there has been an explosion of free, multilingual speech-to-text tools thanks to Whisper. You can check out the showcase here: https://github.com/openai/whisper/discussions/categories/show-and-tell
Project description
Hello Open-Source-Ideas community!
The idea
A simple, easy-to-use application where users can dictate or upload audio or video files, and an automated transcript is generated. This transcript is synced to the audio track, clickable, and editable, so that users can skip to certain passages and refine the transcript accordingly.
The revised transcript can then be exported as plain text, .srt caption file (and other subtitle formats), .pdf, shareable web page, etc. for further processing.
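To make the .srt export concrete, here is a minimal sketch of how timed words could be grouped into numbered caption cues. The `(start, end, text)` tuples, the function names, and the seven-words-per-cue default are assumptions for illustration; a real tool would probably also break cues on pauses or punctuation.

```python
def srt_timestamp(seconds):
    """Format seconds as the HH:MM:SS,mmm timestamp used in .srt files."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(words, max_words=7):
    """Group (start, end, text) words into numbered SRT cues of at most `max_words` words."""
    cues = []
    for number, offset in enumerate(range(0, len(words), max_words), start=1):
        chunk = words[offset:offset + max_words]
        # A cue spans from the first word's start to the last word's end.
        start, end = chunk[0][0], chunk[-1][1]
        text = " ".join(word for _start, _end, word in chunk)
        cues.append(f"{number}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n")
    # Cues are separated by a blank line.
    return "\n".join(cues)


srt_text = to_srt([(0.0, 0.4, "hello"), (0.5, 1.25, "world")], max_words=1)
```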
Users can also provide their own language models, so that the number of possible languages that can be transcribed grows over time, as people create new models.
This application could be something you access from a browser that uses local storage, or a downloadable app (using something like Electron).
Inspiration, and the "Why"
As someone who works a lot with video and audio, and aims to make my work accessible, I'm a big fan of Otter.ai and Sonix.ai. They're very easy to use and provide pretty accurate transcriptions.
Issues, and what's missing in existing tools
That said, Otter and Sonix are not open-source, and their free tiers can be limiting. Both Otter and Sonix offer three lifetime uploads max, and Otter allows 40 minutes of live transcriptions per recording, with a max of 600 minutes a month (no rollover).
Otter only does transcriptions in English. Sonix does offer 37+ languages, but it doesn't look like you can provide your own language models. Other options like YouTube's automated transcriptions offer a wider range of languages, but that involves having to upload the media to YouTube, and there's no clickable transcript option.
Another issue is that some folks use automated transcriptions in their line of work, but cannot use cloud-based, proprietary software for legal reasons (see this Reddit thread).
Relevant Technology
I am in no way an expert, but it seems like Python would be relevant. That said, I'm open to any ideas, and open to having this be an application that's downloaded on your computer (with cross-platform support), or a web application that uses local storage, etc.
Speech-to-text
Vosk Browser
https://user-images.githubusercontent.com/55474996/134832439-86c3f65e-2fd7-4b6e-a7b5-129de2495617.mp4
VOSK Browser is a speech recognition library running in the browser thanks to a WebAssembly build of Vosk. This implementation is probably the one I'm the most excited about because it's very close to what I had in mind. The demo they've created allows you to use your microphone or to upload an audio file to create the transcription. The cool thing about this approach is that you don't need to set up any loopback methods if you are using pre-recorded audio, because the demo seems to do it on its own.
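As a sketch of what an app would have to do with the recognizer output, the snippet below flattens a few result strings into one list of timed words. It assumes the JSON shape Vosk produces when word timings are enabled (a `result` list of entries with `word`, `start` and `end`, plus a `text` field). The `Word` class, the function name, and the sample data are made up for illustration, and nothing here calls Vosk itself.

```python
import json
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds from the beginning of the audio
    end: float    # seconds from the beginning of the audio


def words_from_results(raw_results):
    """Flatten a list of JSON result strings into one list of timed words.

    Assumes each string has the shape {"result": [{"word", "start", "end", ...}], "text": ...}.
    Results without a "result" key (for example, silence) are skipped.
    """
    words = []
    for raw in raw_results:
        payload = json.loads(raw)
        for item in payload.get("result", []):
            words.append(Word(item["word"], float(item["start"]), float(item["end"])))
    return words


# Hypothetical sample data in the assumed shape.
sample = [
    '{"result": [{"conf": 1.0, "start": 0.0, "end": 0.42, "word": "hello"},'
    ' {"conf": 0.9, "start": 0.45, "end": 0.9, "word": "world"}], "text": "hello world"}',
    '{"text": ""}',
]
timed_words = words_from_results(sample)
```

A flat list like this is the common input that editing, click-to-seek, and export features could all share.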
According to the dev, "This project aims just to be a library that wraps a wasm build of vosk and the demo is just a demo of what can be done so I won't be adding such functionalities to the library itself. I have thought of integrating transcription with vosk-browser to oTranscribe which I guess would achieve what you want. I currently have no time for that but maybe someone can pick this up, would be really cool."
Potential ways to build upon this project:
Check out the Demo: https://ccoreilly.github.io/vosk-browser/
View GitHub Repo: https://github.com/ccoreilly/vosk-browser
ideasman42/nerd-dictation
Uses the VOSK API, but is meant for Linux and has to be installed from the command line. It also doesn't have a clickable transcript.
https://user-images.githubusercontent.com/55474996/134835664-fd393fd8-e0a1-4e2c-8cd4-cbb32c7c628c.mp4
Video demo
Source code can be viewed here
saharmor/realtime-transcription-playground
Very similar to what I'm proposing, but uses Google's Speech API, which involves creating a service account and knowing how to use their Cloud Console.
https://user-images.githubusercontent.com/6180201/124362454-370e6600-dc35-11eb-8374-77da5aec25b2.mp4
Source code can be viewed here
STTWebApp
Web application that uses VOSK to transcribe audio to text in Portuguese. It would be great if users could supply the language model of their choice.
Source code can be viewed here
Clickable, Interactive Transcript
AblePlayer
Able Player is a fully accessible, open-source cross-browser HTML5 media player. It's not a speech-to-text API, but the player has a really neat clickable transcript feature that can be seen in the following example:
https://user-images.githubusercontent.com/55474996/134834845-f3733513-038e-4791-9152-a4fc7bb85c91.mp4
The source code can be viewed here.
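The logic behind a clickable, synced transcript is fairly small. Here is a sketch of the two lookups involved: which word to highlight at a given playback time, and where to seek when a word is clicked. This is not Able Player's code; the transcript data and function names are hypothetical.

```python
from bisect import bisect_right

# Hypothetical timed transcript: (start_seconds, end_seconds, word).
TRANSCRIPT = [
    (0.0, 0.4, "welcome"),
    (0.5, 0.8, "to"),
    (0.9, 1.3, "the"),
    (1.4, 2.0, "show"),
]
STARTS = [start for start, _end, _word in TRANSCRIPT]


def word_index_at(seconds):
    """Return the index of the word to highlight at `seconds`, or None before the first word.

    During a pause between words, the most recently started word is returned so
    the highlight does not flicker.
    """
    index = bisect_right(STARTS, seconds) - 1
    return index if index >= 0 else None


def seek_time_for(index):
    """Return the playback position to jump to when word `index` is clicked."""
    return TRANSCRIPT[index][0]
```

Using a binary search over the start times keeps the lookup fast even for long recordings, where it would run on every playback time update.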
Subtitle + Transcript Editors + Previewers
oTranscribe
oTranscribe is one of the more well-known options in this space. It's a tool for manually transcribing audio interviews that allows you to import a video or audio file, and manually type the transcript. You can also add timestamps which can be clicked on to jump to that point in the audio/video. oTranscribe also features great keyboard shortcuts and playback tools to ease the transcription process.
There's even an oTranscribe for Electron fork that could be interesting to look into.
Drawbacks:
View the website here: https://otranscribe.com/
View the repo here: https://github.com/oTranscribe
Hyperaudio
Hyperaudio seems to be working on an exciting suite of open interactive transcript tools which allow people to Navigate, Search and Edit transcripts!
I especially want to highlight the following tools, which could be of interest:
Hyperaudio Lite Editor: A lightweight transcript editor for editing and correcting STT generated timed transcripts
https://user-images.githubusercontent.com/55474996/134836728-16ddd72a-77a7-4043-b291-6a470224ee60.mp4
Hyperaudio Lite: a Super-lightweight Interactive Transcript Player
Hyperaudio Converter: converts from JSON/SRT to HTML Based Interactive Transcript
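To give a feel for what an SRT-to-HTML conversion involves, here is a small sketch that turns each SRT cue into a `<span>` carrying its start time and duration in milliseconds. It is only an illustration of the idea, not Hyperaudio's converter: the `data-start` and `data-duration` attribute names are hypothetical and were picked for this example.

```python
import html
import re

# Matches an SRT timing line such as "00:00:00,500 --> 00:00:01,250".
CUE_TIME = re.compile(
    r"(\d{2}):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d{2}):(\d{2}):(\d{2})[,.](\d{3})"
)


def _to_ms(hours, minutes, seconds, millis):
    return ((int(hours) * 60 + int(minutes)) * 60 + int(seconds)) * 1000 + int(millis)


def srt_to_html(srt_text):
    """Convert SRT cues to <span> elements with start time and duration in milliseconds.

    The data-start / data-duration attribute names are hypothetical.
    """
    spans = []
    # Cues are separated by blank lines: a number line, a timing line, then the text.
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        timing_index = next(
            (i for i, line in enumerate(lines) if CUE_TIME.search(line)), None
        )
        if timing_index is None:
            continue  # not a cue, skip it
        groups = CUE_TIME.search(lines[timing_index]).groups()
        start, end = _to_ms(*groups[:4]), _to_ms(*groups[4:])
        # Escape the caption text so characters like & and < stay valid HTML.
        text = html.escape(" ".join(lines[timing_index + 1:]))
        spans.append(f'<span data-start="{start}" data-duration="{end - start}">{text}</span>')
    return "<p>" + " ".join(spans) + "</p>"


page = srt_to_html(
    "1\n00:00:00,500 --> 00:00:01,250\nTom & Jerry\n\n"
    "2\n00:01:00,000 --> 00:01:02,000\nsecond line\n"
)
```

A player script could then read those attributes to highlight the current span and to seek when a span is clicked.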
Hyperaudio Website for now: https://lab.hyperaud.io/
Official Website: https://hyper.audio/
All-rounders
Kdenlive
The open-source video editor introduced a speech-to-text module in version 21.04 using VOSK, an offline speech-recognition API. That said, the feature is still pretty new and kind of buggy. It also involves having to download Python and knowing how to use Kdenlive. I like the idea of using VOSK's API, but I think having a simple, dedicated application that works out of the box for automated transcriptions would be best, especially for people who aren't tech-savvy.
View their source code here: https://invent.kde.org/multimedia/kdenlive/-/tree/master/data/scripts
Video Transcriber
Video Transcriber is a computer-assisted video/audio transcription tool which, from what I can gather, seems to be what I have in mind. It's a prototype made with journalists and media professionals in mind.
Unfortunately, the demo link I found seems to be broken, so I haven't been able to test this one out. Testing this project otherwise would involve installing dependencies and creating an IBM Bluemix Account (which has monthly limits). The implementation I had in mind would be easy for non-technical users to use out-of-the-box.
View the repo: https://github.com/glitchdigital/video-transcriber
Complexity and required time
I'm not the most knowledgeable on these frameworks, so please let me know if I should tick other options for the complexity. That said, I'm open to helping with the design of the user interface.
Complexity
Required time (ETA)
Categories