candideu commented 2 years ago

Project description

Hello Open-Source-Ideas community!

The idea

A simple, easy-to-use application where users can dictate or upload audio or video files, and an automated transcript is generated. This transcript is synced to the audio track, clickable, and editable, so that users can skip to certain passages and refine the transcript accordingly.

The revised transcript can then be exported as plain text, .srt caption file (and other subtitle formats), .pdf, shareable web page, etc. for further processing.
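To make the export step a bit more concrete, here is a minimal sketch of how timed transcript segments could be written out as an .srt caption file. The `Segment` class and the function names are hypothetical and only serve the illustration; a real application would probably need more care around line length and cue duration.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One transcript segment (hypothetical structure for this sketch)."""
    start: float  # seconds from the beginning of the media
    end: float    # seconds from the beginning of the media
    text: str


def srt_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm form used in .srt files."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: list[Segment]) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}\n"
            f"{seg.text.strip()}\n"
        )
    return "\n".join(cues)


if __name__ == "__main__":
    print(to_srt([Segment(0.0, 1.5, "Hello"), Segment(1.5, 3.0, "World")]))
```

The same segment list could feed the other export targets (plain text, web page), since only the rendering step differs.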

Users can also provide their own language models, so that the number of possible languages that can be transcribed grows over time, as people create new models.

This application could be something you access from a browser that uses local storage, or a downloadable app (using something like Electron).

Inspiration, and the "Why"

As someone who works a lot with video and audio, and aims to make my work accessible, I'm a big fan of Otter.ai and Sonix.ai. They're very easy to use and provide pretty accurate transcriptions.

Issues, and what's missing in existing tools

That said, Otter and Sonix are not open-source, and their free tiers can be limiting. Both Otter and Sonix offer three lifetime uploads max, and Otter allows 40 minutes of live transcriptions per recording, with a max of 600 minutes a month (no rollover).

Otter only does transcriptions in English. Sonix does offer 37+ languages, but it doesn't look like you can provide your own language models. Other options like YouTube's automated transcriptions offer a wider range of languages, but that involves having to upload the media to YouTube, and there's no clickable transcript option.

Another issue is that some folks use automated transcriptions in their line of work, but cannot use cloud-based, proprietary software for legal reasons (see this Reddit thread).

Relevant Technology

I am in no way an expert, but it seems like Python would be relevant. That said, I'm open to any ideas, and open to having this be an application that's downloaded on your computer (with cross-platform support), or a web application that uses local storage, etc.

Speech-to-text

Vosk Browser

https://user-images.githubusercontent.com/55474996/134832439-86c3f65e-2fd7-4b6e-a7b5-129de2495617.mp4

VOSK Browser is a speech recognition library running in the browser thanks to a WebAssembly build of Vosk. This implementation is probably the one I'm the most excited about because it's very close to what I had in mind. The demo they've created allows you to use your microphone or to upload an audio file to create the transcription. The cool thing about this approach is that you don't need to set up any loopback methods if you are using pre-recorded audio, because the demo seems to do it on its own.

According to the dev, "This project aims just to be a library that wraps a wasm build of vosk and the demo is just a demo of what can be done so I won't be adding such functionalities to the library itself. I have thought of integrating transcription with vosk-browser to oTranscribe which I guess would achieve what you want. I currently have no time for that but maybe someone can pick this up, would be really cool."

Potential ways to build upon this project:

Adding punctuation: I've found a number of punctuation restoration projects on here that could help with that such as punctuator2 and its many forks such as PunkProse. Punctuator2 even has a nifty demo which you can try out here. I also found an implementation of PunkProse + VOSK here.
Making the transcript editable
Adding timings that are synced to the audio (I assume that the live dictation would have to be recorded)
The ability to export the work as a subtitle/caption file
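As a rough sketch of the timing and caption items in the list above: assuming the recognizer returns word-level results as JSON with a `result` list whose entries carry `word`, `start`, and `end` (this is the shape I understand Vosk produces when word timings are enabled, but treat it as an assumption), the words could be grouped into caption-sized chunks like this. The function names and the `max_words` / `max_gap` thresholds are hypothetical.

```python
import json


def _flush(words):
    """Collapse a run of word entries into one caption dictionary."""
    return {
        "start": words[0]["start"],
        "end": words[-1]["end"],
        "text": " ".join(w["word"] for w in words),
    }


def words_to_captions(result_json, max_words=7, max_gap=0.8):
    """Group word timings into captions.

    A new caption starts when the current one already holds max_words
    words, or when the pause before the next word exceeds max_gap seconds.
    """
    words = json.loads(result_json).get("result", [])
    captions = []
    current = []
    for word in words:
        too_long = len(current) >= max_words
        long_pause = bool(current) and word["start"] - current[-1]["end"] > max_gap
        if current and (too_long or long_pause):
            captions.append(_flush(current))
            current = []
        current.append(word)
    if current:
        captions.append(_flush(current))
    return captions
```

Each resulting caption carries a start and end time, so it could be shown as a clickable line in an editor or passed on to a subtitle exporter.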

Check out the Demo: https://ccoreilly.github.io/vosk-browser/ View GitHub Repo: https://github.com/ccoreilly/vosk-browser

ideasman42/nerd-dictation

Uses the VOSK API, but it is meant for Linux and has to be installed from the command line. It also doesn't have a clickable transcript.

https://user-images.githubusercontent.com/55474996/134835664-fd393fd8-e0a1-4e2c-8cd4-cbb32c7c628c.mp4

Video demo

Source code can be viewed here

saharmor/realtime-transcription-playground

Very similar to what I'm proposing, but uses Google's Speech API, which involves creating a service account and knowing how to use their Cloud Console.

https://user-images.githubusercontent.com/6180201/124362454-370e6600-dc35-11eb-8374-77da5aec25b2.mp4

Source code can be viewed here

STTWebApp

A web application that uses VOSK to transcribe audio to text in Portuguese. It would be great if users could supply a language model of their choice.

Source code can be viewed here

Clickable, Interactive Transcript

AblePlayer

Able Player is a fully accessible, open-source cross-browser HTML5 media player. It's not a speech-to-text API, but the player has a really neat clickable transcript feature that can be seen in the following example:

Demo #6

https://user-images.githubusercontent.com/55474996/134834845-f3733513-038e-4791-9152-a4fc7bb85c91.mp4

The source code can be viewed here.
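The logic behind a clickable, synced transcript can be sketched independently of any particular player. This is not Able Player's implementation, just a minimal illustration under the assumption that every transcript segment has a known start time in seconds: a binary search finds the segment to highlight for the current playback position, and a click simply maps a segment back to its start time. Both function names are hypothetical.

```python
from bisect import bisect_right


def active_segment(starts, playback_time):
    """Return the index of the segment playing at playback_time.

    starts must be sorted ascending. Returns None when playback_time
    is earlier than the first segment.
    """
    index = bisect_right(starts, playback_time) - 1
    return index if index >= 0 else None


def seek_time_for_click(starts, segment_index):
    """Return the time the player should jump to when a segment is clicked."""
    return starts[segment_index]


if __name__ == "__main__":
    starts = [0.0, 4.2, 9.8]
    print(active_segment(starts, 5.0))
    print(seek_time_for_click(starts, 2))
```

In a browser, the first function would typically run on each playback time update and the second in the click handler of a transcript line.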

Subtitle + Transcript Editors + Previewers

oTranscribe

oTranscribe is one of the more well-known options in this space. It's a tool for manually transcribing audio interviews that allows you to import a video or audio file, and manually type the transcript. You can also add timestamps which can be clicked on to jump to that point in the audio/video. oTranscribe also features great keyboard shortcuts and playback tools to ease the transcription process.

There's even an oTranscribe for Electron fork that could be interesting to look into.

Drawbacks:

No speech-to-text
Cannot export to .srt (although an .otr to .srt conversion is possible with this external tool)
Cannot edit timestamps as text

View the website here: https://otranscribe.com/ View the repo here: https://github.com/oTranscribe

Hyperaudio

Hyperaudio seems to be working on an exciting suite of open interactive transcript tools which allow people to Navigate, Search and Edit transcripts!

In particular, I want to highlight the following tools, which could be of interest:

Hyperaudio Lite Editor: A lightweight transcript editor for editing and correcting STT generated timed transcripts

https://user-images.githubusercontent.com/55474996/134836728-16ddd72a-77a7-4043-b291-6a470224ee60.mp4

Repo: https://github.com/hyperaudio/hyperaudio-lite-editor

Hyperaudio Lite: a Super-lightweight Interactive Transcript Player

Repo: https://github.com/hyperaudio/hyperaudio-lite

Hyperaudio Converter: converts from JSON/SRT to HTML Based Interactive Transcript
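To give an idea of what an SRT-to-HTML conversion involves, here is a minimal sketch. It is not the Hyperaudio Converter's code, and the `data-start` / `data-end` attribute names are placeholders chosen for this example rather than Hyperaudio's actual markup; times are stored in milliseconds.

```python
import html
import re

# Matches SRT-style timestamps such as 00:01:02,345 (a dot is also accepted).
TIME = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


def to_ms(groups):
    """Convert (hours, minutes, seconds, millis) strings to milliseconds."""
    h, m, s, ms = map(int, groups)
    return ((h * 60 + m) * 60 + s) * 1000 + ms


def parse_srt(text):
    """Parse SRT text into a list of (start_ms, end_ms, caption) tuples."""
    cues = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        timing = next((i for i, line in enumerate(lines) if "-->" in line), None)
        if timing is None:
            continue  # not a cue block
        stamps = TIME.findall(lines[timing])
        if len(stamps) < 2:
            continue  # malformed timing line
        caption = " ".join(line.strip() for line in lines[timing + 1:])
        cues.append((to_ms(stamps[0]), to_ms(stamps[1]), caption))
    return cues


def cues_to_html(cues):
    """Wrap each cue in a span carrying its timing as data attributes."""
    spans = [
        f'<span data-start="{start}" data-end="{end}">{html.escape(text)}</span>'
        for start, end, text in cues
    ]
    return "<p>\n" + "\n".join(spans) + "\n</p>"
```

A small script on the page could then read the data attributes to highlight the current span and to seek the media when a span is clicked.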

Hyperaudio Website for now: https://lab.hyperaud.io/ Official Website: https://hyper.audio/

All-rounders

Kdenlive

The open-source video editor introduced a speech-to-text module in version 21.04 using VOSK, an offline speech-recognition API. That said, the feature is still pretty new and kind of buggy. It also involves having to download Python and knowing how to use Kdenlive. I like the idea of using VOSK's API, but I think having a simple, dedicated application that works out of the box for automated transcriptions would be best, especially for people who aren't tech-savvy.

View their source code here: https://invent.kde.org/multimedia/kdenlive/-/tree/master/data/scripts

Video Transcriber

Video Transcriber is a computer-assisted video/audio transcription tool which, from what I can gather, seems close to what I have in mind. It's a prototype made for journalists and media professionals.

Unfortunately, the demo link I found seems to be broken, so I haven't been able to test this one out. Testing this project otherwise would involve installing dependencies and creating an IBM Bluemix Account (which has monthly limits). The implementation I had in mind would be easy for non-technical users to use out-of-the-box.

View the repo: https://github.com/glitchdigital/video-transcriber

Complexity and required time

I'm not the most knowledgeable on these frameworks, so please let me know if I should tick other options for the complexity. That said, I'm open to helping with the design of the user interface.

Complexity

[ ] Beginner - This project requires no or little prior knowledge of the technolog(y|ies) specified to contribute to the project
[x] Intermediate - The user should have some prior knowledge of the technolog(y|ies) to the point where they know how to use it, but not necessarily all the nooks and crannies of the technology
[x] Advanced - The project requires the user to have a good understanding of all components of the project to contribute

Required time (ETA)

[ ] Little work - A couple of days
[x] Medium work - A week or two
[x] Much work - The project will take more than a couple of weeks and serious planning is required

boaticus commented 2 years ago

You're looking for something that is locally installed, like Descript, but open source. Am I understanding correctly?

candideu commented 2 years ago

You're looking for something that is locally installed, like Descript, but open source. Am I understanding correctly?

Yes, something like that! Descript definitely has a lot more cool features than what I'm proposing here, but it fits. I'd also like the option of adding your own language models (I'm only familiar with VOSK's in that regard).

lauramzarescu commented 2 years ago

Is this project still active?

candideu commented 2 years ago

Is this project still active?

@lauramzarescu It hasn't actually started. I just pitched the idea here.

candideu commented 2 years ago

UPDATE: I found this project on GitHub called Vosk Browser: https://ccoreilly.github.io/vosk-browser/. It uses the VOSK model and allows people to record audio or upload a file and it transcribes it. Pretty cool!

It could be a great starting point for this project. Would just need a way to edit the text, click on the text to jump to the segment in the audio, and an export function.

https://user-images.githubusercontent.com/55474996/134832439-86c3f65e-2fd7-4b6e-a7b5-129de2495617.mp4

Yahweasel commented 2 years ago

In the interest of self-promotion, I'd like to mention that I plan on integrating transcription-based audio editing into my browser-based audio editor Ennuizel. Ennuizel itself expects transcription to come from elsewhere, however—in my use case, vosk-browser and vosk, but done at a separate time.

candideu commented 2 years ago

In the interest of self-promotion, I'd like to mention that I plan on integrating transcription-based audio editing into my browser-based audio editor Ennuizel. Ennuizel itself expects transcription to come from elsewhere, however—in my use case, vosk-browser and vosk, but done at a separate time.

@Yahweasel Great! Is there a demo that people can try, or a video walk through to preview the app?

Yahweasel commented 2 years ago

In the interest of self-promotion, I'd like to mention that I plan on integrating transcription-based audio editing into my browser-based audio editor Ennuizel. Ennuizel itself expects transcription to come from elsewhere, however—in my use case, vosk-browser and vosk, but done at a separate time.

@Yahweasel Great! Is there a demo that people can try, or a video walk through to preview the app?

Ennuizel is usable at https://ennuizel.github.io , but like I said, it expects transcription info from elsewhere. The transcription-based editing is currently just a plan, but I have all the timed-caption infrastructure in place, so it's really just a matter of doing it.

G2G2G2G commented 2 years ago

The only open datasets that exist (https://commonvoice.mozilla.org/en) don't even seem to be discussed in here, but this is step #1.

candideu commented 2 years ago

The only open datasets that exist (https://commonvoice.mozilla.org/en) don't even seem to be discussed in here, but this is step #1.

The Common Voice site led me to DeepSpeech. I'm not sure if they are the same, but Mozilla DeepSpeech appears to have been abandoned by Mozilla. I've seen several posts encouraging people to use coqui.ai instead.

https://discourse.mozilla.org/t/why-you-should-move-from-deepspeech-to-coqui-ai/82798

https://news.ycombinator.com/item?id=26813616

How does Common Voice/DeepSpeech compare to VOSK?

G2G2G2G commented 2 years ago

As I said, Common Voice is the datasets...

DeepSpeech was Mozilla's AI that used it, and yes, the people who want to progress with it have moved to coqui.ai. I don't see anywhere that VOSK supplies datasets, so how does it compare? Well, one exists and the other doesn't.

They're several hundred GB in total for all languages, and an estimated less than 1% of dialects is completed overall. The new compressed English ones are 65 GB.

Accent
23%  United States English
8%   England English
7%   India and South Asia (India, Pakistan, Sri Lanka)
3%   Australian English
3%   Canadian English
2%   Scottish English
1%   Irish English
1%   Southern African (South Africa, Zimbabwe, Namibia)
1%   New Zealand English

Part of the reason why even giant corporations still fail at voice to text is that we're lacking massive amounts of data. The USA sees it as much less of an issue, but anyone else with an English accent knows how terrible voice to text is. Too bad it isn't a global collaborative effort.

candideu commented 1 year ago

@gullabi has built a fork of oTranscribe which merges with Vosk Browser:

https://github.com/oTranscribe/oTranscribe/issues/107#issuecomment-1289004631

Concerning this issue, and as @candideu suggested we have developed our own fork to implement this functionality, and it can be found here.

In a nutshell, we have integrated the vosk-browser functionality into oTranscribe and additionally added an automated timestamp feature. It can transcribe any file loaded from the file system, with a timestamp inserted for each minute. Since vosk-browser works offline, no file is sent anywhere and everything is done with the resources of the local machine.
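For illustration only (this is not the fork's code), per-minute timestamps could be produced from word timings roughly as follows. The sketch assumes the transcript is available as `(start_seconds, word)` pairs; the function name and the `[MM:SS]` marker format are hypothetical.

```python
def with_minute_markers(words, interval=60.0):
    """Join words into text, inserting a [MM:SS] marker per interval.

    words is a list of (start_seconds, word) pairs sorted by time.
    A marker is placed before the first word spoken in each interval;
    intervals containing no speech get no marker.
    """
    parts = []
    next_mark = 0.0
    for start, word in words:
        if start >= next_mark:
            # Snap to the interval this word falls in, skipping silent ones.
            mark = (start // interval) * interval
            minutes, seconds = divmod(int(mark), 60)
            parts.append(f"[{minutes:02d}:{seconds:02d}]")
            next_mark = mark + interval
        parts.append(word)
    return " ".join(parts)


if __name__ == "__main__":
    demo = [(0.5, "hello"), (30.0, "there"), (61.2, "next"), (200.0, "later")]
    print(with_minute_markers(demo))
```

In an editor like oTranscribe, each marker would presumably become a clickable timestamp rather than plain text.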

We keep our repository as a fork since our intention is to introduce the changes to the original repo, if the maintainer is interested.

Finally, we will soon provide a publicly available deployed version of the app and a desktop version. In the meantime, we appreciate any help, QA, or suggestions from the community.

candideu commented 1 year ago

Update: it looks like Subtitle Edit is the answer to my request.

It's FOSS and recently added a speech-to-text function using both Vosk and OpenAI's Whisper.

Here's a tutorial demo: https://youtu.be/InsNe0KjFhg

The downside: it's only available for Windows.

candideu commented 1 year ago

Here's another update: there has been an explosion of free, multilingual speech-to-text tools thanks to Whisper. You can check out the showcase here: https://github.com/openai/whisper/discussions/categories/show-and-tell

open-source-ideas / ideas

AI Scribe: Automated Speech-to-Text Transcriptions and Captions | Inspired by Otter.ai and Sonix.ai #288
