m1k1o / neko

A self-hosted virtual browser that runs in Docker and uses WebRTC.
https://neko.m1k1o.net/
Apache License 2.0

Live Transcription feature #218

Open alanmilinovic opened 1 year ago

alanmilinovic commented 1 year ago

It would be cool to have live transcription feature, similar to Microsoft Teams. You could have live lyrics for example when playing music via youtube in browser and have fun with your friends.

m1k1o commented 1 year ago

Maybe as an integration with an existing service that needs an API key. Or maybe there are open-source solutions for this.

alanmilinovic commented 1 year ago

> Maybe as an integration with an existing service that needs an API key. Or maybe there are open-source solutions for this.

This is a great project!

https://github.com/openai/whisper

alanmilinovic commented 1 year ago

@m1k1o did you find some time to check the link I provided in my last comment? Could it potentially be accepted as a PR?

m1k1o commented 1 year ago

Yes. Sure, as a plug-in. Do you want to create a PR? I can help you with it. Just create a working PoC and I can clean it up and integrate it.

alanmilinovic commented 1 year ago

Sounds good, will do my best!

alanmilinovic commented 1 year ago

Can you give me some guidance? When you say plugin, I guess it should be added to the client part?

m1k1o commented 1 year ago

I would say it needs to be in the server as well. Don't let the plugin part confuse you; it can be hardcoded for the PoC. I meant that it should be a self-contained piece of code that can be turned on/off if needed.

Start with getting the audio from neko. It could be either:

The last option should be the easiest. You want to see if that software:

And then having this feature would only be a matter of starting and stopping it. Displaying the text in the GUI can be done in a second step; first we only want to get the text in any form.

Hope this helps!
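The capture step described above could be sketched roughly like this (a minimal sketch, not neko code: the helper name is mine, and I'm assuming the PulseAudio monitor source is called auto_null.monitor, as in neko's gstreamer pipeline):

```python
import subprocess

def ffmpeg_chunk_cmd(source="auto_null.monitor", seconds=10, out="chunk_%03d.wav"):
    """Build an ffmpeg command that records the PulseAudio monitor source
    into fixed-length WAV segments (16 kHz mono suits Whisper's input)."""
    return [
        "ffmpeg",
        "-f", "pulse", "-i", source,        # read from the pulseaudio monitor
        "-ac", "1", "-ar", "16000",         # mono, 16 kHz
        "-f", "segment",                    # split output into chunks
        "-segment_time", str(seconds),
        out,
    ]

# Inside the container you would then run:
#   subprocess.run(ffmpeg_chunk_cmd())
# and feed each finished chunk_NNN.wav file to the transcriber.
```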

alanmilinovic commented 1 year ago

So far I managed to add Whisper to the google-chrome server image, where I am testing it.

There is a way to use Whisper from the command line or Python. I am just not sure how to get the pulseaudio output. There are a lot of examples on the Whisper GitHub, but they are too complicated for me.
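For what it's worth, once you have an audio file, Whisper's Python side is only a couple of calls (a sketch, assuming `pip install openai-whisper` in the image; the chunk filename is hypothetical):

```python
def transcribe_chunk(path: str, model_name: str = "base") -> str:
    """Transcribe one audio file with Whisper and return the plain text."""
    import whisper  # imported lazily; downloads model weights on first use

    model = whisper.load_model(model_name)
    result = model.transcribe(path)
    return result["text"].strip()

def join_segments(segments) -> str:
    """Concatenate Whisper segment dicts ({'text': ...}) into one line,
    useful when emitting partial transcripts as they arrive."""
    return " ".join(s["text"].strip() for s in segments if s["text"].strip())

# e.g. transcribe_chunk("chunk_000.wav")
```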

m1k1o commented 1 year ago

That's a good start!

I only found examples with files. Can it work with live sources? Processing it like a pipeline?

I see ffmpeg is being used under the hood. That could capture pulseaudio.

alanmilinovic commented 1 year ago

This is what I am reading at the moment, might help maybe.

https://github.com/openai/whisper/discussions/2

alanmilinovic commented 1 year ago

> That's a good start!
>
> I only found examples with files. Can it work with live sources? Processing it like a pipeline?
>
> I see ffmpeg is being used under the hood. That could capture pulseaudio.

Do you know how to get the input device name for pulseaudio? I am not getting a list inside the container when I try to list them.

alanmilinovic commented 1 year ago

I will leave the issue open; maybe someone else can jump in. From what I learned it should all be possible, but my coding knowledge is too weak to finish it. There are multiple ways for sure, and it all looks feasible.

m1k1o commented 1 year ago

For gstreamer we use auto_null.monitor; it should work for ffmpeg as well, I'd say.

alanmilinovic commented 1 year ago

> auto_null.monitor

I am getting auto_null.monitor: No such process when I run ffmpeg -f pulse -i auto_null.monitor out.mp3 inside the container. Maybe pulseaudio is not running, or I am doing it in the wrong place?

I also get the message No PulseAudio daemon running, or not running as session daemon. when running a simple pacmd command.
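That pacmd error usually just means the client cannot find the daemon's socket. A hedged sketch of what a client looks for (the paths are common PulseAudio defaults, not neko specifics; PULSE_SERVER and XDG_RUNTIME_DIR are the standard client environment variables):

```python
import os

def pulse_server_candidates(env=None):
    """Return the socket locations a PulseAudio client would try, in order."""
    env = os.environ if env is None else env
    candidates = []
    if env.get("PULSE_SERVER"):                 # explicit override wins
        candidates.append(env["PULSE_SERVER"])
    runtime = env.get("XDG_RUNTIME_DIR")
    if runtime:                                 # per-user session daemon socket
        candidates.append(os.path.join(runtime, "pulse", "native"))
    candidates.append("/run/pulse/native")      # common system-wide daemon socket
    return candidates

# If none of these paths exist inside the container, the daemon is not
# running where the client looks, and both pacmd and ffmpeg will fail.
```

So it is worth checking which user the daemon runs as and whether PULSE_SERVER is set in the shell where you run ffmpeg.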