Closed trholding closed 2 months ago
@rafael2k I agree. Meanwhile I have samples
I can't release it yet because it's still in R&D (which basically means the code is embarrassingly hacky). It's a mix of Python and C/C++; I have to fine-tune the audio processing, and I need more audio samples from real-life recordings to do that, then convert everything into a single C/C++ codebase. It's near-time, not yet real-time, on CPU, so I have to work hard on the code rewrite. It will be released as open source.
I will be dropping further work and research on Lyra: their model training is not open/available, the bit rate is higher, and I am not sure whether there are hidden patents or something like that.
My further goal is to reduce the bitrate to 400 bits per second and still have good audio output. I tried this, but surprisingly the result sounded like a mix of Asian, Russian, German and French languages; weird. And you are right: the models need to be retrained, both LPCNet to match the speech type and rates, and the enhancement models as well. Sadly I do not have access to big/GPU compute for long training runs (a week or a month). So an open website sounds good; we also need to collect a list of all the online open audio datasets without weird licenses.
I am very pleased with LPCNet as the core. The encode is fast; the decode is slower, but still much faster than real time. The neural enhancement is the part that lags at less than real time, so real-time speech is not feasible yet, but after the rewrite to C/C++ it will be. I also noted that any clipping in the audio adversely affects the final enhancement: information is lost to clipping in such a way that inference gets the result wrong.
What are your thoughts on Whisper? I am thinking of integrating it as an option so that a transcription could be sent together with the voice, but the transcription would lag behind by a few seconds, up to 10 seconds, and it also means more CPU is burned.
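One way to ship a lagging transcription alongside the voice stream is a simple tagged-frame multiplex, where text packets reference the voice sequence numbers they cover. This is only a sketch of that idea; the frame layout (kind byte, sequence number, length) is entirely hypothetical, not anything from LPCNet or Whisper:

```python
import struct

VOICE, TEXT = 0x01, 0x02

def pack_frame(kind: int, seq: int, payload: bytes) -> bytes:
    """Hypothetical frame layout: 1-byte kind, 4-byte sequence, 2-byte length, payload."""
    return struct.pack(">BIH", kind, seq, len(payload)) + payload

def unpack_frames(stream: bytes):
    """Yield (kind, seq, payload) tuples back out of a concatenated byte stream."""
    off = 0
    while off < len(stream):
        kind, seq, n = struct.unpack_from(">BIH", stream, off)
        off += 7  # header size of ">BIH"
        yield kind, seq, stream[off:off + n]
        off += n

# Voice frames go out immediately; the Whisper transcription for the same
# stretch of audio arrives seconds later as a TEXT frame tagged with the
# first voice sequence number it covers.
stream = b"".join([
    pack_frame(VOICE, 0, b"\x10\x20"),
    pack_frame(VOICE, 1, b"\x30\x40"),
    pack_frame(TEXT, 0, "hello".encode()),
])
frames = list(unpack_frames(stream))
```

Because the text frames are tagged rather than positional, the receiver can display them late without stalling audio playback.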
One thing is for sure: somewhere in the audio pre-processing pipeline I am losing vital information that adversely affects the final output; clipping is being introduced at some stage. I need to figure out where.
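A cheap way to hunt for the stage that introduces clipping is to measure the fraction of full-scale samples after each step, and to apply headroom before any gain-adding stage. A minimal sketch, assuming float PCM normalized to [-1, 1]; the target peak of -1 dBFS is my own choice, not anything from the pipeline above:

```python
def clipped_fraction(samples, full_scale=1.0, tol=1e-4):
    """Fraction of samples at or beyond full scale; flat-topped regions
    are where the enhancement model loses information."""
    return sum(1 for s in samples if abs(s) >= full_scale - tol) / len(samples)

def apply_headroom(samples, target_peak=0.89):  # ~ -1 dBFS
    """Scale down so the peak sits below full scale. Note this cannot
    recover already flat-topped samples; it only prevents the NEXT
    stage from clipping, so it must run before gain is added."""
    peak = max(abs(s) for s in samples)
    if peak <= target_peak or peak == 0.0:
        return list(samples)
    g = target_peak / peak
    return [s * g for s in samples]
```

Running `clipped_fraction` on the buffer between every pair of pipeline stages should point at the stage where the number jumps.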
I also need to find out whether fountain codes are still covered by patents. If not, a fountain code along with an inner RS code could be used for the final stream so that error correction is possible. That will be another program.
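To make the outer-fountain idea concrete, here is a toy systematic fountain code: the source blocks go out as-is, followed by repair packets that are XORs of random subsets, and a peeling decoder recovers the blocks. This is only an illustration of the principle; a real LT/Raptor code derives the subset from a packet seed and uses a soliton degree distribution, and the inner RS code is not shown:

```python
import random

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def fountain_encode(blocks, n_extra, rng=None):
    """Toy systematic fountain: emit every source block as a degree-1
    packet, then n_extra repair packets, each the XOR of a random subset.
    Each packet carries its index set explicitly for simplicity."""
    rng = rng or random.Random(0)
    k = len(blocks)
    packets = [((i,), blocks[i]) for i in range(k)]
    for _ in range(n_extra):
        idxs = tuple(sorted(rng.sample(range(k), rng.randint(1, k))))
        data = blocks[idxs[0]]
        for i in idxs[1:]:
            data = xor_bytes(data, blocks[i])
        packets.append((idxs, data))
    return packets

def fountain_decode(packets, k):
    """Peeling decoder: learn blocks from degree-1 packets, subtract
    them from the remaining packets, and repeat until all k are known."""
    pending = [(set(idxs), bytearray(data)) for idxs, data in packets]
    known = {}
    progress = True
    while progress and len(known) < k:
        progress = False
        for idxs, data in pending:
            for i in idxs & known.keys():
                data[:] = xor_bytes(bytes(data), known[i])
                idxs.discard(i)
            if len(idxs) == 1:
                i = idxs.pop()
                if i not in known:
                    known[i] = bytes(data)
                    progress = True
    return [known.get(i) for i in range(k)]
```

The attraction for a lossy radio/IP link is that any sufficiently large subset of packets lets the receiver rebuild the stream, with no retransmission round trips.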
One enhancement I'm following closely in LPCNet is PLC (Packet Loss Concealment), which may help us in case of lost packets. The PLC features were committed [1] in the past months and they might be useful for the use case we are specifying. I have not tested it yet; I just pulled upstream into my fork.
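For comparison purposes, the baseline any PLC has to beat is simple repeat-and-fade concealment. A minimal sketch of that baseline, assuming frames arrive as lists of float samples with `None` marking a lost packet; LPCNet's actual PLC instead predicts the missing signal with the neural model, which is why it is worth following:

```python
def conceal(frames, decay=0.5):
    """Naive PLC baseline: replace each lost frame (None) with the last
    good frame, attenuated further for every consecutive loss so long
    gaps fade to silence instead of buzzing."""
    out, last, fade = [], None, 1.0
    for f in frames:
        if f is None:
            fade *= decay
            out.append([s * fade for s in last] if last else [])
        else:
            fade, last = 1.0, f
            out.append(list(f))
    return out
```

Anything the neural PLC produces can then be A/B'd against this to check it is actually earning its compute.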
Concerning CPU/GPU access, if you need a machine to use, I can pass you a login for a computer at the university.
I have listened to the new tests, and I agree we should focus on LPCNet first. There will be new ML-based codecs released soon, I believe, as this is a very hot research topic, so I'd leave Lyra aside: it is neither state-of-the-art nor free-software friendly. LPCNet already has a community around it (see, for example, David Rowe's fork [2] and others), very different from Lyra...
Thank you for the conversations. Closing as other options have emerged.
Hi @trholding! Could you share the other options you have?
I got to know recently about: https://github.com/facebookresearch/encodec
@rafael2k encodec was the last one I found with a good bitrate-vs-quality trade-off.
Now today I came across this: https://haoheliu.github.io/SemantiCodec/
https://github.com/haoheliu/SemantiCodec-inference
0.31 kbps sounds pretty good.
We could check whether it is hackable. One trick would be to buffer 2 frames of audio, speed them up to fit one frame, and transmit; at RX, reverse it. There would be 2x latency per frame, but the average bit rate on air/wire would be 0.15 kbps.
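The speed-up trick above can be sketched with the crudest possible resampling: drop every other sample at TX, linearly interpolate back at RX. This is only to show the framing and latency arithmetic; plain decimation like this shifts pitch and aliases, so a real version would use a time-stretcher (e.g. WSOLA) and proper filtering before the codec:

```python
def tx_pack(frame_a, frame_b):
    """Naive 2x time compression: keep every other sample of the
    concatenated pair, so two frames fit one frame slot on air."""
    both = list(frame_a) + list(frame_b)
    return both[::2]

def rx_unpack(packed):
    """Reverse: linear interpolation back to double length, then split
    into the two original frame slots."""
    out = []
    for i, s in enumerate(packed):
        out.append(s)
        nxt = packed[i + 1] if i + 1 < len(packed) else s
        out.append((s + nxt) / 2)
    half = len(out) // 2
    return out[:half], out[half:]
```

The bitrate halving is real, but note the receiver can only start once both frames are in, which is where the 2x latency comes from.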
If you find anything better, let me know. I am interested in participating in porting a good codec like this to portable C if there is interest.
But what I am really after is like the holy grail of codecs, something like 100 bytes per second.
I have been dabbling with speech-to-text and text-to-speech with voice cloning; all of that was slow and a dead end.
A lot has changed with regard to neural codecs since we first started this conversation.
I think it is just a matter of time until the 100 bps mark for voice is hit.
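For a sense of scale, the bit budgets being discussed are tiny. A quick back-of-the-envelope in Python (the 40 ms frame size is just an assumed example, not any particular codec's):

```python
def bits_per_frame(bitrate_bps, frame_ms):
    """Bit budget available for one codec frame at a given rate."""
    return bitrate_bps * frame_ms / 1000

def bytes_per_second(bitrate_bps):
    return bitrate_bps / 8

# SemantiCodec's lowest advertised rate, 0.31 kbps:
low = bytes_per_second(310)          # about 38.75 bytes of payload per second
# at a 100 bits/s target, a 40 ms frame would carry only 4 bits:
tight = bits_per_frame(100, 40)
```

At that scale the codec is effectively transmitting compressed semantic tokens rather than waveform detail, which is why quality hinges entirely on the decoder model.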
The only problem is that the research teams that develop them have to release something under an MIT/BSD license, not a restrictive research license. The other problem is that the projects are usually in Python, with Torch and gigabytes of dependency baggage. So if we find a suitable codec that we can hack to reduce bitrates further, we could consider porting it to portable C.
I envision a 1-10 MB executable with an inbuilt model that does encoding and decoding of speech in real time.
It should be built the Unix way, so we can pipe it into other programs.
Also it should be running reasonably fast (real time) on embedded systems or resource constrained devices.
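The Unix-filter shape of such a tool can be sketched in a few lines: raw PCM on stdin, coded stream on stdout, fixed-size frames in between. Everything here is an assumption for illustration (the 640-byte frame, i.e. 320 samples of 16-bit PCM, and the pass-through `process` standing in for the real model):

```python
import sys

FRAME_BYTES = 640  # hypothetical: 320 samples of 16-bit mono PCM per frame

def process(frame: bytes) -> bytes:
    """Placeholder for the real encode/decode step; passes audio through."""
    return frame

def run_filter(stdin, stdout):
    """Unix-filter loop: read fixed-size PCM frames, write the coded
    stream, flush per frame so downstream pipes see it in real time."""
    while True:
        frame = stdin.read(FRAME_BYTES)
        if not frame:
            break
        stdout.write(process(frame))
        stdout.flush()

# to run as an actual filter:
#   run_filter(sys.stdin.buffer, sys.stdout.buffer)
```

Usage would then look like (flags shown are standard `arecord` options, the codec CLI itself is hypothetical): `arecord -f S16_LE -r 16000 -c 1 | ./codec --encode | ...`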
An unrelated example is whisper.cpp; it runs faster than real time. I believe a codec would not need that much compute.
This SemantiCodec is really something!! Yay! I'll do some tests and see how far we are from real time on a Raspberry Pi or similar hardware.
Btw, concerning the C port, I agree. Just like what Fabrice did with TSAC: https://bellard.org/tsac/
I was about to post that here!
https://bellard.org/tsac/ https://github.com/descriptinc/descript-audio-codec
Indeed. Concerning old military vocoder codecs, they have communication quality and are definitely a baseline for any comparison.
This is really a breakthrough! Written in Rust, I think it can work in real time! I'll test here.
I'd say, as a second step, I would retrain the LPCNet NN, as that is easy to do given the wide availability of documentation: https://github.com/xiph/LPCNet
Of course... the first step is a dataset with the relevant samples. Maybe an open website for submission of samples?