Closed trholding closed 2 months ago
@rafael2k I agree. Meanwhile I have samples
I can't release it yet because it's still in R&D (which basically means the code is embarrassingly hacky). It's a mix of Python and C/C++; I have to fine-tune the audio processing, and I need more audio samples from real-life recordings to do that, then convert everything into a single C/C++ codebase. It's near-time, not yet real-time, on CPU, so I have to work hard on the code rewrite. It will be released as open source.
I will be dropping further work and research on Lyra: their model training is not open/available, the bit rate is higher, and I am not sure whether there are hidden patents or something like that.
My further goal is to reduce the bitrate to 400 bits per second and still have good audio output. I tried this, but surprisingly the result sounded like a mix of Asian, Russian, German and French languages; weird. And you are right: the models need to be retrained, both LPCNet to match the speech type and rates, and the enhancement models as well. Sadly I do not have access to big/GPU compute for long training runs (a week or a month). So an open website sounds good; we also need to collect a list of all the online open audio datasets without weird licenses.
I am very pleased with LPCNet as the core. The encode is fast; the decode is slower, but still much faster than real time. The neural enhancement is the part that lags at less than real time, so real-time speech is not feasible yet, but after the rewrite to C/C++ it will be. I also noted that any clipping in the audio adversely affects the final enhancement: information is lost to clipping in such a way that inference gets the result wrong.
What are your thoughts on Whisper? I am thinking of integrating it as an option so that a transcription could be sent together with the voice, but the transcription would lag behind by a few seconds, up to 10 seconds, and it also means more CPU is burned.
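One way to ship a lagging transcription alongside the voice stream is a simple tagged-frame multiplex, where text packets reference the voice sequence numbers they cover. This is only a sketch of that idea; the frame layout (kind byte, sequence number, length) is entirely hypothetical, not anything from LPCNet or Whisper:

```python
import struct

VOICE, TEXT = 0x01, 0x02

def pack_frame(kind: int, seq: int, payload: bytes) -> bytes:
    """Hypothetical frame layout: 1-byte kind, 4-byte sequence, 2-byte length, payload."""
    return struct.pack(">BIH", kind, seq, len(payload)) + payload

def unpack_frames(stream: bytes):
    """Yield (kind, seq, payload) tuples back out of a concatenated byte stream."""
    off = 0
    while off < len(stream):
        kind, seq, n = struct.unpack_from(">BIH", stream, off)
        off += 7  # header size of ">BIH"
        yield kind, seq, stream[off:off + n]
        off += n

# Voice frames go out immediately; the Whisper transcription for the same
# stretch of audio arrives seconds later as a TEXT frame tagged with the
# first voice sequence number it covers.
stream = b"".join([
    pack_frame(VOICE, 0, b"\x10\x20"),
    pack_frame(VOICE, 1, b"\x30\x40"),
    pack_frame(TEXT, 0, "hello".encode()),
])
frames = list(unpack_frames(stream))
```

Because the text frames are tagged rather than positional, the receiver can display them late without stalling audio playback.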
One thing is for sure: somewhere in the audio pre-processing pipeline I am losing vital information that adversely affects the final output; clipping is being introduced at some stage. I need to figure out where.
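A cheap way to hunt for the stage that introduces clipping is to measure the fraction of full-scale samples after each step, and to apply headroom before any gain-adding stage. A minimal sketch, assuming float PCM normalized to [-1, 1]; the target peak of -1 dBFS is my own choice, not anything from the pipeline above:

```python
def clipped_fraction(samples, full_scale=1.0, tol=1e-4):
    """Fraction of samples at or beyond full scale; flat-topped regions
    are where the enhancement model loses information."""
    return sum(1 for s in samples if abs(s) >= full_scale - tol) / len(samples)

def apply_headroom(samples, target_peak=0.89):  # ~ -1 dBFS
    """Scale down so the peak sits below full scale. Note this cannot
    recover already flat-topped samples; it only prevents the NEXT
    stage from clipping, so it must run before gain is added."""
    peak = max(abs(s) for s in samples)
    if peak <= target_peak or peak == 0.0:
        return list(samples)
    g = target_peak / peak
    return [s * g for s in samples]
```

Running `clipped_fraction` on the buffer between every pair of pipeline stages should point at the stage where the number jumps.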
I also need to find out whether fountain codes are still covered by patents. If not, a fountain code along with an inner RS code could be used for the final stream so that error correction is possible. That will be another program.
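To make the outer-fountain idea concrete, here is a toy systematic fountain code: the source blocks go out as-is, followed by repair packets that are XORs of random subsets, and a peeling decoder recovers the blocks. This is only an illustration of the principle; a real LT/Raptor code derives the subset from a packet seed and uses a soliton degree distribution, and the inner RS code is not shown:

```python
import random

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def fountain_encode(blocks, n_extra, rng=None):
    """Toy systematic fountain: emit every source block as a degree-1
    packet, then n_extra repair packets, each the XOR of a random subset.
    Each packet carries its index set explicitly for simplicity."""
    rng = rng or random.Random(0)
    k = len(blocks)
    packets = [((i,), blocks[i]) for i in range(k)]
    for _ in range(n_extra):
        idxs = tuple(sorted(rng.sample(range(k), rng.randint(1, k))))
        data = blocks[idxs[0]]
        for i in idxs[1:]:
            data = xor_bytes(data, blocks[i])
        packets.append((idxs, data))
    return packets

def fountain_decode(packets, k):
    """Peeling decoder: learn blocks from degree-1 packets, subtract
    them from the remaining packets, and repeat until all k are known."""
    pending = [(set(idxs), bytearray(data)) for idxs, data in packets]
    known = {}
    progress = True
    while progress and len(known) < k:
        progress = False
        for idxs, data in pending:
            for i in idxs & known.keys():
                data[:] = xor_bytes(bytes(data), known[i])
                idxs.discard(i)
            if len(idxs) == 1:
                i = idxs.pop()
                if i not in known:
                    known[i] = bytes(data)
                    progress = True
    return [known.get(i) for i in range(k)]
```

The attraction for a lossy radio/IP link is that any sufficiently large subset of packets lets the receiver rebuild the stream, with no retransmission round trips.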
One enhancement I'm following closely in LPCNet is PLC (Packet Loss Concealment), which may help us in case of lost packets. The PLC features were committed [1] in the past months and they might be useful for the use case we are specifying. I have not tested it yet; I just pulled upstream into my fork.
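For comparison purposes, the baseline any PLC has to beat is simple repeat-and-fade concealment. A minimal sketch of that baseline, assuming frames arrive as lists of float samples with `None` marking a lost packet; LPCNet's actual PLC instead predicts the missing signal with the neural model, which is why it is worth following:

```python
def conceal(frames, decay=0.5):
    """Naive PLC baseline: replace each lost frame (None) with the last
    good frame, attenuated further for every consecutive loss so long
    gaps fade to silence instead of buzzing."""
    out, last, fade = [], None, 1.0
    for f in frames:
        if f is None:
            fade *= decay
            out.append([s * fade for s in last] if last else [])
        else:
            fade, last = 1.0, f
            out.append(list(f))
    return out
```

Anything the neural PLC produces can then be A/B'd against this to check it is actually earning its compute.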
Concerning CPU/GPU access, if you need a machine to use, I can pass you a login for a computer at the university.
I have listened to the new tests, and I agree we should focus on LPCNet first. There will be new ML-based codecs released soon, I believe, as this is a very hot research topic, so I'd leave Lyra aside: it is neither state-of-the-art nor free-software friendly. LPCNet already has a community around it (see, for example, David Rowe's fork [2] and others), very different from Lyra...
Thank you for the conversations. Closing as other options have emerged.
Hi @trholding! Could you share the other options you have?
I got to know recently about: https://github.com/facebookresearch/encodec
@rafael2k encodec was the last one I found with a good bitrate-vs-quality trade-off.
Now today I came across this: https://haoheliu.github.io/SemantiCodec/
https://github.com/haoheliu/SemantiCodec-inference
0.31 kbps sounds pretty good.
We could check whether it is hackable. One trick would be to buffer 2 frames of audio, speed them up to fit one frame, and transmit; at RX, reverse it. There would be 2x latency per frame, but the average bit rate on air/wire would be 0.15 kbps.
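The speed-up trick above can be sketched with the crudest possible resampling: drop every other sample at TX, linearly interpolate back at RX. This is only to show the framing and latency arithmetic; plain decimation like this shifts pitch and aliases, so a real version would use a time-stretcher (e.g. WSOLA) and proper filtering before the codec:

```python
def tx_pack(frame_a, frame_b):
    """Naive 2x time compression: keep every other sample of the
    concatenated pair, so two frames fit one frame slot on air."""
    both = list(frame_a) + list(frame_b)
    return both[::2]

def rx_unpack(packed):
    """Reverse: linear interpolation back to double length, then split
    into the two original frame slots."""
    out = []
    for i, s in enumerate(packed):
        out.append(s)
        nxt = packed[i + 1] if i + 1 < len(packed) else s
        out.append((s + nxt) / 2)
    half = len(out) // 2
    return out[:half], out[half:]
```

The bitrate halving is real, but note the receiver can only start once both frames are in, which is where the 2x latency comes from.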
If you find anything better, let me know. I am interested in participating in porting a good codec like this to portable C if there is interest.
But what I am really after is like the holy grail of codecs, something like 100 bytes per second.
I have been dabbling with speech-to-text and text-to-speech with voice cloning; all of that was slow and a dead end.
A lot has changed with regard to neural codecs since we first started this conversation.
I think it is just a matter of time until the 100 bps mark for voice is hit.
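For a sense of scale, the bit budgets being discussed are tiny. A quick back-of-the-envelope in Python (the 40 ms frame size is just an assumed example, not any particular codec's):

```python
def bits_per_frame(bitrate_bps, frame_ms):
    """Bit budget available for one codec frame at a given rate."""
    return bitrate_bps * frame_ms / 1000

def bytes_per_second(bitrate_bps):
    return bitrate_bps / 8

# SemantiCodec's lowest advertised rate, 0.31 kbps:
low = bytes_per_second(310)          # about 38.75 bytes of payload per second
# at a 100 bits/s target, a 40 ms frame would carry only 4 bits:
tight = bits_per_frame(100, 40)
```

At that scale the codec is effectively transmitting compressed semantic tokens rather than waveform detail, which is why quality hinges entirely on the decoder model.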
The only problem is that the research teams that develop them have to release something under an MIT/BSD license, not a restrictive research license. The other problem is that the projects are usually in Python, with Torch and gigabytes of dependency baggage. So if we find a suitable codec that we can hack to reduce bitrates further, we could consider porting it to portable C.
I envision a 1-10 MB executable with an inbuilt model that does encoding and decoding of speech in real time.
It should be built the Unix way, so we can pipe it into other programs.
Also it should be running reasonably fast (real time) on embedded systems or resource constrained devices.
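The Unix-filter shape of such a tool can be sketched in a few lines: raw PCM on stdin, coded stream on stdout, fixed-size frames in between. Everything here is an assumption for illustration (the 640-byte frame, i.e. 320 samples of 16-bit PCM, and the pass-through `process` standing in for the real model):

```python
import sys

FRAME_BYTES = 640  # hypothetical: 320 samples of 16-bit mono PCM per frame

def process(frame: bytes) -> bytes:
    """Placeholder for the real encode/decode step; passes audio through."""
    return frame

def run_filter(stdin, stdout):
    """Unix-filter loop: read fixed-size PCM frames, write the coded
    stream, flush per frame so downstream pipes see it in real time."""
    while True:
        frame = stdin.read(FRAME_BYTES)
        if not frame:
            break
        stdout.write(process(frame))
        stdout.flush()

# to run as an actual filter:
#   run_filter(sys.stdin.buffer, sys.stdout.buffer)
```

Usage would then look like (flags shown are standard `arecord` options, the codec CLI itself is hypothetical): `arecord -f S16_LE -r 16000 -c 1 | ./codec --encode | ...`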
An unrelated example is whisper.cpp; it runs faster than real time. I believe a codec would not need that much compute.
This SemantiCodec is really something!! Yay! I'll do some tests and see how far we are from real time on a Raspberry Pi or similar hardware.
Btw, concerning the C port, I agree. Just like what Fabrice did with TSAC: https://bellard.org/tsac/
I was about to post that here!
https://bellard.org/tsac/ https://github.com/descriptinc/descript-audio-codec
Indeed. Concerning old military vocoder codecs, they have communication quality and are definitely a baseline for any comparison.
This is really a breakthrough! Written in Rust, I think it can work in real time! I'll test here.
I'd say, as a second step, I would retrain the LPCNet NN, as that is easy to do given the wide availability of documentation: https://github.com/xiph/LPCNet
Of course... the first step is a dataset with the relevant samples. Maybe an open website for submission of samples?