google / lyra

A Very Low-Bitrate Codec for Speech Compression
Apache License 2.0

Android example - magic numbers #70

Open bkekelic2 opened 3 years ago

bkekelic2 commented 3 years ago

Hi, I'm currently creating a PTT app that uses the encode and decode API in real time. After implementing the enc/dec methods via JNI, I can hear that the decoded voice is slightly distorted, as if someone were speaking at 0.9x speed (a slower sound). While creating the app, I have also been looking into your example app, and there I can see a lot of "magic" numbers: numbers without any explanation or comment. It would be great if you could explain why you chose those numbers for the microphone and speaker buffers, so that I can follow the same principles and change my app accordingly. I can only presume that they "best fitted".

I'm particularly curious about the following numbers:

  1. AudioTrack buffer size
  2. AudioRecord buffer size
  3. micData length

Thanks in advance.

aluebs commented 3 years ago

I am sorry to hear that you are getting distorted audio. If you share some more details (for example, does the audio catch up eventually? Are there clicks?) or a recording, we can maybe give you some direction to look into.

Hopefully @mchinen can give some more color to these numbers, but my guesses are:

  1. The AudioTrack (player) is just the same size as the micData, to be able to play out the whole recorded audio. The factor of 2 is just because setBufferSizeInBytes assumes the size in bytes and each short (int16) sample in micData is composed of 2 bytes.
  2. The AudioRecord (record) just needs a big enough buffer to read the samples from the mic.
  3. micData is set arbitrarily to be 5 seconds long, so 5 * SAMPLE_RATE (samples per second), plus an additional chunkSize as padding.
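
As a rough sketch of the arithmetic in the three points above (not the example app's actual code), the sizes could be computed as follows. The 16 kHz sample rate and the 640-sample chunk size are assumptions chosen for illustration, and the class and method names are hypothetical. On a device, the resulting byte count is what would be passed to setBufferSizeInBytes.

```java
// Illustrative only: sample rate and chunk size are assumed values,
// and these helper names do not come from the Lyra example app.
final class LyraBufferSizes {
    // Each int16 (short) sample occupies 2 bytes.
    static final int BYTES_PER_SAMPLE = 2;

    // micData length in samples: N seconds of audio plus one chunk of padding.
    static int micDataLength(int sampleRateHz, int seconds, int chunkSizeSamples) {
        return seconds * sampleRateHz + chunkSizeSamples;
    }

    // Playback buffer size in bytes, large enough to hold all of micData.
    static int playbackBufferBytes(int micDataLengthSamples) {
        return micDataLengthSamples * BYTES_PER_SAMPLE;
    }

    public static void main(String[] args) {
        int sampleRateHz = 16000;   // assumption: 16 kHz mono capture
        int chunkSizeSamples = 640; // assumption: one 40 ms chunk at 16 kHz

        int micSamples = micDataLength(sampleRateHz, 5, chunkSizeSamples);
        short[] micData = new short[micSamples];

        System.out.println("micData length (samples): " + micData.length);
        System.out.println("AudioTrack buffer (bytes): " + playbackBufferBytes(micData.length));
    }
}
```

The recorder side has no fixed formula here; it only needs a buffer that is large enough for the device to deliver mic samples without overruns.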

Please let me know if you have any additional questions.

bkekelic2 commented 3 years ago

Hi, thanks for your feedback. It really helped me understand what is going on in the background.

Regarding the distorted audio, here are my current findings. In my scenario, one device sends encoded audio packets over the network via UDP, and the other side receives them, decodes them, and plays them on the speaker. I think everything is OK with encoding (it takes ca. 1 ms to execute), unlike decoding, which takes ca. 55 ms per packet. As Lyra works with 40 ms of data from the mic, the processing time per packet should be less than or equal to 40 ms for the system to be real-time. This means that each packet the receiver gets adds 15 ms of extra delay before it is played on the speaker. From my calculations and logs I can see that I'm able to process 18 packets/s, which adds 270 ms of delay each second. The longer I record from the mic, the longer it takes the other side to process the output. I tested on 2 mid-range devices, a Motorola and a Xiaomi, and both gave pretty much the same results (the problematic delay is in the decode function, specifically in generating samples).
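
The real-time budget described above can be checked with a small calculation. This is only a sketch under the figures quoted in this comment (40 ms of audio per packet, about 55 ms to decode one packet); the class and method names are hypothetical, and it assumes decoding is the only bottleneck.

```java
// Illustrative only: models a decoder that needs decodeMs of wall-clock
// time to produce frameMs of audio, with no other delays in the pipeline.
final class RealTimeBudget {
    // Packets the decoder can finish per second of wall-clock time.
    static double packetsPerSecond(double decodeMs) {
        return 1000.0 / decodeMs;
    }

    // Values above 1.0 mean the decoder is slower than real time.
    static double realTimeFactor(double decodeMs, double frameMs) {
        return decodeMs / frameMs;
    }

    // How far playback has fallen behind after wallClockMs of running.
    static double lagMs(double wallClockMs, double decodeMs, double frameMs) {
        double audioProducedMs = wallClockMs * frameMs / decodeMs;
        return wallClockMs - audioProducedMs;
    }

    public static void main(String[] args) {
        double frameMs = 40.0;  // audio covered by one packet (from the thread)
        double decodeMs = 55.0; // measured decode time per packet (from the thread)

        System.out.printf("packets/s: %.2f%n", packetsPerSecond(decodeMs));
        System.out.printf("real-time factor: %.3f%n", realTimeFactor(decodeMs, frameMs));
        System.out.printf("lag after 1 s: %.1f ms%n", lagMs(1000.0, decodeMs, frameMs));
    }
}
```

With these inputs the decoder finishes roughly 18 packets per second and falls behind by roughly 270 ms for every second of running, which matches the growing delay reported here.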

So my question is: on which devices can Lyra perform real-time operations? Thanks.

aluebs commented 3 years ago

Yes, the generative model in the Lyra decoder is the heaviest component by far, and if it runs slower than real-time, it won't ever sound good. I don't have an exhaustive list, but I know we had successful Lyra calls on all Pixel devices and Samsung Galaxy phones. That said, we are aware that complexity is a major pain-point and are working on a better and, most importantly, less complex SoundStream engine. Stay tuned for updates on that front.

zonesys commented 2 years ago

Hi @bkekelic2: I am also building a PTT app and am evaluating Lyra. Can you please give me your feedback so far, and say whether it is worth working with it at this point on mid-range to low-end devices? Thanks.

bkekelic2 commented 2 years ago

Hi @zonesys, as I said, we tested on a few different mid-range phones and the outcome was always pretty much the same: the audio was stretched. To visualize that stretching, imagine that at the 5th second of a recording you are hearing the 4th second, at the 10th second you hear the 8th second, and so on; the delay just keeps growing and the sound is distorted and stretched. So that didn't fit our requirements. The good news is that we also tested Lyra on a few Samsung devices and the outcome was pretty good: there was no delay at all. So it just depends on the target device and its capabilities. Hope this helps.

Edit: you can also check this talk: https://youtu.be/7CCGTwmGl6M?t=835

zonesys commented 2 years ago

Thanks @bkekelic2. I guess we have to wait for some time until there is a new solution for lower-tier devices. Thanks for the advice above, and all the best.

aluebs commented 2 years ago

The new Lyra 1.2.0 release is about 5 times less complex than the previous version, if you want to try it out.