In what configuration is the Soundstream in Lyra V2 trained?

google / lyra

A Very Low-Bitrate Codec for Speech Compression

Apache License 2.0


In what configuration is the Soundstream in Lyra V2 trained? #102

Open yd11111 opened 1 year ago

yd11111 commented 1 year ago

According to the original SoundStream paper, SoundStream is trained on 24 kHz data. I would like to know what sample rate the models released in Lyra V2 (soundstream_encoder.tflite, quantizer.tflite, lyragan.tflite) were trained on. Can these models also process 24 kHz wavs? Could they be used on 24 kHz wavs for experiments similar to AudioLM, another Google work?

The released models seem to process 16 kHz wavs. However, line 48 of lyra_encoder.h lists the supported sample rates as not only 16000 but also 8000, 32000, and 48000, which confuses me. A different sample rate means the fixed 320 samples cover a different time span, so I'm not sure this fixed soundstream_encoder can directly handle data at different sample rates: with 46 4-bit quantizers, the encoded data would no longer match the supported bitrate (9.2 kbps) mentioned in the API doc. I actually used the three released models to encode, quantize, and decode a 16 kHz and a 24 kHz wav with the same content, and the two decoded waves sound the same. With so few test examples, I am not sure about the reconstruction quality. Can anyone explain this? Many thanks.
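To make the concern above concrete, here is a rough sketch of the arithmetic, using only the figures quoted in the question (320 samples per frame, 46 quantizers of 4 bits each). The function name `frame_stats` is made up for illustration and is not part of Lyra.

```python
# Figures taken from the question above; nothing here calls Lyra itself.
SAMPLES_PER_FRAME = 320   # fixed number of samples the encoder consumes per frame
NUM_QUANTIZERS = 46       # quantizers per frame, as quoted in the question
BITS_PER_QUANTIZER = 4    # bits per quantizer, as quoted in the question


def frame_stats(sample_rate_hz):
    """Return (frame duration in ms, frames per second, bitrate in bps)
    if 320-sample frames were fed to the encoder at the given sample rate."""
    frame_ms = 1000.0 * SAMPLES_PER_FRAME / sample_rate_hz
    frames_per_second = sample_rate_hz / SAMPLES_PER_FRAME
    bitrate_bps = frames_per_second * NUM_QUANTIZERS * BITS_PER_QUANTIZER
    return frame_ms, frames_per_second, bitrate_bps


for rate in (16000, 24000):
    frame_ms, fps, bps = frame_stats(rate)
    print(f"{rate} Hz: {frame_ms:.2f} ms/frame, {fps:.0f} frames/s, {bps / 1000:.1f} kbps")
```

At 16 kHz the frame is 20 ms and the rate works out to the documented 9.2 kbps. At 24 kHz the same 320 samples cover about 13.33 ms, so unresampled 24 kHz input would produce 75 frames per second and 13.8 kbps, which is not one of the supported bitrates.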

pinilpypinilpy commented 1 year ago

The encoder automatically resamples the input to 16 kHz by default, and for some reason the decoder does the same to the output. I modified it to work with 24 kHz and got different results than with 16 kHz. Maybe because of the way I did it, the bitrate increased as well.

aluebs commented 1 year ago

You're right, the TFLite models only support 16 kHz. The Lyra API supports 8 kHz, 16 kHz, 32 kHz and 48 kHz, resampling to 16 kHz at the encoder and from 16 kHz at the decoder if needed. The desired bitrate can be set completely independently of the sample rate. The supported bitrates are 3.2 kbps, 6 kbps and 9.2 kbps.
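A small sketch of why the bitrate is independent of the API sample rate. It assumes the models always see 16 kHz audio in 320-sample frames (so 50 frames per second), as described in this thread. The helper `plan_frame` and its dictionary keys are hypothetical and are not Lyra's API.

```python
INTERNAL_RATE_HZ = 16000    # rate the TFLite models operate at, per the reply above
FRAMES_PER_SECOND = 50      # assumption: 320 samples per frame at 16 kHz
SUPPORTED_RATES_HZ = (8000, 16000, 32000, 48000)
SUPPORTED_BITRATES_BPS = (3200, 6000, 9200)


def plan_frame(sample_rate_hz, bitrate_bps):
    """Hypothetical helper: per-frame sizes for an API sample rate and bitrate."""
    if sample_rate_hz not in SUPPORTED_RATES_HZ:
        raise ValueError(f"unsupported sample rate: {sample_rate_hz}")
    if bitrate_bps not in SUPPORTED_BITRATES_BPS:
        raise ValueError(f"unsupported bitrate: {bitrate_bps}")
    return {
        # samples the caller supplies per frame, before resampling
        "external_samples_per_frame": sample_rate_hz // FRAMES_PER_SECOND,
        # samples the model sees per frame, after resampling to 16 kHz
        "internal_samples_per_frame": INTERNAL_RATE_HZ // FRAMES_PER_SECOND,
        # encoded payload per frame; depends only on the chosen bitrate
        "bits_per_frame": bitrate_bps // FRAMES_PER_SECOND,
    }


print(plan_frame(48000, 3200))
print(plan_frame(8000, 9200))
```

Under these assumptions, only the first value changes with the input sample rate. The model input stays at 320 samples per frame, and the payload (64, 120 or 184 bits per frame) follows from the bitrate alone.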

yd11111 commented 1 year ago

Thanks for your answer, I've figured it out now.

berserker1 commented 1 year ago

@aluebs but is it possible for Lyra to support stereo audio as well?

pinilpypinilpy commented 1 year ago

It is; I think you can set the number of channels in lyra_config.cc. Doing so doubles the file size (and encoding time?).

berserker1 commented 1 year ago

@pinilpypinilpy I was able to find the required variable, kNumChannels; its value is set to 1 in lyra_config.cc.

However, in lyra_config.h these values are declared as extern int in the codec namespace, with the following comment:

This file is reserved for non-configurable values needed by both the decoder
and encoder. What those non-configurable values are depends on which project
is chosen to be compiled.  As a result, a struct holding the configuration
data is defined to ensure each new target added and each new configuration
element is explicitly defined.

So I am not sure whether one should change the parameters in that file directly. There are other parameters as well (kNumFeatures, kNumMelBins, kFrameRate, kOverlapFactor); I assume those do not need to be changed, right?

aluebs commented 1 year ago

At this time, Lyra doesn't support stereo.

pinilpypinilpy commented 1 year ago

@berserker1 you only need to change the other values if you're using a different sampling rate. If you change kNumChannels to 2 and recompile, your input file will have to be stereo, and the decoded file will also be.

It isn't technically supported though. If you want to play around, I forked Lyra and added support for other sampling rates and bitrate presets, as well as stereo without needing to recompile: https://github.com/pinilpypinilpy/lyra-variable

However, the devs disabled those things for a reason, so YMMV

berserker1 commented 1 year ago

@pinilpypinilpy Yes, I did exactly what you said: changing the variable and inputting a stereo file worked fine (it encoded and decoded smoothly). Thanks for sharing your forked repo!

As a novice, I do not quite understand why this small feature is missing and why it is disabled.