google / lyra

A Very Low-Bitrate Codec for Speech Compression
Apache License 2.0

Noise is being added to generated speech in Python E2E flow (TFLite Models) #135

Open barrylee111 opened 8 months ago

barrylee111 commented 8 months ago

Description

I am currently working on a Unity project that modulates voices (e.g. source speech → voice modulator → target speech (elf)). I have an E2E flow working with the TFLite models, but a noticeable amount of noise is being added during speech generation; it sounds almost like clipping. I'm using the TFLite models from the repo, with the quantizer split into a QuantizerEncoder and a QuantizerDecoder. I'm not sure whether a better solution would be to compile Lyra into a DLL and run that in Unity instead of the models, but this is what I have so far.

E2E Flow

soundstream_encoder.tflite → quantizer_encoder.tflite → quantizer_decoder.tflite → lyragan.tflite

Code

!wget -q https://huggingface.co/rocca/lyra-v2-soundstream/resolve/main/tflite/1.3.0/soundstream_encoder.tflite
!wget -q https://huggingface.co/rocca/lyra-v2-soundstream/resolve/main/tflite/1.3.0/lyragan.tflite
!wget -q https://huggingface.co/rocca/lyra-v2-soundstream/resolve/main/tflite/1.3.0/quantizer_encoder.tflite
!wget -q https://huggingface.co/rocca/lyra-v2-soundstream/resolve/main/tflite/1.3.0/quantizer_decoder.tflite

import tensorflow as tf
import numpy as np

import librosa

def getAudioData(audio_file, verbose=False):
    data, sr = librosa.load(audio_file, sr=None)

    if verbose:
        print(len(data))

    # Pad with zeros so the length is a multiple of the 320-sample frame size
    batch_size = 320
    padding_length = -len(data) % batch_size
    padded_data = np.pad(data, (0, padding_length), mode='constant', constant_values=0)

    return padded_data, sr
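
One thing worth double-checking in getAudioData: the Lyra v2 models operate on 16 kHz mono audio, so a 320-sample frame is exactly 20 ms. sr=None keeps the file's native rate, and a wav at any other rate will come out of the pipeline distorted. A minimal sketch of forcing the rate at load time instead:

data, sr = librosa.load(audio_file, sr=16000, mono=True)  # resample to 16 kHz mono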

# Encoder:
def runEncoderInference(input_data, verbose=False):
    interpreter = tf.lite.Interpreter(model_path="encoder.tflite")
    interpreter.allocate_tensors()

    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()

    # Shape the 320-sample frame as a (1, 320) batch for the encoder
    input_data = np.array(input_data, dtype=input_details[0]['dtype'])
    input_data = np.reshape(input_data, (1, 320))

    interpreter.set_tensor(input_details[0]['index'], input_data)

    interpreter.invoke()
    output_data = interpreter.get_tensor(output_details[0]['index'])

    if verbose:
        print(output_data)

    return output_data
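
As a quick sanity check on the encoder in isolation, a silent frame is enough to see the feature shape this model build emits (the shape/dtype printed are whatever the converted model reports, not something I'm asserting):

features = runEncoderInference(np.zeros(320, dtype=np.float32))
print(features.shape, features.dtype)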

# Quantizer Encoder:
def runQuantizerEncoderInference(input_data2, verbose=False):
    interpreter = tf.lite.Interpreter(model_path="quantizer_encoder.tflite")
    interpreter.allocate_tensors()

    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()

    # 46 = number of quantizers; at 4 bits per quantizer and 50 frames/s this
    # is Lyra v2's highest bitrate (9.2 kbps)
    input_data1 = np.array(46, dtype=np.int32)
    interpreter.set_tensor(input_details[0]['index'], input_data1)

    interpreter.set_tensor(input_details[1]['index'], input_data2)

    interpreter.invoke()
    output_data = interpreter.get_tensor(output_details[0]['index'])

    if verbose:
        print(output_data)

    return output_data
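
Note that treating input index 0 as the scalar quantizer count and index 1 as the embedding is an assumption about this particular conversion; the ordering reported by get_input_details() is not guaranteed. A quick inspection confirms which tensor is which:

interpreter = tf.lite.Interpreter(model_path="quantizer_encoder.tflite")
interpreter.allocate_tensors()
for d in interpreter.get_input_details():
    print("input ", d['index'], d['name'], d['shape'], d['dtype'])
for d in interpreter.get_output_details():
    print("output", d['index'], d['name'], d['shape'], d['dtype'])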

# Quantizer Decoder:
def runQuantizerDecoderInference(input_data, verbose=False):
    interpreter = tf.lite.Interpreter(model_path="quantizer_decoder.tflite")
    interpreter.allocate_tensors()

    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()

    interpreter.set_tensor(input_details[0]['index'], input_data)

    interpreter.invoke()
    output_data = interpreter.get_tensor(output_details[0]['index'])

    if verbose:
        print(output_data)

    return output_data

# Decoder:
def runDecoderInference(input_data, verbose=False):
    interpreter = tf.lite.Interpreter(model_path="lyragan.tflite")
    interpreter.allocate_tensors()

    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()

    interpreter.set_tensor(input_details[0]['index'], input_data)

    interpreter.invoke()
    output_data = interpreter.get_tensor(output_details[0]['index'])

    if verbose:
        print(output_data)

    return output_data
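
One possible contributor to the clipping: each helper above constructs a fresh tf.lite.Interpreter for every 320-sample frame, which is slow and resets any internal state the streaming models may carry between frames. A minimal sketch of building each interpreter once and reusing it across the loop (make_runner and the runner names are my own, not part of the Lyra repo):

def make_runner(model_path):
    interpreter = tf.lite.Interpreter(model_path=model_path)
    interpreter.allocate_tensors()
    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()

    def run(*inputs):
        # Feed inputs in the order reported by get_input_details()
        for detail, value in zip(input_details, inputs):
            interpreter.set_tensor(detail['index'], np.asarray(value, dtype=detail['dtype']))
        interpreter.invoke()
        return interpreter.get_tensor(output_details[0]['index'])

    return run

encode = make_runner("soundstream_encoder.tflite")
quantize = make_runner("quantizer_encoder.tflite")
dequantize = make_runner("quantizer_decoder.tflite")
decode = make_runner("lyragan.tflite")

# The per-frame calls in the loop below then become:
# audio = decode(dequantize(quantize(46, encode(frame.reshape(1, 320)))))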

audio_file = "<wavfile_path>.wav"
data, sr = getAudioData(audio_file)

# Run each 320-sample frame through the full pipeline
batch_size = 320
num_batches = len(data) // batch_size
waveform_data = None
audio_clips = None

for i in range(num_batches):
    start_idx = i * batch_size
    end_idx = (i + 1) * batch_size
    batch_data = data[start_idx:end_idx]

    enc_output = runEncoderInference(batch_data)
    qe_output = runQuantizerEncoderInference(enc_output)
    qd_output = runQuantizerDecoderInference(qe_output)
    dec_output = runDecoderInference(qd_output)

    if i == 0:
        waveform_data = dec_output[0]  # Running buffer of decoded samples
        audio_clips = dec_output       # Per-frame clips
    else:
        waveform_data = np.concatenate((waveform_data, dec_output[0]))
        audio_clips = np.concatenate((audio_clips, dec_output))
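
Before writing the file out, it may also be worth a quick diagnostic on the assembled buffer; samples outside [-1, 1] play back as exactly the kind of clicks described above (this check is mine, not part of the original flow):

peak = float(np.max(np.abs(waveform_data)))
print(f"peak amplitude: {peak:.3f}")
if peak > 1.0:
    waveform_data = np.clip(waveform_data, -1.0, 1.0)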

import torchaudio
import torch

audio_tensor = torch.from_numpy(waveform_data).unsqueeze(0)  # (1, num_samples)
output_file = "<your_output_path>.wav"
torchaudio.save(output_file, audio_tensor, sr)

Questions

Resources

Sound samples.zip

shlomiez commented 2 months ago

Same happens to me... Did you solve it?

barrylee111 commented 1 month ago

@shlomiez The decoded output had a prefix of samples with extremely high or low values prepended to it. I don't remember the exact root cause, but the fix came from how we were staging and building the DLL: one of the methods we added to the DLL was incorrectly prepending those out-of-range data points to the output.
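
If you want to verify that against the Python flow above, a hypothetical check along these lines will show whether a run of out-of-range samples is sitting at the start of the decoded buffer:

valid = np.abs(waveform_data) <= 1.0
first_valid = int(np.argmax(valid)) if valid.any() else 0  # first in-range sample
waveform_data = waveform_data[first_valid:]  # drop the bad prefix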

Hope this helps!