kyutai-labs / moshi


How to obtain Moshi's response using the API #157

Open treya-lin opened 4 days ago

treya-lin commented 4 days ago

Due diligence

Topic

The PyTorch implementation

Question

Hello, thanks for your great work. I am trying the Python API to see if I can use existing audio files to simulate streaming input and obtain Moshi's reply, but it didn't work as expected, so I assume I am not using it the proper way. Could you kindly take a look?

My main question:

  1. When I have an existing audio file and I want Moshi to listen and respond to it, it always responds with a greeting first, and then either remains silent or (sometimes) says something. Does this mean I need to add a very long pause and wait for it to reply? What is the best practice for making it reply to a given piece of speech? (A sketch of what I mean by "adding a pause" follows below.)
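
To make that concrete, this is the kind of thing I mean by "adding a pause": surrounding my utterance with silence so the model has time to greet before my speech starts, and room to answer afterwards. The helper below and its durations are just my guesses, not anything from the docs.

import numpy as np

# Hypothetical helper: wrap an utterance in leading/trailing silence.
# lead_s / trail_s are guessed durations, not documented values.
def with_pauses(wav: np.ndarray, sr: int = 24000,
                lead_s: float = 2.0, trail_s: float = 4.0) -> np.ndarray:
    lead = np.zeros(int(sr * lead_s), dtype=np.float32)
    trail = np.zeros(int(sr * trail_s), dtype=np.float32)
    return np.concatenate((lead, wav.astype(np.float32), trail))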

Some other questions:

  1. If I want to use my earlier input, Moshi's reply, and my new input to get a new round of replies, how should I form the input? (That is, how can I make Moshi aware of what it replied earlier? See the sketch after this list for the naive approach I have in mind.)
  2. Can I exert more control over how it replies? Say I already have a script: can I make Moshi follow that script when conversing with me?
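
For concreteness, the naive approach I have in mind for question 1 is to stitch the turns into a single input track and feed that back in. This is purely illustrative; I don't know whether Moshi can recognize its own earlier reply this way, and the gap duration is a guess.

import numpy as np

# Naive multi-turn input: [my earlier input, Moshi's reply, my new input],
# separated by short silent gaps (gap duration is a guess).
def build_multi_turn_input(turns: list[np.ndarray], sr: int = 24000,
                           gap_s: float = 1.0) -> np.ndarray:
    gap = np.zeros(int(sr * gap_s), dtype=np.float32)
    pieces = []
    for turn in turns:
        pieces.append(turn.astype(np.float32))
        pieces.append(gap)
    return np.concatenate(pieces)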

The code I used when trying to solve question 1:

  1. Mostly borrowed from Moshi's README.
  2. Modifications: (1) I padded the input audio so the number of samples is a multiple of 1920. (2) I keep the models in a local directory, so I changed the default repo path. (3) I appended 4 seconds of silence to the end of the audio. Initially I didn't add silence and Moshi didn't produce a reply, so I thought I might need to simulate a human pause, but it didn't work properly either way.
import os
import librosa
import numpy as np
import torch
import torchaudio
from moshi.models import loaders, LMGen

# Point the loaders at my local copy of the model weights.
loaders.DEFAULT_REPO = "/data/resources/models/kyutai/moshika-pytorch-bf16/"
device = "cuda"
mimi_weight = os.path.join(loaders.DEFAULT_REPO, loaders.MIMI_NAME)
mimi = loaders.get_mimi(mimi_weight, device=device)
mimi.set_num_codebooks(8)  # up to 32 for mimi, but limited to 8 for moshi.
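
(I also load the Moshi LM itself as in the README; without this, lm_gen further down would be undefined. I adapted the README's hf_hub_download call to my local directory:)

moshi_weight = os.path.join(loaders.DEFAULT_REPO, loaders.MOSHI_NAME)
moshi = loaders.get_moshi_lm(moshi_weight, device=device)
lm_gen = LMGen(moshi, temp=0.8, temp_text=0.7)  # sampling params as in the README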

def padding(wav: np.ndarray, multiple: int = 1920) -> np.ndarray:
    """
    Pads the audio signal so its length is a multiple of the specified value.

    Parameters:
    - wav (np.ndarray): The input audio signal, shape: [T].
    - multiple (int): The target multiple to pad the length to.

    Returns:
    - np.ndarray: The zero-padded audio signal.
    """
    if not isinstance(wav, np.ndarray):
        raise ValueError("Input wav must be a NumPy array.")

    if multiple <= 0:
        raise ValueError("Multiple must be a positive integer.")

    # Calculate the current length and the padding needed
    current_length = wav.shape[0]
    padding_length = (multiple - (current_length % multiple)) % multiple

    # Add zero-padding to the end of the audio
    if padding_length > 0:
        wav = np.pad(wav, (0, padding_length), mode='constant', constant_values=0)

    return wav

# Create the input: a speech recording with 4 s of silence appended at the end.
wav, sr = librosa.load(wavpath, sr=24000, mono=True)  # wavpath: path to my recording
silence_duration = 4  # seconds of trailing silence
silence = np.zeros(int(sr * silence_duration), dtype=np.float32)
wav = np.concatenate((wav, silence))
wav = padding(wav)
wav = torch.tensor(wav).unsqueeze(0).unsqueeze(0).to(device)  # Shape: [B=1, C=1, T]

# Encode the input. The non-streaming round trip is just a sanity check.
with torch.no_grad():
    nonstream_codes = mimi.encode(wav)  # [B, K = 8, T]
    non_stream_decoded = mimi.decode(nonstream_codes)  # not used further

    # Supports streaming too.
    frame_size = int(mimi.sample_rate / mimi.frame_rate)  # 24000 / 12.5 = 1920
    all_codes = []
    with mimi.streaming(batch_size=1):
        for offset in range(0, wav.shape[-1], frame_size):
            frame = wav[:, :, offset: offset + frame_size]
            codes = mimi.encode(frame)
            assert codes.shape[-1] == 1, codes.shape
            all_codes.append(codes)

import gc
def clear_cache():
    gc.collect()
    torch.cuda.empty_cache()

out_wav_chunks = []
# Now we will stream over both Moshi I/O, and decode on the fly with Mimi.
with torch.no_grad(), lm_gen.streaming(1), mimi.streaming(1):
    for idx, code in enumerate(all_codes):
        tokens_out = lm_gen.step(code.cuda())
        # tokens_out is [B, 1 + 8, 1], with tokens_out[:, 0] being the text token
        # and tokens_out[:, 1:] the audio codes.
        if tokens_out is not None:
            wav_chunk = mimi.decode(tokens_out[:, 1:])
            out_wav_chunks.append(wav_chunk)
        print(idx, end='\r')
out_wav = torch.cat(out_wav_chunks, dim=-1)
clear_cache()
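
To see what Moshi thinks it is saying (and when the greeting happens), I also tried decoding the text stream. This is my best guess from reading the repo: loaders.TEXT_TOKENIZER_NAME names the SentencePiece model shipped with the checkpoint, and ids 0 and 3 appear to be text padding tokens; neither is documented API as far as I know.

import sentencepiece

text_tokenizer = sentencepiece.SentencePieceProcessor(
    model_file=os.path.join(loaders.DEFAULT_REPO, loaders.TEXT_TOKENIZER_NAME))

def text_piece(tokens_out: torch.Tensor) -> str:
    """Turn the text token of one lm_gen.step() output into a printable piece."""
    token = tokens_out[0, 0, 0].item()
    if token in (0, 3):  # apparently padding/special ids (my guess)
        return ""
    return text_tokenizer.id_to_piece(token).replace("▁", " ")  # ▁ is SentencePiece's word boundary

Calling print(text_piece(tokens_out), end="") inside the loop above (whenever tokens_out is not None) prints the running transcript, which makes it easier to tell whether the model is greeting, listening, or answering.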

# Save the output file; torchaudio expects [C, T].
output_path = "output_moshi.wav"
torchaudio.save(output_path, out_wav[0].cpu(), sample_rate=24000)

I tried many times with audio of different lengths, but it always just returned Moshi saying something like "hey, what's up" or "hey, how's it going". Once or twice it replied with something meaningful after the greeting, but still, I hope it can just listen to my words and reply without always greeting first. I am trying to read the code too, but I don't think I am using it the proper way. Could you please give more guidance on how to play around with the API? Thank you! Any suggestion is much appreciated!

treya-lin commented 4 days ago

Examples (GitHub does not accept wav, so I had to upload them as webm, sorry...):

  1. In this example, Moshi only greets and doesn't reply with any meaningful content; it greets and then remains silent until the end. A_4.webm output_moshi_4.webm

  2. This seems to be one of the very few times when it did reply, but it doesn't respond like this consistently; sometimes it only greets. And I don't understand why it greets while the input is already talking? A_0.webm output_moshi.webm