Closed: UltraEval closed this issue 6 days ago.
It's certainly hard to tell what is going on. One thing worth checking: your audio starts immediately, while the model always starts the conversation with some incipit like "Hello, how are you doing?", so it may be better to delay your audio until the model has finished its first sentence (hardcoding a delay of around 5 s would probably be good enough).
I tried two methods to process the audio file:
Adding Pre-Zero Padding:
import torch

# load_wav is assumed to return a (channels, samples) tensor at the given rate
wav = load_wav(audio_path, 24000)
print('Input length:', wav.shape)
# Round the length up to a whole number of 1920-sample frames
current_length = wav.shape[-1]
target_length = ((current_length - 1) // 1920 + 1) * 1920
if current_length < target_length:
    padding = target_length - current_length
    wav = torch.nn.functional.pad(wav, (0, padding))
# Prepend 10 empty frames of leading silence
empty_frames = torch.zeros(1, 1920 * 10)
wav = torch.cat([empty_frames, wav], dim=-1)
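For reference, the rounding above can be sanity-checked with plain Python (assuming the 24 kHz sample rate and the 1920-sample frame size used in the snippet; the helper name is just for illustration):

```python
SAMPLE_RATE = 24000   # Hz, as in load_wav(audio_path, 24000)
FRAME = 1920          # samples per frame (80 ms at 24 kHz)

def pad_to_frame(n: int) -> int:
    """Round a sample count up to the next whole frame."""
    return ((n - 1) // FRAME + 1) * FRAME

print(pad_to_frame(100000))       # 101760 samples, i.e. 53 frames
print(FRAME * 10 / SAMPLE_RATE)   # 0.8: seconds added by the 10 empty frames
```

Note that 10 empty frames amount to only 0.8 s of leading silence, well short of the roughly 5 s delay suggested above.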
Adding Silent Audio:
from pydub import AudioSegment

# Prepend 5 seconds of silence to the input file
silence_duration = 5000  # milliseconds
silence = AudioSegment.silent(duration=silence_duration)
original_audio = AudioSegment.from_wav("test.wav")
padded_audio = silence + original_audio
padded_audio.export("padded_audio.wav", format="wav")
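If pulling in pydub is undesirable, the same leading silence can be prepended with only the standard-library wave module (the function and file names here are just examples):

```python
import wave

def prepend_silence(src_path: str, dst_path: str, seconds: float = 5.0) -> None:
    """Write dst_path = `seconds` of silence followed by the audio in src_path."""
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        audio = src.readframes(src.getnframes())
    # Silence in linear PCM is all-zero samples: sampwidth bytes per channel.
    n_silent = int(params.framerate * seconds)
    silence = b"\x00" * (n_silent * params.sampwidth * params.nchannels)
    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)
        dst.writeframes(silence + audio)
```

Because silence is just zero bytes, no resampling or re-encoding is needed; the output keeps the source's channel count, sample width, and sample rate.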
However, the output remains the same as before, so neither workaround helped.
Due diligence
Topic
The PyTorch implementation
Question
Thanks for the nice work.
I wrote code to generate an answer audio file, following the example at https://github.com/kyutai-labs/moshi/tree/main/moshi#api, like this:
Sometimes the output is normal:
but most of the time it just outputs:
The outputs above come from the same input file:
test.wav.zip
It's kind of weird; could you take a look?