Closed ManffTee closed 2 weeks ago
It could be because you are using a stream and not providing a large enough chunk of audio from that stream to the model. What happens if you write the audio from one pass to a wav file rather than transcribing it? What does it sound like?
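Something like this (a minimal sketch, assuming the chunk is already a float32 NumPy array at 16 kHz) would let you listen to exactly what the model receives:

import numpy as np
import soundfile as sf

def dump_chunk(audio: np.ndarray, path: str = "debug_chunk.wav") -> None:
    # Write one pass of the stream to disk so you can play it back;
    # 16000 is the sample rate the model expects.
    sf.write(path, audio, samplerate=16000)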
I have noticed this issue with Moonshine before (running the ONNX model in a Swift iOS app). However, the lack of output does not (at least in my limited testing) correspond with the number of samples passed in. I suggest using another model as a backup in case the result is empty.
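Roughly what I mean by a backup (a sketch, not tested; fallback_transcribe is a hypothetical stand-in for whatever second model you run, and the generate/decode_batch calls mirror the code later in this thread):

def transcribe_with_fallback(audio, model, tokenizer, fallback_transcribe):
    # Try Moonshine first; if it decodes to nothing, defer to the backup.
    tokens = model.generate(audio)
    text = tokenizer.decode_batch(tokens)
    if text and text[0].strip():
        return text[0]
    return fallback_transcribe(audio)  # e.g. a faster-whisper call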
Interesting. If you are able to provide an audio sample or test case where the issue is repeatable, that would be very helpful.
I don't think I will be able to provide those at this time. I am running what is essentially live_captions.py in my Swift app with mic input, which makes it difficult to run reproducible tests.
In my case I modified my code to export data at different spots (sketched below):

debug_audio_float32.txt -> after converting the audio data to a NumPy array
debug_audio_downsampled.txt -> after downsampling from 48000 Hz to 16000 Hz
debug_audio_flattened.txt -> after flattening the audio data
audio.wav -> the exported audio
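Roughly like this (a sketch; the argument names are placeholders for the buffers at each stage, not the variables in my actual code):

import numpy as np
import soundfile as sf

def dump_stages(audio_float32, audio_16k, audio_flat):
    # One text dump per preprocessing stage, plus the final WAV.
    np.savetxt("debug_audio_float32.txt", audio_float32)
    np.savetxt("debug_audio_downsampled.txt", audio_16k)
    np.savetxt("debug_audio_flattened.txt", audio_flat)
    sf.write("audio.wav", audio_16k, samplerate=16000)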
I ran the following:
$ ffmpeg -i audio.wav -ar 16000 audio_16k.wav
$ python
>>> import moonshine
>>> moonshine.transcribe('audio_16k.wav')
/data/env_keras_moonshine/lib/python3.10/site-packages/keras/src/ops/nn.py:545: UserWarning: You are using a softmax over axis 3 of a tensor of shape torch.Size([1, 8, 1, 1]). This axis has size 1. The softmax operation will always return the value 1, which is likely not what you intended. Did you mean to use a sigmoid instead?
warnings.warn(
['I want cookies.']
>>>
I wonder why it doesn't work with the NumPy array then.
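One way to narrow it down (a sketch, assuming the same 16 kHz file from above) is to load the WAV into NumPy and check the dtype, value range, and shape before calling the model:

import soundfile as sf

audio, sr = sf.read("audio_16k.wav", dtype="float32")
print(sr)                        # should be 16000
print(audio.dtype, audio.shape)  # float32, 1-D for mono audio
print(audio.min(), audio.max())  # should stay roughly within [-1, 1]
audio = audio[None, ...]         # add the batch dimension: (1, n_samples)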
It looks like I had a problem with the preprocessing of the audio data.
Here is my updated code:
import io
import numpy as np
from typing import Tuple

def transcribe_stream(self, audio_stream: io.BytesIO, **kwargs) -> Tuple[str, float, str]:
    from pydub import AudioSegment

    # self.model, self.tokenizer, and logger are initialized elsewhere in the class
    try:
        text = ""
        # Preprocessing
        # Load audio from the BytesIO stream
        audio_segment = AudioSegment.from_file(audio_stream)
        # Convert to mono
        audio_segment = audio_segment.set_channels(1)
        # Set frame rate to 16000 Hz
        audio_segment = audio_segment.set_frame_rate(16000)
        # Set sample width to 2 bytes (16-bit PCM)
        audio_segment = audio_segment.set_sample_width(2)
        # Convert the audio segment to raw bytes
        audio_bytes = audio_segment.raw_data
        # Convert the audio to a NumPy array, normalized to [-1, 1]
        audio = np.frombuffer(audio_bytes, np.int16) / 32768.0
        audio = audio.astype(np.float32)[None, ...]  # Add batch dimension
        tokens = self.model.generate(audio)
        # Decode the tokens to get text
        logger.info("Start decoding")
        text = self.tokenizer.decode_batch(tokens)
        logger.info(f"Decoded text: {text}")
        return "", -1.0, text
    except Exception as e:
        # Log any error that occurs
        logger.error(f"An error occurred: {e}")
        return "", -1.0, ""
I used pydub to preprocess the audio data, and this fixed the issue.
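For reference, a minimal way to exercise the function (the file path is just an example, and service stands for an instance of the class this method lives on):

import io

with open("audio_16k.wav", "rb") as f:
    stream = io.BytesIO(f.read())

_, _, text = service.transcribe_stream(stream)
print(text)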
Thank you for your support
I have a WebSocket API for speech-to-text and I want to use moonshine as an alternative to faster-whisper, but the decoded text is empty.
This is my log after receiving the audio signal:
MoonShineTranscriptionService.py:66 - transcribe stream
2024-11-05 12:10:33 [INFO] MoonShineTranscriptionService.py:90 - Audio shape before reshaping: (41472,)
2024-11-05 12:10:33 [INFO] MoonShineTranscriptionService.py:96 - Audio shape after reshaping: (1, 41472)
2024-11-05 12:10:33 [INFO] MoonShineTranscriptionService.py:101 - Max. token length: 15
2024-11-05 12:10:33 [INFO] MoonShineTranscriptionService.py:110 - Token count: 1
2024-11-05 12:10:33 [INFO] MoonShineTranscriptionService.py:115 - Decoded text:
This is the function used:
def transcribe_stream(self, audio_stream: io.BytesIO):
    import soundfile as sf
    from scipy.signal import resample
Does anyone have an idea why it returns nothing?