usefulsensors / moonshine

Fast and accurate automatic speech recognition (ASR) for edge devices

Decoded Text is empty for io.BytesIO #52

Closed ManffTee closed 2 weeks ago

ManffTee commented 2 weeks ago

I have a WebSocket API for speech-to-text and I want to use Moonshine as an alternative to faster-whisper, but the decoded text is empty.

This is my log after receiving the audio signal:

    MoonShineTranscriptionService.py:66 - transcribe stream
    2024-11-05 13:10:33 2024-11-05 12:10:33 [INFO] MoonShineTranscriptionService.py:90 - Audio shape before reshaping: (41472,)
    2024-11-05 13:10:33 2024-11-05 12:10:33 [INFO] MoonShineTranscriptionService.py:96 - Audio shape after reshaping: (1, 41472)
    2024-11-05 13:10:33 2024-11-05 12:10:33 [INFO] MoonShineTranscriptionService.py:101 - Max. token length: 15
    2024-11-05 13:10:33 2024-11-05 12:10:33 [INFO] MoonShineTranscriptionService.py:110 - Token count: 1
    2024-11-05 13:10:33 2024-11-05 12:10:33 [INFO] MoonShineTranscriptionService.py:115 - Decoded text:

This is the used function:

def transcribe_stream(self, audio_stream: io.BytesIO):
    import soundfile as sf
    from scipy.signal import resample

    logger.info("transcribe stream")

    try:
        audio_stream.seek(0)
        audio, samplerate = sf.read(audio_stream)

        # Check if audio_data is empty
        if audio.size == 0:
            logger.info("Audio stream is empty")
            return None

        # Convert from double to float
        audio = audio.astype(np.float32)

        # Downsample audio from 48000 Hz to 16000 Hz
        if samplerate != 16000:
            num_samples = int(len(audio) * (16000 / samplerate))
            audio = resample(audio, num_samples)
            samplerate = 16000

        # Single channel
        audio = audio.copy().flatten()

        # Check shape before reshaping
        logger.info(f"Audio shape before reshaping: {audio.shape}")

        # Reshape audio to (batch_size, input_features)
        audio = audio[np.newaxis, :]

        # Log the new shape
        logger.info(f"Audio shape after reshaping: {audio.shape}")

        # Max. token length
        max_len = int((audio.shape[-1] / samplerate) * 6)

        logger.info(f"Max. token length: {max_len}")

        tokens = self.model.generate(audio, max_len)

        # Tokens are generated successfully
        if tokens is None or len(tokens) == 0:
            logger.info("No tokens generated")
            return None

        logger.info(f"Token count: {len(tokens)}")

        # Decode the tokens
        text = self.tokenizer.decode_batch(tokens)[0]

        logger.info(f"Decoded text: {text}")

        return "", -1.0, text

    except Exception as e:
        logger.error(f"An error occurred: {e}")
        return "", -1.0, ""

Does anyone have an idea why it returns nothing?

evmaki commented 2 weeks ago

It could be because you are using a stream and not providing a large enough chunk of audio from that stream to the model. What happens if you write the audio from one pass to a wav file rather than transcribing it? What does it sound like?
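For example, a minimal sketch of dumping one pass of the stream to a file for listening (the function name and output path are illustrative only; it assumes the same soundfile-based reading as in the code above):

    import io
    import numpy as np
    import soundfile as sf

    def dump_stream_to_wav(audio_stream: io.BytesIO, out_path: str = "debug_pass.wav") -> str:
        # Read the stream exactly as transcribe_stream does
        audio_stream.seek(0)
        audio, samplerate = sf.read(audio_stream)

        # Write it back out unchanged so it can be played and inspected
        sf.write(out_path, audio.astype(np.float32), samplerate)
        return out_path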

nirbhaysinghnarang commented 2 weeks ago

I have noticed this issue with Moonshine before (running the ONNX model in a Swift iOS app). However, the lack of output does not (at least in my limited testing) correlate with the number of samples passed in. I suggest using another model as a backup in case this result is empty.
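A rough sketch of that fallback pattern, reusing the generate/decode_batch calls from the code above; fallback_transcribe stands in for whatever secondary model (e.g. faster-whisper) is available and is purely illustrative:

    def transcribe_with_fallback(model, tokenizer, audio, fallback_transcribe):
        # Try Moonshine first
        tokens = model.generate(audio)
        text = tokenizer.decode_batch(tokens)[0] if tokens is not None and len(tokens) > 0 else ""

        # If Moonshine produced nothing usable, fall back to the secondary model
        if not text.strip():
            text = fallback_transcribe(audio)

        return text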

evmaki commented 2 weeks ago

Interesting. If you are able to provide an audio sample or test case where the issue is repeatable, that would be very helpful.

nirbhaysinghnarang commented 2 weeks ago

> Interesting. If you are able to provide an audio sample or test case where the issue is repeatable, that would be very helpful.

I don't think I will be able to provide those at this time, since I am running what is essentially live_captions.py in my Swift app with mic input, which makes it difficult to run reproducible tests.

ManffTee commented 2 weeks ago

In my case, I modified my code to export data at different points:

- debug_audio_float32.txt -> after converting the audio data to a NumPy float32 array
- debug_audio_downsampled.txt -> after downsampling from 48000 Hz to 16000 Hz
- debug_audio_flattened.txt -> after flattening the audio data
- audio.wav -> exported audio

debug_audio.zip
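For anyone wanting to reproduce this kind of instrumentation, a rough sketch of how those intermediate dumps could be produced (the stages and file names follow the list above; dump_debug_artifacts is a hypothetical helper, and in practice these writes would sit inline in transcribe_stream):

    import numpy as np
    import soundfile as sf
    from scipy.signal import resample

    def dump_debug_artifacts(audio, samplerate):
        # After converting the audio data to float32
        audio = audio.astype(np.float32)
        np.savetxt("debug_audio_float32.txt", audio)

        # After downsampling from 48000 Hz to 16000 Hz
        if samplerate != 16000:
            audio = resample(audio, int(len(audio) * (16000 / samplerate)))
            samplerate = 16000
        np.savetxt("debug_audio_downsampled.txt", audio)

        # After flattening the audio data
        audio = audio.copy().flatten()
        np.savetxt("debug_audio_flattened.txt", audio)

        # Exported audio for listening
        sf.write("audio.wav", audio, samplerate)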

keveman commented 2 weeks ago

I ran the following:

$ ffmpeg -i audio.wav -ar 16000 audio_16k.wav
$ python
>>> import moonshine
>>> moonshine.transcribe('audio_16k.wav')
/data/env_keras_moonshine/lib/python3.10/site-packages/keras/src/ops/nn.py:545: UserWarning: You are using a softmax over axis 3 of a tensor of shape torch.Size([1, 8, 1, 1]). This axis has size 1. The softmax operation will always return the value 1, which is likely not what you intended. Did you mean to use a sigmoid instead?
  warnings.warn(
['I want cookies.']
>>> 
ManffTee commented 2 weeks ago

I wonder why it doesn't work with the NumPy array, then.
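One way to narrow that down is to print what actually reaches model.generate in both paths; a small diagnostic sketch (the expected values in the comments are taken from what the working pydub-based code below ends up producing: mono float32 in [-1, 1] at 16 kHz with a batch dimension):

    def describe_audio(audio, samplerate):
        print("dtype:", audio.dtype)                  # expect float32
        print("shape:", audio.shape)                  # expect (1, num_samples)
        print("min/max:", audio.min(), audio.max())   # expect values roughly within [-1, 1]
        print("samplerate:", samplerate)              # expect 16000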

ManffTee commented 2 weeks ago

It looks like I had a problem with the preprocessing of the audio data.

Here is my updated code:

def transcribe_stream(self, audio_stream: io.BytesIO, **kwargs) -> Optional[str]:
    from pydub import AudioSegment

    try:
        text = ""

        # Preprocessing
        # Load audio from BytesIO
        audio_segment = AudioSegment.from_file(audio_stream)

        # Convert to mono
        audio_segment = audio_segment.set_channels(1)

        # Set frame rate to 16000
        audio_segment = audio_segment.set_frame_rate(16000)

        # Set sample width to 2 bytes (16-bit PCM)
        audio_segment = audio_segment.set_sample_width(2)

        # Convert the audio segment to raw bytes
        audio_bytes = audio_segment.raw_data

        # Convert audio to NumPy array
        audio = (
            np.frombuffer(audio_bytes, np.int16) / 32768.0
        )  # Normalize to [-1, 1]
        audio = audio.astype(np.float32)[None, ...]  # Add batch dimension

        tokens = self.model.generate(audio)

        # Decode the tokens to get text
        logger.info("Start decoding")
        text = self.tokenizer.decode_batch(tokens)
        logger.info(f"Decoded text: {text}")

        return "", -1.0, text

    except Exception as e:
        # Log and save any error that occurs
        logger.error(f"An error occurred: {e}")
        return "", -1.0, ""

I used pydub to preprocess the audio data. This fixed my issue.

Thank you for your support.