vilassn / whisper_android

Offline Speech Recognition with OpenAI Whisper and TensorFlow Lite for Android

Is there any additional way to improve the performance? #14

Closed: KihongK closed this issue 1 month ago

KihongK commented 5 months ago

We are currently using the Whisper-Tiny multilingual model and seeking ways to improve its performance. We would appreciate any insights or suggestions on how to enhance the model's accuracy, speed, and overall efficiency.

Model: Whisper-Tiny multilingual TFLite (decoder language id set to Korean) https://github.com/nyadla-sys/whisper.tflite/discussions/15#discussioncomment-7362798

Applied:
https://github.com/vilassn/whisper_android/issues/4#issuecomment-1846846235
To achieve real-time speech processing, we feed audio through sendData(sample) in Recorder.java, roughly as in the sketch below.
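
For context, this is roughly what our capture loop looks like. It is a minimal sketch, assuming a sink that exposes a sendData(float[]) method mirroring the sendData(sample) call referenced above; the AudioRecord setup and chunk size here are illustrative, not the repo's exact code:

```java
import android.media.AudioFormat;
import android.media.AudioRecord;
import android.media.MediaRecorder;
import java.util.Arrays;

public class StreamingCapture {
    /** Hypothetical sink, standing in for the sendData(sample) call above. */
    public interface SampleSink { void sendData(float[] samples); }

    private static final int SAMPLE_RATE = 16000; // Whisper expects 16 kHz mono
    private volatile boolean running = true;

    // Requires the RECORD_AUDIO permission.
    public void run(SampleSink sink) {
        int minBuf = AudioRecord.getMinBufferSize(SAMPLE_RATE,
                AudioFormat.CHANNEL_IN_MONO, AudioFormat.ENCODING_PCM_FLOAT);
        AudioRecord rec = new AudioRecord(MediaRecorder.AudioSource.MIC,
                SAMPLE_RATE, AudioFormat.CHANNEL_IN_MONO,
                AudioFormat.ENCODING_PCM_FLOAT, minBuf * 2);
        // getMinBufferSize is in bytes, so a float[] of that length just adds headroom.
        float[] chunk = new float[minBuf];
        rec.startRecording();
        try {
            while (running) {
                // Blocking read keeps latency low without busy-waiting.
                int n = rec.read(chunk, 0, chunk.length, AudioRecord.READ_BLOCKING);
                if (n > 0) sink.sendData(Arrays.copyOf(chunk, n));
            }
        } finally {
            rec.stop();
            rec.release();
        }
    }

    public void stop() { running = false; }
}
```

Each chunk is handed off as soon as it is read, so the engine sees audio with minimal buffering delay.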

The accuracy is low; are there any ways to improve it?

I understand that the accuracy suffers partly because this is the Tiny model and partly because it has been converted to a TFLite model, but I would still like to improve the performance as much as possible.

vilassn commented 4 months ago

@KihongK The primary reason for the low accuracy of the Whisper-Tiny model here is how the audio is being segmented. Currently we feed fixed 3-second clips without considering the natural pauses in speech, which often cuts off words and phrases mid-stream. To improve accuracy, segment the audio at pauses in the speech rather than at fixed time intervals. This can be done with voice activity detection (VAD), which finds speech boundaries and segments the audio more naturally.

In short: improve the audio segmentation. Use VAD so that clips are cut at natural pauses, avoiding mid-word and mid-sentence breaks. A rough sketch of one way to do this is below.
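
One way to implement that pause-based cutting is an energy-based segmenter like the following. This is a rough sketch, not code from this repo: the RMS threshold, frame size, and pause length are illustrative values you would tune per device, and a trained VAD such as WebRTC VAD or Silero VAD will be more robust than a plain energy gate:

```java
import java.util.ArrayList;

public class PauseSegmenter {
    /** Callback that receives one speech segment, cut at a pause. */
    public interface Listener { void onSegment(float[] segment); }

    private static final int SAMPLE_RATE = 16000;            // Whisper expects 16 kHz
    private static final int FRAME = 320;                    // 20 ms frames at 16 kHz
    private static final double SILENCE_RMS = 0.01;          // tune per device/mic
    private static final int PAUSE_FRAMES = 15;              // ~300 ms pause ends a segment
    private static final int MAX_SAMPLES = SAMPLE_RATE * 30; // Whisper's 30 s window cap

    private final Listener listener;
    private final float[] frame = new float[FRAME];
    private final ArrayList<Float> segment = new ArrayList<>();
    private int frameFill = 0;
    private int silentRun = 0;
    private boolean hadSpeech = false;

    public PauseSegmenter(Listener listener) { this.listener = listener; }

    /** Feed raw PCM floats as they arrive, e.g. from sendData(sample). */
    public void accept(float[] samples) {
        for (float s : samples) {
            frame[frameFill++] = s;
            if (frameFill == FRAME) {
                onFrame();
                frameFill = 0;
            }
        }
    }

    private void onFrame() {
        double sum = 0;
        for (float s : frame) sum += s * s;
        boolean silent = Math.sqrt(sum / FRAME) < SILENCE_RMS;

        for (float s : frame) segment.add(s);
        if (silent) silentRun++; else { silentRun = 0; hadSpeech = true; }

        // Cut at a natural pause, or force a cut at the 30 s model limit.
        if ((hadSpeech && silentRun >= PAUSE_FRAMES) || segment.size() >= MAX_SAMPLES) {
            emit();
        }
    }

    private void emit() {
        if (hadSpeech) {                       // drop silence-only stretches
            float[] out = new float[segment.size()];
            for (int i = 0; i < out.length; i++) out[i] = segment.get(i);
            listener.onSegment(out);           // hand the clip to the Whisper engine
        }
        segment.clear();
        silentRun = 0;
        hadSpeech = false;
    }
}
```

Feed it from the same place sendData(sample) delivers audio and pass each emitted segment to the Whisper engine: clips then end at pauses instead of at an arbitrary 3-second mark, and silence-only stretches are dropped rather than transcribed.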