@KihongK The primary reason for the Whisper-Tiny model's low accuracy is how the audio is being segmented. Feeding fixed 3-second clips without regard for natural pauses in speech often cuts words and phrases in half. Instead of fixed time intervals, segment the audio at pauses in the speech: use voice activity detection (VAD) so that clips end at natural silences rather than mid-word or mid-sentence.
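Here is a minimal sketch of what pause-based segmentation could look like. It uses a simple energy (RMS) threshold as the VAD; the class name, frame size, and threshold values are assumptions for illustration, not taken from this thread, and would need tuning per device (or replacement with a proper VAD such as WebRTC VAD).

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative energy-based VAD: buffers 16 kHz mono PCM and emits a
// segment once a sustained pause is detected. All constants are assumptions.
public class EnergyVad {
    private static final int FRAME_SIZE = 320;        // 20 ms at 16 kHz
    private static final double SILENCE_RMS = 500.0;  // tune per microphone/environment
    private static final int PAUSE_FRAMES = 15;       // ~300 ms of silence ends a segment

    private final List<Short> segment = new ArrayList<>();
    private int silentFrames = 0;

    /** Feed one 20 ms frame; returns a finished segment at a natural pause, else null. */
    public short[] acceptFrame(short[] frame) {
        for (short s : frame) segment.add(s);
        silentFrames = (rms(frame) < SILENCE_RMS) ? silentFrames + 1 : 0;

        // Close the segment once the pause is long enough and speech was buffered before it.
        if (silentFrames >= PAUSE_FRAMES && segment.size() > PAUSE_FRAMES * FRAME_SIZE) {
            short[] out = new short[segment.size()];
            for (int i = 0; i < out.length; i++) out[i] = segment.get(i);
            segment.clear();
            silentFrames = 0;
            return out;
        }
        return null;
    }

    private static double rms(short[] frame) {
        double sumSquares = 0;
        for (short s : frame) sumSquares += (double) s * s;
        return Math.sqrt(sumSquares / frame.length);
    }
}
```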
We are currently using the Whisper-Tiny multilingual model and are looking for ways to improve its accuracy, speed, and overall efficiency. Any insights or suggestions would be appreciated.
Model: Whisper-Tiny Multilingual TFLite (decoder language ID is Korean) https://github.com/nyadla-sys/whisper.tflite/discussions/15#discussioncomment-7362798
Applied: https://github.com/vilassn/whisper_android/issues/4#issuecomment-1846846235
To achieve real-time speech processing, we are using sendData(sample) in Recorder.java.
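For context, a hypothetical sketch of how a VAD like the one above could sit in front of that streaming path, so that sendData receives pause-aligned clips instead of fixed-length ones. Only sendData(sample) comes from this thread; the read loop, AudioRecord setup, and frame size are assumptions that would need adapting to the actual Recorder.java.

```java
// Hypothetical integration into the Recorder read loop (sketch only):
// run each frame through the VAD and forward only completed segments.
void recordLoop(android.media.AudioRecord audioRecord) {
    short[] frame = new short[320];                  // 20 ms at 16 kHz (assumption)
    EnergyVad vad = new EnergyVad();
    while (!Thread.currentThread().isInterrupted()) {
        int read = audioRecord.read(frame, 0, frame.length);
        if (read <= 0) continue;
        short[] segment = vad.acceptFrame(frame);
        if (segment != null) {
            sendData(segment);                       // forward a pause-aligned clip
        }
    }
}
```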
The accuracy is low; are there any ways to improve it?
I understand that the accuracy is limited partly because it is a Tiny model and it has been converted to TFLite, but I would still like to improve the performance as much as possible.