Vosk transcription may slow down when processing large cases

As discussed in #1899, when using Vosk audio transcription in a very large case (with many audio files), @paulobreim noticed that CPU usage dropped after some point. I was able to reproduce the issue by processing a large sample of ~150K audio files.

Later I wrote the small standalone program below, which makes the problem easier to reproduce (on a Windows PC with 48 logical processors). With this program and the sample audio below, the issue (CPU usage drops and transcription slows down) becomes noticeable after a couple of minutes.

import java.io.File;
import java.io.InputStream;
import java.util.SplittableRandom;

import javax.sound.sampled.AudioSystem;

import org.vosk.Model;
import org.vosk.Recognizer;

public class VoskTest {
    public static void main(String[] args) throws Exception {
        Model model = new Model("vosk-model-small-en-us-0.15");
        Thread[] threads = new Thread[Runtime.getRuntime().availableProcessors()];
        for (int i = 0; i < threads.length; i++) {
            (threads[i] = new Thread() {
                public void run() {
                    try {
                        // 1 MB read buffer: the size at which the slowdown shows up
                        byte[] buf = new byte[1 << 20];
                        SplittableRandom rnd = new SplittableRandom();
                        Recognizer recognizer = new Recognizer(model, 16000);
                        recognizer.setWords(true);
                        for (int rep = 0; rep < 10000; rep++) {
                            InputStream ais = AudioSystem.getAudioInputStream(new File("sample-audio.wav"));
                            int nbytes = 0;
                            while ((nbytes = ais.read(buf)) >= 0) {
                                if (recognizer.acceptWaveForm(buf, nbytes)) {
                                    recognizer.getResult();
                                } else {
                                    recognizer.getPartialResult();
                                }
                            }
                            ais.close();
                            recognizer.getFinalResult();
                            recognizer.reset();
                            System.out.println(rep + ":" + Thread.currentThread().getName());
                            Thread.sleep(rnd.nextInt(10));
                        }
                        recognizer.close();
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            }).start();
        }
        for (Thread t : threads) {
            t.join();
        }
        model.close();
    }
}

The sample audio I used: sample-audio.zip

Things I tried that did NOT make any difference to the described behavior:

Upgrading the Vosk library (we currently use 0.3.32; the latest is 0.3.45);
Recreating the Recognizer object after every N audio files (N = 64 and N = 1);
Using a separate Model object for each thread (not a good idea in practice);
Loading the whole audio file at once (not feasible in practice for very large files).

After many failed attempts, I finally found that limiting the read buffer (e.g. to 64 KB) solves the issue (a 1 MB buffer is currently used). My guess is that Vosk uses some kind of internal (native) memory buffer, handled by a synchronized piece of code, that has trouble dealing with large inputs and many threads.
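
To make the workaround concrete, here is a rough sketch of a bounded-chunk read loop. It is only an illustration, not the actual IPED change: `ChunkedFeed`, `ChunkSink` and `feed` are hypothetical names, and the sink stands in for the `recognizer.acceptWaveForm(buf, nbytes)` call from the test program above so that the snippet compiles without the Vosk jar.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper (not part of Vosk or IPED): reads a stream in bounded chunks.
final class ChunkedFeed {

    // Hypothetical callback; with Vosk it would wrap recognizer.acceptWaveForm(buf, len).
    interface ChunkSink {
        void accept(byte[] buf, int len) throws IOException;
    }

    // Reads 'in' to the end, handing at most chunkSize bytes to the sink per call.
    // Returns the total number of bytes delivered.
    static long feed(InputStream in, int chunkSize, ChunkSink sink) throws IOException {
        byte[] buf = new byte[chunkSize]; // e.g. 64 * 1024 instead of 1 << 20
        long total = 0;
        int n;
        while ((n = in.read(buf)) >= 0) {
            if (n > 0) { // skip zero-length reads, keep going until end of stream
                sink.accept(buf, n);
                total += n;
            }
        }
        return total;
    }
}
```

In the test program, this would amount to calling something like `ChunkedFeed.feed(ais, 64 * 1024, (b, n) -> { ... })` with the `acceptWaveForm` / `getResult` / `getPartialResult` logic inside the lambda. The only point is that no single call into the recognizer receives more than 64 KB.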
