k2-fsa / sherpa-onnx

Speech-to-text, text-to-speech, speaker recognition, and VAD using next-gen Kaldi with onnxruntime without Internet connection. Support embedded systems, Android, iOS, Raspberry Pi, RISC-V, x86_64 servers, websocket server/client, C/C++, Python, Kotlin, C#, Go, NodeJS, Java, Swift, Dart, JavaScript, Flutter, Object Pascal, Lazarus, Rust
https://k2-fsa.github.io/sherpa/onnx/index.html
Apache License 2.0

Add pyannote vad (segmentation) model #1197

Open · thewh1teagle opened this issue 1 month ago

thewh1teagle commented 1 month ago

I would like to use sherpa-onnx for speaker diarization. However, the current VAD model (silero) doesn't work well for me: it doesn't detect speech correctly. I tried another ONNX model, from the project pengzhendong/pyannote-onnx, and it detects speech much better. It is also ONNX-based. Can we add this model to sherpa-onnx?
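
For reference, this is roughly how the current silero VAD can be driven from the sherpa-onnx Python API (a minimal sketch following the published python-api-examples; the model path and thresholds here are placeholders, and the exact field names should be checked against the current API):

import librosa
import sherpa_onnx

# Placeholder path; download silero_vad.onnx from the sherpa-onnx release assets.
config = sherpa_onnx.VadModelConfig()
config.silero_vad.model = './silero_vad.onnx'
config.silero_vad.threshold = 0.5
config.silero_vad.min_silence_duration = 0.25  # seconds
config.silero_vad.min_speech_duration = 0.25   # seconds
config.sample_rate = 16000

vad = sherpa_onnx.VoiceActivityDetector(config, buffer_size_in_seconds=30)

samples, sample_rate = librosa.load('test.wav', sr=16000)
window_size = config.silero_vad.window_size  # samples fed to the VAD per step

for start in range(0, len(samples), window_size):
    vad.accept_waveform(samples[start:start + window_size])
    while not vad.empty():
        segment = vad.front
        begin = round(segment.start / sample_rate, 3)
        end = round((segment.start + len(segment.samples)) / sample_rate, 3)
        print(f'{begin}s - {end}s')
        vad.pop()

With the same audio, this is the baseline whose segments I'm comparing against the pyannote model below.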

csukuangfj commented 1 month ago

Would you like to contribute?

thewh1teagle commented 1 month ago

Would you like to contribute?

Unfortunately, I haven't worked with onnxruntime before, so I'm not sure how to implement it. I assume it should work similarly to the implementation of silero vad?

csukuangfj commented 1 month ago

Would you like to contribute?

Unfortunately, I haven't worked with onnxruntime before, so I'm not sure how to implement it. I assume it should work similarly to the implementation of silero vad?

Ok, we can take a look, but not this week. It may take some time to add it.

thewh1teagle commented 1 month ago

Ok, we can take a look, but not this week. It may take some time to add it.

Meanwhile, I created a basic implementation in Python. It looks accurate:

# python3 -m venv venv 
# source venv/bin/activate
# pip3 install onnxruntime numpy librosa
# wget https://github.com/pengzhendong/pyannote-onnx/raw/master/pyannote_onnx/segmentation-3.0.onnx
# wget https://github.com/thewh1teagle/sherpa-rs/releases/download/v0.1.0/motivation.wav -O test.wav
# python3 main.py

import onnxruntime as ort
import librosa
import numpy as np

def init_session(model_path):
    opts = ort.SessionOptions()
    opts.inter_op_num_threads = 1
    opts.intra_op_num_threads = 1
    opts.log_severity_level = 3
    sess = ort.InferenceSession(model_path, sess_options=opts)
    return sess

def read_wav(path: str):
    # librosa resamples to 16 kHz mono and returns float32 samples in [-1, 1]
    return librosa.load(path, sr=16000)

if __name__ == '__main__':
    session = init_session('segmentation-3.0.onnx')
    samples, sample_rate = read_wav('test.wav')

    # Conv1d & MaxPool1d & SincNet https://pytorch.org/docs/stable/generated/torch.nn.Conv1d.html https://pytorch.org/docs/stable/generated/torch.nn.MaxPool1d.html https://github.com/pyannote/pyannote-audio/blob/develop/pyannote/audio/models/blocks/sincnet.py#L50-L71
    frame_size = 270
    frame_start = 721
    window_size = sample_rate * 10 # 10s

    # Running state: whether we are inside a speech region, and the position
    # (in samples) of the frame currently being examined.
    is_speaking = False
    offset = frame_start
    start_offset = 0

    # Pad the end with silence so the last window is full and any open
    # speech region gets closed.
    samples = np.pad(samples, (0, window_size), 'constant')

    for start in range(0, len(samples), window_size):
        window = samples[start:start + window_size]
        # Model input is (batch, channel, samples); output[0][0] is (frames, classes).
        ort_outs: np.ndarray = session.run(None, {'input': window[None, None, :]})[0][0]
        for probs in ort_outs:
            # Class 0 is "no speech"; any other class means at least one speaker is active.
            predicted_id = np.argmax(probs)
            if predicted_id != 0:
                if not is_speaking:
                    start_offset = offset
                    is_speaking = True
            elif is_speaking:
                start_sec = round(start_offset / sample_rate, 3)
                end_sec = round(offset / sample_rate, 3)
                print(f'{start_sec}s - {end_sec}s')
                is_speaking = False
            offset += frame_size
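
Since the end goal is diarization rather than plain VAD: segmentation-3.0 is a powerset model, so each frame's class index encodes which of up to three local speakers are active (at most two at once), not just speech/non-speech. A minimal sketch of decoding that, assuming the usual class order (0 = no speech, 1-3 = single speakers, 4-6 = speaker pairs); the exact ordering should be verified against the model card:

# Hypothetical powerset table; check the class order against the model before relying on it.
POWERSET = [
    (),        # 0: no speech
    (0,),      # 1: speaker 0
    (1,),      # 2: speaker 1
    (2,),      # 3: speaker 2
    (0, 1),    # 4: speakers 0 and 1
    (0, 2),    # 5: speakers 0 and 2
    (1, 2),    # 6: speakers 1 and 2
]

def active_speakers(probs: np.ndarray):
    # Map a frame's class probabilities to the set of active local speakers.
    return POWERSET[int(np.argmax(probs))]

Note that these speaker ids are only consistent within a single 10-second window; turning them into global labels (actual diarization) still needs speaker embeddings and clustering across windows, which is presumably the part that would live in sherpa-onnx.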