Discrepancies in Transcription Quality Between "diarize_whisper.rs" and "pyannote.rs" and ONNX Runtime 'Expand Node' Error for some segments

altunenes commented 3 months ago

I am running tests on about 20 audio files in different languages, trying each file with both "diarize_whisper.rs" and "pyannote.rs". First of all, segmentation and speaker identification are nearly perfect with pyannote.rs: I would estimate over 95% accuracy, and only overlapping (parallel) speech is problematic.

I noticed that even when the audio file is exactly the same, the transcription produced by "pyannote.rs" differs from the one produced by "diarize_whisper.rs". I thought this might be caused by different normalization steps, but I could not find the source of the difference in the examples. Although the two examples do the transcription in a broadly similar way, diarize_whisper.rs gives better results.

Also, for one audio file, "pyannote.rs" recognizes the segment boundaries and the speaker correctly but fails to transcribe one segment and prints the error shown here. The same segment is transcribed correctly by "diarize_whisper.rs".

2024-08-09 22:42:18.1037286 [E:onnxruntime:, sequential_executor.cc:514 onnxruntime::ExecuteKernel] Non-zero status code returned while running Expand node. Name:'/Expand' Status Message: invalid expand shape
test_sherpa_vs_py\target\debug\build\sherpa-rs-sys-ef584d6cfacf1777\out\sherpa-onnx\sherpa-onnx/csrc/offline-recognizer-whisper-impl.h:DecodeStream:176 

Caught exception:

Non-zero status code returned while running Expand node. Name:'/Expand' Status Message: invalid expand shape

Return an empty result. Number of input frames: 277, Current tail paddings: 1000. If you see a lot of such exceptions, please consider using a larger --whisper-tail-paddings
start = 17.65, end = 20.41, speaker = 2, text =  //so this is correct for start/end/speaker, but no text as you can see
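
A quick sanity check on the numbers in that message, as a minimal sketch: assuming the Whisper feature extractor uses a 10 ms frame shift (an assumption on my side, not something the log states), 277 input frames is about 2.77 s, which matches the length of the 17.65 to 20.41 segment. So the failure seems to be tied to this one short segment rather than to the whole file. `frames_to_seconds` is a hypothetical helper, not part of sherpa-rs.

```rust
/// Assumed frame shift of the Whisper feature extractor, in seconds (10 ms).
const FRAME_SHIFT_SECS: f32 = 0.01;

/// Convert a feature-frame count from the log into an approximate duration.
fn frames_to_seconds(frames: u32) -> f32 {
    frames as f32 * FRAME_SHIFT_SECS
}

fn main() {
    // Values copied from the log line and the segment printed after it.
    let logged_frames = 277;
    let (start, end) = (17.65_f32, 20.41_f32);
    println!(
        "frames -> {:.2} s, segment -> {:.2} s",
        frames_to_seconds(logged_frames),
        end - start
    );
}
```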

TLDR: "diarize_whisper.rs" does excellent transcription. However, it does poor segmentation and identification.

"pyannote.rs" does excellent segmentation and identification, but transcription is slightly worse than "diarize_whisper.rs".

Any ideas about what causes this difference in transcription?

thewh1teagle commented 3 months ago

I am running tests on about 20 audio files in different languages, trying each file with both "diarize_whisper.rs" and "pyannote.rs". First of all, segmentation and speaker identification are nearly perfect with pyannote.rs: I would estimate over 95% accuracy, and only overlapping (parallel) speech is problematic.

Good to hear it's that accurate!

TLDR: "diarize_whisper.rs" does excellent transcription. However, it does poor segmentation and identification.

Can you share the audio files? I tested the audio from the example, and both transcription and segmentation look good with pyannote.

By the way, I added another example to pyannote-rs to make testing easier: it saves the segments to files so they can be checked.

altunenes commented 3 months ago

Cool! Nice example for tests!

For example, this is one of the audio files I use:

Interestingly, in this example pyannote seems to produce the better transcript, lol. Of course, this is probably also related to the very good segmentation on the pyannote side. But as you can see in the pyannote output below, there are warning messages for two segments.

https://github.com/yinruiqing/pyannote-whisper/blob/main/data/afjiv.wav

With this example: https://github.com/thewh1teagle/sherpa-rs/blob/main/examples/pyannote.rs the output is:

start = 5.23, end = 12.26, speaker = 1, text =  I think if you're a leader and you don't understand the terms that you're using, that's probably the first start. It's really important that as...

start = 12.53, end = 15.86, speaker = 1, text =  a leader in the organization you understand what the digitization means.

start = 16.11, end = 24.68, speaker = 1, text =  You take the time to read widely in the sector. There are a lot of really good books, Kevin Kelly, who started Wild Magazine, has written a great book on that.

start = 24.99, end = 40.71, speaker = 1, text =  on various technologies. I think understanding the technologies, understanding what's out there so that you can separate the hype from the hope is really an important first step. And then making sure you understand the relevance of that for your function and how that fits into your business is the second step.

C:\Users\enes-\OneDrive\Masaüstü\test_vad\target\debug\build\sherpa-rs-sys-ef584d6cfacf1777\out\sherpa-onnx\sherpa-onnx/csrc/offline-recognizer-whisper-impl.h:DecodeStream:128 Only waves less than 30 seconds are supported. We process only the first 30 seconds and discard the remaining data

start = 40.85, end = 83.36, speaker = 2, text =  I think two simple suggestions. One is I love the phrase "Brilyon at the basics." How can you become "Brilyon at the basics?" Beyond that, the fundamental thing I've seen which hasn't changed is so few organizations as a first step have truly taking control of their spend data. As a key first step on the digital transformation, taking ownership of data. That's not a decision to use one vendor.

start = 84.32, end = 85.80, speaker = 2, text =  and the second thing is
2024-08-10 01:47:43.2191741 [E:onnxruntime:, sequential_executor.cc:514 onnxruntime::ExecuteKernel] Non-zero status code returned while running Expand node. Name:'/Expand' Status Message: invalid expand shape
C:\Users\enes-\OneDrive\Masaüstü\test_vad\target\debug\build\sherpa-rs-sys-ef584d6cfacf1777\out\sherpa-onnx\sherpa-onnx/csrc/offline-recognizer-whisper-impl.h:DecodeStream:176

Caught exception:
Non-zero status code returned while running Expand node. Name:'/Expand' Status Message: invalid expand shape
Return an empty result. Number of input frames: 125, Current tail paddings: 1000. If you see a lot of such exceptions, please consider using a larger --whisper-tail-paddings

start = 86.46, end = 87.71, speaker = 2, text = 

start = 88.07, end = 94.04, speaker = 1, text =  to suppliers in the market, talk to them, collaborate with them, you'll get a much better app.

start = 94.36, end = 97.79, speaker = 2, text =  think about what outcome you want at the end.

start = 97.84, end = 118.36, speaker = 2, text =  instead of thinking about the different processes and their software names. So, Esourcing being one of 20. Think big and be brave, I think, and talk to technology vendors because rather than just sending them forms, we weren't bite you.

start = 118.61, end = 129.29, speaker = 2, text =  I think we should fundamentally all of us, we think how procurement should be done and then start to define the functionality that we need and how we can make.

start = 130.07, end = 133.44, speaker = 2, text =  What we do today is absolute t.

start = 134.81, end = 137.73, speaker = 1, text =  We don't like it, but cute, cute people don't.

start = 137.91, end = 144.55, speaker = 2, text =  it. I call it "Don't like it." Nobody wants it and which spending a huge amount of money.
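
The "Only waves less than 30 seconds are supported" warning in the output above lines up with the 40.85 to 83.36 segment, which is about 42.5 s long, and the text for that segment does stop early. One possible workaround, as an untested sketch, is to split such a segment into chunks shorter than 30 s before transcribing each chunk. The 16 kHz mono `f32` sample format and the `split_long_segment` helper are assumptions for illustration, not part of sherpa-rs, and a fixed cut like this can split a word in half.

```rust
/// Assumed sample rate of the decoded audio (16 kHz mono).
const SAMPLE_RATE: usize = 16_000;
/// Upper bound per chunk, kept just under the 30-second limit from the warning.
const MAX_CHUNK_SECS: usize = 29;

/// Split a segment's samples into consecutive chunks of at most MAX_CHUNK_SECS.
/// Hypothetical helper: it cuts at fixed positions, so a word may be split.
fn split_long_segment(samples: &[f32]) -> Vec<&[f32]> {
    samples.chunks(SAMPLE_RATE * MAX_CHUNK_SECS).collect()
}

fn main() {
    // Silent stand-in for the 40.85 s to 83.36 s segment (about 42.5 s).
    let samples = vec![0.0_f32; (42.51 * SAMPLE_RATE as f32) as usize];
    for (i, chunk) in split_long_segment(&samples).iter().enumerate() {
        println!("chunk {i}: {:.2} s", chunk.len() as f32 / SAMPLE_RATE as f32);
    }
}
```

Each chunk would then be transcribed separately and the texts joined; cutting at a pause instead of a fixed offset would probably give cleaner results.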

and with https://github.com/thewh1teagle/sherpa-rs/blob/main/examples/diarize_whisper.rs

(speaker 0)  I think if you're a leader and you don't understand the terms that you're using, that's probably the first start. It's really important that as a leader in the organisation, you understand what digitization means. You take the time to read widely in the sector. There are a lot of really good books, Kevin Kelly, who started Wild Magazine, has written a great book on various technologies. I think. | 5.256s - 26.524s

(speaker 0)  understanding the technologies, understanding what's out there so that you can separate the hype from the hope is really an important first step. And then making sure you understand the relevance of that for your function and how that fits into your business is the second step. I think two simple suggestions. One is I love the phrase brilliant. | 26.6s - 47.42s

(speaker 1)  basics, right? So, you know, how can you become brilliant at the basics? But beyond that, you know, the fundamental thing I've seen which hasn't changed is so few organizations as a first step have truly taking control of their spend data. You know, as a key first step on the digital transformation, taking ownership of data. | 47.496s - 67.644s

(speaker 1)  And that's not a decision to use one vendor over someone else. That says, we are going to be completely data driven. We're going to try and be as real time as possible. And we're going to be able to explain that data to anyone the way they want to see it. Understand why you're doing it. | 67.944s - 83.856s

(speaker 2)  and the second thing is... | 84.84s - 86.224s

(speaker 2)  reach out. | 87.016s - 88.08s

(speaker 2)  to suppliers in the market, talk to them, collaborate with them, you'll get a much better outcome. | 88.68s - 94.48s

(speaker 3)  think about what outcome you want at the end. | 95.048s - 98.09599s

(speaker 3)  instead of thinking about | 98.632s - 100.656006s

(speaker 3)  the different processes and software names. | 101.064s - 104.72s

(speaker 3)  he's sourcing being one of 20. Think beer can be brave, I think, and talk to technology vendors. | 105.384s - 113.392006s

(speaker 3)  because rather than just sending them forms. | 113.768s - 116.24s

(speaker 4)  We weren't bite you. I think we should fundamentally all of us. We think. | 116.84s - 122.479996s

(speaker 4)  how peculiar should be done | 122.92s - 124.336s

(speaker 4)  and then start to define the functionality that we need and how we can make this work. | 125.192s - 129.84s

(speaker 4)  What we do today is absolutely | 130.856s - 133.296s

(speaker 5)  "Bow" | 133.672s - 134.256s

(speaker 4)  We don't like it, but you don't like it. | 135.624s - 138.224s

(speaker 4)  I call it "Don't Like It" Nobody wants it and which spending a huge amount of money | 138.888s - 143.952s

(speaker 4)  for no reason | 144.616s - 145.48799s
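
To put a number on how much the two transcripts differ, instead of comparing them by eye, a word-level edit distance could be used. This is only a sketch: neither output is a ground-truth reference, so the result is a disagreement rate between the two examples, not a real word error rate. All function names here are hypothetical.

```rust
/// Lowercase and strip punctuation so only word differences are counted.
fn normalize(text: &str) -> Vec<String> {
    text.split_whitespace()
        .map(|w| {
            w.chars()
                .filter(|c| c.is_alphanumeric() || *c == '\'')
                .flat_map(|c| c.to_lowercase())
                .collect::<String>()
        })
        .filter(|w| !w.is_empty())
        .collect()
}

/// Word-level Levenshtein distance (substitutions, insertions, deletions).
fn word_edit_distance(a: &[String], b: &[String]) -> usize {
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, wa) in a.iter().enumerate() {
        let mut curr = vec![i + 1];
        for (j, wb) in b.iter().enumerate() {
            let cost = if wa == wb { 0 } else { 1 };
            curr.push((prev[j] + cost).min(prev[j + 1] + 1).min(curr[j] + 1));
        }
        prev = curr;
    }
    prev[b.len()]
}

/// Word edits needed to turn `reference` into `hypothesis`, per reference word.
fn disagreement_rate(reference: &str, hypothesis: &str) -> f32 {
    let (r, h) = (normalize(reference), normalize(hypothesis));
    if r.is_empty() {
        return if h.is_empty() { 0.0 } else { 1.0 };
    }
    word_edit_distance(&r, &h) as f32 / r.len() as f32
}

fn main() {
    // Short excerpt taken from the two outputs above.
    let from_diarize_whisper = "you'll get a much better outcome.";
    let from_pyannote = "you'll get a much better app.";
    println!("{:.3}", disagreement_rate(from_diarize_whisper, from_pyannote));
}
```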

altunenes commented 3 months ago

Note: using the "base model" instead of the "tiny model" fixes these errors:

"2024-08-13 00:51:19.1267172 [E:onnxruntime:, sequential_executor.cc:514 onnxruntime::ExecuteKernel] Non-zero status code returned while running Expand node. Name:'/Expand' Status Message: invalid expand shape
C:\Users\enes-\OneDrive\Masaüstü\test_vad\target\debug\build\sherpa-rs-sys-ef584d6cfacf1777\out\sherpa-onnx\sherpa-onnx/csrc/offline-recognizer-whisper-impl.h:DecodeStream:176

Caught exception:

Non-zero status code returned while running Expand node. Name:'/Expand' Status Message: invalid expand shape"
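
If the tiny model is still preferred for speed, one option might be to fall back to the base model only for segments that come back empty. The sketch below shows the control flow only: `fast` and `fallback` are hypothetical closures standing in for real recognizer calls, and I have only seen the base model avoid the error on my own files.

```rust
/// Try the fast model first; if it returns empty text (as in the failed
/// segments above), retry the same samples with the larger model.
/// `fast` and `fallback` are hypothetical stand-ins for real recognizer calls.
fn transcribe_with_fallback<F, G>(samples: &[f32], mut fast: F, mut fallback: G) -> String
where
    F: FnMut(&[f32]) -> String,
    G: FnMut(&[f32]) -> String,
{
    let text = fast(samples);
    if text.trim().is_empty() {
        fallback(samples)
    } else {
        text
    }
}

fn main() {
    let samples = vec![0.0_f32; 16_000];
    // Placeholder closures: the "tiny" one simulates the empty result.
    let tiny = |_: &[f32]| String::new();
    let base = |_: &[f32]| String::from("<text from base model>");
    println!("{}", transcribe_with_fallback(&samples, tiny, base));
}
```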

thewh1teagle commented 1 month ago

See the new diarize example; it should be more accurate.

altunenes commented 1 month ago

let's try it!!! but again you put me in a “loop of indecision” :-) whisper or sherpa :))))

just kidding haha, thank you <3

thewh1teagle / sherpa-rs

Discrepancies in Transcription Quality Between "diarize_whisper.rs" and "pyannote.rs" and ONNX Runtime 'Expand Node' Error for some segments #10