jitsi / jiwer

Evaluate your speech-to-text system with similarity measures such as word error rate (WER)
Apache License 2.0

Multispeaker WER #34

Open mpariente opened 3 years ago

mpariente commented 3 years ago

Hi, thanks a bunch for this tool!

When working with speech mixtures, the WER computation can take into account that words from any of the speakers might be picked up. There is a description of the method here: https://my.fit.edu/~vkepuska/ece5527/sctk-2.3-rc1/doc/asclite.html

Would you be willing to integrate this feature into jiwer?

nikvaessen commented 3 years ago

I think there are two ways of implementing this:

1) Either we wrap asclite, which would require shipping its binary for every platform.
2) Or we write a custom dynamic programming solution, which would most likely be very slow if implemented in Python, or difficult if it needs to be written in C (I have little, if any, experience writing C and integrating it into a Python application).
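To give a feel for what option 2 involves, here is a minimal pure-Python sketch. It is not asclite's multi-dimensional alignment: it swaps in a simpler permutation-based scoring, where each hypothesis is assigned to one reference speaker and the assignment with the fewest word errors wins. The function names `word_edit_distance` and `permutation_wer` are hypothetical and not part of jiwer.

```python
from itertools import permutations


def word_edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance computed with dynamic programming."""
    # prev[j] is the distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def permutation_wer(references, hypotheses):
    """Lowest pooled WER over all assignments of hypotheses to speakers.

    Assumption: one hypothesis string per reference speaker.
    """
    if len(references) != len(hypotheses):
        raise ValueError("expected one hypothesis per reference speaker")
    refs = [r.split() for r in references]
    hyps = [h.split() for h in hypotheses]
    total_ref_words = sum(len(r) for r in refs)
    if total_ref_words == 0:
        raise ValueError("references must contain at least one word")
    # Try every speaker assignment and keep the one with the fewest errors.
    best_errors = min(
        sum(word_edit_distance(r, h) for r, h in zip(refs, perm))
        for perm in permutations(hyps)
    )
    return best_errors / total_ref_words


references = ["hello world", "good morning everyone"]
hypotheses = ["good morning everyone", "hello word"]
print(permutation_wer(references, hypotheses))
```

The number of assignments grows factorially with the number of speakers, and the inner loops run in interpreted Python, so this is only practical for a handful of speakers and short utterances.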

How would you use this feature? Are there many speech datasets which have this problem?

mpariente commented 3 years ago

Thanks for your answer.

> How would you use this feature? Are there many speech datasets which have this problem?

All datasets that include overlapping speech have this problem. A few examples: Chime5-6, AMI, wsj0-mix, Librimix. This seems to be needed in order to evaluate speech separation algorithms.

I'd go with solution 1. Personally, I wouldn't ship the binaries but would link to the installation instructions instead. This would be an optional feature of jiwer, and the user would need to take an extra step to benefit from it. WDYT?
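A rough sketch of how the optional dependency could be checked, assuming the user has installed asclite (part of SCTK) themselves and that it is on `PATH`. The helper `find_asclite` and the hint text are hypothetical; nothing here is existing jiwer API, and the actual asclite invocation is left out on purpose.

```python
import shutil

# Hypothetical placeholder: the real text would link to the installation instructions.
ASCLITE_INSTALL_HINT = "see the SCTK installation instructions"


def find_asclite(binary_name="asclite"):
    """Return the path to the asclite executable, or raise if it is not on PATH."""
    path = shutil.which(binary_name)
    if path is None:
        raise RuntimeError(
            f"'{binary_name}' was not found on PATH; multispeaker WER is an "
            f"optional feature that needs it ({ASCLITE_INSTALL_HINT})."
        )
    return path
```

With a check like this, the rest of the package keeps working without asclite, and only the multispeaker feature fails with a message pointing to the extra installation step.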