jitsi / jiwer

Evaluate your speech-to-text system with similarity measures such as word error rate (WER)
Apache License 2.0
613 stars 94 forks source link

Can somebody explain me what the `_preprocess()` function is doing? #33

Open samarth12 opened 3 years ago

samarth12 commented 3 years ago

I ran the standard example and walk through it in the debugger.

from jiwer import wer

ground_truth = "hello world"
hypothesis = "hello duck"

error = wer(ground_truth, hypothesis)

In the steps, it uses the _preprocess() function to convert the words/tokens into integer representations.

hello world is converted to \x00\x02 and hello duck is converted to \x00\x01. Then the Levenshtein distance is calculated on these strings rather than the original words. I am not sure how that maps to the original definition of Word Error Rate.

Can somebody explain to me what is happening?

nikvaessen commented 3 years ago

Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other. (https://en.wikipedia.org/wiki/Levenshtein_distance)

Due to the fact that we convert each word in a sentence to an unique integer such as \x00, the ground truth and hypothesis sentences become two "words" of integer tokens. We can then use the Levenshtein distance to calculate the insertions (I), deletions (D) or substitutions (S) required to change the hypothesis "integer word" into the ground truth "integer word". We can than simply plug the S, D, and I values we found in the WER formula.

samarth12 commented 3 years ago

Oh, I realized this a little late, the Levenshtein distance in WER is calculated on words rather than characters. So that is done to convert each word into a unique ID, makes sense now. Thank you!