jitsi / jiwer

Evaluate your speech-to-text system with similarity measures such as word error rate (WER)
Apache License 2.0

jiwer gives an error when passed a very long list of strings #83

Open yashk2000 opened 1 year ago

yashk2000 commented 1 year ago

Issue

When passing a very long list of strings (>350k strings) as the reference and hypothesis, jiwer gives the following error:

chr() arg not in range(0x110000)
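For context, this ValueError comes from Python's built-in chr(), which only accepts code points in the range 0 to 0x10FFFF inclusive:

```python
# chr() maps an integer code point to a one-character string.
# Valid inputs are 0 <= i <= 0x10FFFF; anything at or above
# 0x110000 raises the exact error seen in this report.
print(chr(0x10FFFF))  # highest valid code point, prints fine

try:
    chr(0x110000)
except ValueError as e:
    print(e)  # chr() arg not in range(0x110000)
```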

What's been tried:

The error only seems to happen when the entire long list is passed into jiwer.

Additional Context

It seems like the vocabulary in the _word2char function isn't built properly. After adding words from the first N sentences in the list, words from the rest of the sentences do not seem to be part of the vocabulary. This results in the chr() arg not in range(0x110000) error when these lines are executed.
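To illustrate the failure mode being described, here is a minimal sketch of the kind of word-to-character encoding jiwer performs internally (the real _word2char implementation may differ): each unique word is assigned an index and mapped to a single Unicode character via chr(), so any index at or above 0x110000 would reproduce this error.

```python
# Hedged sketch, NOT jiwer's actual code: encode sentences so that each
# unique word becomes one character, allowing character-level edit
# distance to stand in for word-level edit distance.
def word2char_sketch(reference, hypothesis):
    vocab = {}  # word -> integer index, assigned in order of first appearance

    def encode(sentences):
        encoded = []
        for sentence in sentences:
            chars = []
            for word in sentence.split():
                if word not in vocab:
                    vocab[word] = len(vocab)
                # chr() raises ValueError once an index reaches 0x110000,
                # i.e. once the vocabulary exceeds 1,114,112 unique words
                chars.append(chr(vocab[word]))
            encoded.append("".join(chars))
        return encoded

    return encode(reference), encode(hypothesis), vocab
```

If words from later sentences were somehow dropped from this shared vocabulary, or if indices grew past 0x110000, the chr() call above would fail exactly as reported.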

Jiwer version - v3.0.3

nikvaessen commented 1 year ago

Thanks for reporting! Would you be able to share the length of the vocabulary object when generated from your input?

yashk2000 commented 1 year ago

Yep, it's 4484.

nikvaessen commented 1 year ago

I cannot reproduce it with the following toy data :(

import random
import string

from typing import List

import jiwer

def random_word(low=2, high=10, rng=random.Random()) -> str:
    word = ""

    for i in range(rng.randint(low, high + 1)):
        word += rng.choice(string.ascii_lowercase)

    return word

def generate_sentence(vocabulary: List[str], low=1, high=12, rng=random.Random()):
    sentence = []

    for i in range(rng.randint(low, high + 1)):
        sentence.append(rng.choice(vocabulary))

    return " ".join(sentence)

NUM_SENTENCE = 500_000
NUM_WORDS = 5000

print('generating vocab...')
vocabulary = list(set([random_word() for _ in range(NUM_WORDS)]))
print(len(vocabulary))

print("generating reference...")
ref = [generate_sentence(vocabulary) for _ in range(NUM_SENTENCE)]
print("generating hypotheses...")
hyp = [generate_sentence(vocabulary) for _ in range(NUM_SENTENCE)]
print("calculating wer...")
print(jiwer.wer(ref, hyp))

Can you share the word which fails to be included in the vocabulary?

yashk2000 commented 1 year ago

The words which are not included are normal English words such as "Australia", "he", and "run"; in some cases, every word from an entire sentence is missing.

Some sentences in my list also include numbers like "1", "10", and so on, and can include non-English characters at times too. Could this be a potential cause of the issue?

nikvaessen commented 1 year ago

I don't think the size is the issue; I suspect a specific sentence pair is failing. When you tested chunks of the dataset, did those chunks still span the entire range of reference/hypothesis pairs?

Also, do you use a custom transform, or do you use the default?

yashk2000 commented 1 year ago

@nikvaessen when I test chunks, the chunks do span the entire range of the pairs. I have also tried computing the WER by looping over one pair at a time, and that also works.

I'm using the default transform.
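Since per-pair scoring works but the full list fails, one way to narrow this down is to grow a prefix of the data until the error appears, isolating the smallest failing slice. This is a hedged debugging sketch, not part of jiwer; `score_fn` would be `jiwer.wer` in practice, and `step` is an arbitrary chunk size.

```python
def find_failing_prefix(ref, hyp, score_fn, step=1000):
    """Score progressively larger prefixes of (ref, hyp) until score_fn
    raises ValueError; return the prefix length at which it first fails,
    or None if the whole dataset scores cleanly."""
    for end in range(step, len(ref) + step, step):
        try:
            score_fn(ref[:end], hyp[:end])
        except ValueError:
            return end
    return None
```

Usage would be `find_failing_prefix(ref, hyp, jiwer.wer)`; once a failing prefix is found, the same idea can be applied with a smaller step over just that slice to pin down the exact pair.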