Open yashk2000 opened 1 year ago
Thanks for reporting! Would you be able to share the length of the vocabulary object when generated from your input?
Yep, it's 4484.
I cannot reproduce it with the following toy data :(
import random
import string
from typing import List

import jiwer


def random_word(low=2, high=10, rng=random.Random()) -> str:
    # Build a random lowercase word. randint is inclusive on both ends,
    # so the length falls in [low, high + 1].
    word = ""
    for i in range(rng.randint(low, high + 1)):
        word += rng.choice(string.ascii_lowercase)
    return word


def generate_sentence(vocabulary: List[str], low=1, high=12, rng=random.Random()):
    # Join a random number of vocabulary words into one space-separated sentence.
    sentence = []
    for i in range(rng.randint(low, high + 1)):
        sentence.append(rng.choice(vocabulary))
    return " ".join(sentence)


NUM_SENTENCE = 500_000
NUM_WORDS = 5000

print('generating vocab...')
# Deduplicate, so the vocabulary may end up slightly smaller than NUM_WORDS.
vocabulary = list(set([random_word() for _ in range(NUM_WORDS)]))
print(len(vocabulary))

print("generating reference...")
ref = [generate_sentence(vocabulary) for _ in range(NUM_SENTENCE)]

print("generating hypotheses...")
hyp = [generate_sentence(vocabulary) for _ in range(NUM_SENTENCE)]

print("calculating wer...")
print(jiwer.wer(ref, hyp))
Can you share the word which fails to be included in the vocabulary?
The words that are not included are ordinary English words. Words from entire sentences are missing, for example "Australia", "he", and "run".
Some sentences in my list also include numbers like "1" and "10", and occasionally non-English characters. Could this be a potential cause of the issue?
I don't think the size is the issue; I think a specific sentence pairing fails. When you tested chunks of the dataset, did those chunks still span the entire range of reference/hypothesis pairs?
Also, do you use a custom transform, or do you use the default?
@nikvaessen when I test chunks, the chunks do span the entire range of the pairs. I have also tried computing the WER by looping over one pair at a time, and that works too.
I'm using the default transform.
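For anyone who needs a number while this is being debugged, the pair-at-a-time loop mentioned above could be sketched roughly as follows. This is a minimal pure-Python sketch, not jiwer's implementation: `word_edit_distance` and `corpus_wer` are hypothetical helper names, and it assumes plain whitespace tokenization, which may differ from the transform jiwer applies. It sums the edits and reference lengths over all pairs rather than averaging per-pair WER values, since the two give different results.

```python
from typing import List, Sequence


def word_edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance between two token sequences."""
    # prev[j] holds the distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def corpus_wer(references: List[str], hypotheses: List[str]) -> float:
    """Corpus-level WER: total edits divided by total reference words."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have equal length")
    total_edits = 0
    total_words = 0
    for ref, hyp in zip(references, hypotheses):
        # Assumption: whitespace tokenization, one pair at a time.
        ref_words = ref.split()
        hyp_words = hyp.split()
        total_edits += word_edit_distance(ref_words, hyp_words)
        total_words += len(ref_words)
    if total_words == 0:
        raise ValueError("references contain no words")
    return total_edits / total_words
```

Because every pair is processed independently, no vocabulary shared across the whole list is ever built, which is presumably why the per-pair loop avoids the error.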
Issue
When passing a very long list of strings (>350k strings) as the reference and hypothesis, jiwer gives the following error:
chr() arg not in range(0x110000)
What's been tried:
The error only seems to happen when the entire long list is passed into jiwer.
Additional Context
It seems like the vocabulary in the _word2char function isn't built properly. After adding words from the first N sentences in the list, words from the rest of the sentences do not seem to be part of the vocabulary. This results in the chr() arg not in range(0x110000) error when these lines are executed.
Jiwer version: v3.0.3
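To illustrate where an error like this can come from, here is a minimal sketch of the general word-to-character technique: each distinct word gets an integer id, and each id is turned into a single character with `chr()`. This is not jiwer's actual `_word2char` code. `encode_words_as_chars` is a hypothetical name, and the explicit bounds check is an addition for illustration. The only hard fact relied on is that Python's `chr()` accepts integers in `range(0x110000)` and raises `ValueError` otherwise.

```python
from typing import Dict, List, Tuple

# chr() only accepts 0 <= i < 0x110000 (the Unicode code point range).
MAX_CODE_POINT = 0x110000


def encode_words_as_chars(
    sentences: List[List[str]],
) -> Tuple[List[str], Dict[str, int]]:
    """Map every distinct word to one character so that a character-level
    edit distance can be run on the encoded strings.

    Illustrative sketch only, not the library's implementation.
    """
    # Collect the vocabulary over ALL sentences before assigning ids.
    vocabulary = sorted({word for sentence in sentences for word in sentence})
    if len(vocabulary) > MAX_CODE_POINT:
        raise ValueError(
            f"vocabulary of {len(vocabulary)} words cannot be mapped to "
            f"unique characters (limit is {MAX_CODE_POINT})"
        )
    word_to_id = {word: idx for idx, word in enumerate(vocabulary)}
    encoded = [
        "".join(chr(word_to_id[word]) for word in sentence)
        for sentence in sentences
    ]
    return encoded, word_to_id
```

In a scheme like this, the error would appear whenever the integer handed to `chr()` falls outside the valid range, for example if ids are assigned from something much larger than the set of distinct words. Whether that is what happens in v3.0.3 is not established here.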