rhasspy / gruut

A tokenizer, text cleaner, and phonemizer for many human languages.
MIT License

Lots of broken tests #40

Open PureTryOut opened 1 year ago

PureTryOut commented 1 year ago

As part of packaging this for Alpine Linux to run mimic3, I'm trying to run gruut's test suite. While some tests succeed, about half of them fail, all in basically the same way.

________________________________________________________________ EnglishTestCase.test_times _________________________________________________________________

self = <tests.test_en.EnglishTestCase testMethod=test_times>

    def test_times(self):
        """Test expansion of times"""
        text = "4:01am and 4:01 p.m."
        sentence = next(sentences(text, lang="en_US"))

>       self.assertEqual(
            ["four", "oh", "one", "A", "M", "and", "four", "oh", "one", "P", "M"],
            [word.text for word in sentence],
        )
E       AssertionError: Lists differ: ['four', 'oh', 'one', 'A', 'M', 'and', 'four', 'oh', 'one', 'P', 'M'] != ['four', 'oh', 'one', 'A', 'M', 'and', 'four', 'oh', 'one', 'p', 'm']
E       
E       First differing element 9:
E       'P'
E       'p'
E       
E       - ['four', 'oh', 'one', 'A', 'M', 'and', 'four', 'oh', 'one', 'P', 'M']
E       ?                                                              ^    ^
E       
E       + ['four', 'oh', 'one', 'A', 'M', 'and', 'four', 'oh', 'one', 'p', 'm']
E       ?                                                              ^    ^

tests/test_en.py:159: AssertionError
_____________________________________________________________ EnglishTestCase.test_unclean_text _____________________________________________________________

self = <tests.test_en.EnglishTestCase testMethod=test_unclean_text>

    def test_unclean_text(self):
        """Test text with lots of noise"""
        text = (
            "IT’S <a> 'test' (seNtEnce) for-only $100, Dr., & [I] ## *like* ## it 100%!"
        )
        sentence = next(sentences(text, lang="en_US"))

>       self.assertEqual(
            [
                "IT'S",
                "<",
                "a",
                ">",
                "'",
                "test",
                "'",
                "(",
                "seNtEnce",
                ")",
                "for",
                "only",
                "one",
                "hundred",
                "dollars",
                ",",
                "Doctor",
                ",",
                "and",
                "[",
                "I",
                "]",
                "*",
                "like",
                "*",
                "it",
                "one",
                "hundred",
                "percent",
                "!",
            ],
            [word.text for word in sentence],
        )
E       AssertionError: Lists differ: ["IT'[70 chars]y', 'one', 'hundred', 'dollars', ',', 'Doctor'[81 chars] '!'] != ["IT'[70 chars]y', '$100', ',', 'Doctor', ',', 'and', '[', 'I[60 chars] '!']
E       
E       First differing element 12:
E       'one'
E       '$100'
E       
E       First list contains 2 additional elements.
E       First extra element 28:
E       'percent'
E       
E         ["IT'S",
E          '<',
E          'a',
E          '>',
E          "'",
E          'test',
E          "'",
E          '(',
E          'seNtEnce',
E          ')',
E          'for',
E          'only',
E       +  '$100',
E       -  'one',
E       -  'hundred',
E       -  'dollars',
E          ',',
E          'Doctor',
E          ',',
E          'and',
E          '[',
E          'I',
E          ']',
E          '*',
E          'like',
E          '*',
E          'it',
E          'one',
E          'hundred',
E          'percent',
E          '!']

tests/test_en.py:18: AssertionError
_________________________________________________________ FrenchTestCase.test_liason_adjective_noun _________________________________________________________

self = <tests.test_fr.FrenchTestCase testMethod=test_liason_adjective_noun>

    def test_liason_adjective_noun(self):
        """Test liason between adjective and noun"""
>       self._without_and_with_liason(
            "J’ai des petites oreilles.",
            "petites",
            ["p", "ə", "t", "i", "t"],
            ["p", "ə", "t", "i", "t", "z"],
        )

tests/test_fr.py:52: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
tests/test_fr.py:81: in _without_and_with_liason
    self.assertEqual(word.phonemes, without_phonemes)
E   AssertionError: None != ['p', 'ə', 't', 'i', 't']

self = <tests.test_sqlite_phonemizer.PhonemizerTestCase testMethod=test_ar>

    def test_ar(self):
        """Arabic test"""
>       self.assertEqual(
            get_phonemes("حَوّامتي مُمْتِلئة", "ar"),
            [
                ("حَوَّامَتُي", ["ħ", "a", "u", "aː", "m", "t", "iː"]),
                ("مُمْتِلِئَة", ["m", "u", "m", "t", "i", "l", "i", "ʔ", "i"],),
            ],
        )
E       AssertionError: Lists differ: [] != [('حَوَّامَتُي', ['ħ', 'a', 'u', 'aː', 'm'[73 chars]i'])]
E       
E       Second list contains 2 additional elements.
E       First extra element 0:
E       ('حَوَّامَتُي', ['ħ', 'a', 'u', 'aː', 'm', 't', 'iː'])
E       
E       - []
E       + [('حَوَّامَتُي', ['ħ', 'a', 'u', 'aː', 'm', 't', 'iː']),
E       +  ('مُمْتِلِئَة', ['m', 'u', 'm', 't', 'i', 'l', 'i', 'ʔ', 'i'])]

tests/test_sqlite_phonemizer.py:16: AssertionError

More tests fail like this, but pasting them all would make this an awfully big post :see_no_evil:

synesthesiam commented 1 year ago

I'd suggest looking at Piper instead of Mimic 3. It's where I'm spending my effort these days working for Nabu Casa.

I don't know when I'll have time to come back to gruut, unfortunately.

msftcangoblowm commented 1 year ago

Strategy on how to tackle test_en.py test_times

test_en.py -- test_times

text = "4:01am and 4:01 p.m."

In text_processor.TextProcessor.process, inline function, in_inline_lexicon

# Do multiple passes over the graph
num_passes_left = max_passes
while num_passes_left > 0:
    ...
    if detect_times:
        if pipeline_transform_window(
            self._collapse_time, graph, root, window_size=2
        ):
            was_changed = True

            if pipeline_transform(self._transform_time, graph, root):
                was_changed = True
    ...

lang.py

EN_TIME_PATTERN = re.compile(
    r"""^((0?[0-9])|(1[0-1])|(1[2-9])|(2[0-3]))  # hours
         (?::
         ([0-5][0-9]))?                          # minutes
         \s*(a\.m\.|am|pm|p\.m\.|a\.m|p\.m)? # am/pm
         $""",
    re.IGNORECASE | re.X,
)

During the while loop:

Node4 "and "
Node5 "4:01 p.m." <-- parent node. NOT identified cuz it's not in the while loop
Node6 "4:01 p.m" <-- leaf node. Identified correctly as Time. False positive

The issue is that Node5 and Node6 are both valid Times according to the regex. Changing the regex does not solve the issue, cuz Node5 is being ignored and Node6 isn't. Rather than fighting it, let's just go with the flow and work with the Node we've got, Node6.

So to repeat: within the while loop, Node4 and Node6 are accessible. Node5 isn't! This is really frustrating. Makes ya wanna shed a tear. So sad.
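
The claim that both strings pass the regex can be checked without gruut installed. This is only a sketch: the pattern is copied from the lang.py snippet quoted above rather than imported, and `node5_text` / `node6_text` are just labels I made up for the two strings discussed here.

```python
import re

# Copied from the lang.py snippet quoted above; NOT imported from gruut.
EN_TIME_PATTERN = re.compile(
    r"""^((0?[0-9])|(1[0-1])|(1[2-9])|(2[0-3]))  # hours
         (?::
         ([0-5][0-9]))?                          # minutes
         \s*(a\.m\.|am|pm|p\.m\.|a\.m|p\.m)? # am/pm
         $""",
    re.IGNORECASE | re.X,
)

# Hypothetical labels for the two node texts from the walkthrough above.
node5_text = "4:01 p.m."  # parent node text
node6_text = "4:01 p.m"   # leaf node text (no trailing period)

for text in (node5_text, node6_text, "and"):
    match = EN_TIME_PATTERN.match(text)
    print(repr(text), "->", "time" if match else "not a time")
```

Both time strings match and "and" does not, which fits the point above: the regex by itself does not tell Node5 and Node6 apart.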

Suggestion

When the false positive Node6 is identified (correctly) as a Time, have code look at its parent Node (Node5). If the parent is also a valid Time, mark the parent, not the leaf. Then during the next iteration of the while loop, hopefully(TM), Node6 will be ignored.

If Node6 doesn't get ignored during subsequent while loop iterations, the code fix will run on every iteration. Each time, it will find that the parent Node (Node5) is already marked as a Time and will not mark the leaf Node (Node6).
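
Roughly what I have in mind, as a toy sketch. Everything here is hypothetical: the `Node` class, the `is_time` flag, `mark_time`, and the simplified `TIME_RE` are stand-ins for illustration and are not gruut's actual graph or API.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Simplified stand-in for the real time regex (assumption, not gruut's pattern).
TIME_RE = re.compile(
    r"^\d{1,2}:\d{2}\s*(a\.m\.|am|pm|p\.m\.|a\.m|p\.m)?$", re.IGNORECASE
)


@dataclass
class Node:
    """Hypothetical node with a parent pointer; gruut's real nodes differ."""

    text: str
    parent: Optional["Node"] = None
    is_time: bool = False


def mark_time(leaf: Node) -> Optional[Node]:
    """Mark the parent as a Time if it qualifies, otherwise mark the leaf.

    Returns the node that carries the mark, or None if the leaf is not a time.
    """
    if not TIME_RE.match(leaf.text):
        return None
    parent = leaf.parent
    if parent is not None and TIME_RE.match(parent.text):
        parent.is_time = True  # safe to repeat on every pass
        return parent
    leaf.is_time = True
    return leaf


node5 = Node("4:01 p.m.")
node6 = Node("4:01 p.m", parent=node5)

# Simulate several passes of the while loop visiting only the leaf.
for _ in range(3):
    marked = mark_time(node6)

print(node5.is_time, node6.is_time)  # True False
```

Setting the flag on the parent is idempotent, so it doesn't matter if the leaf keeps getting visited on later passes.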

Note

Navigating the Node tree in text_processor is not for the faint of heart. It uses itertools recipes, so it's like a puzzle with pieces missing, cuz you can't inspect an Iterator without consuming items from it. Most coders, myself included, are not familiar with itertools. So tracking down the cause is a daunting, time-consuming task.
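
For anyone else debugging this, one common workaround for the "can't look without consuming" problem is to take the first item and chain it back on. This is a generic standard-library sketch, not code from gruut; `peek` and the sample list are made up for illustration.

```python
import itertools
from typing import Iterator, Tuple, TypeVar

T = TypeVar("T")


def peek(iterator: Iterator[T]) -> Tuple[T, Iterator[T]]:
    """Return the first item plus an iterator that still yields it.

    Raises StopIteration if the iterator is already empty.
    """
    first = next(iterator)
    return first, itertools.chain([first], iterator)


# Hypothetical stand-in for a stream of node texts.
nodes = iter(["and ", "4:01 p.m.", "4:01 p.m"])

first, nodes = peek(nodes)
print(first)        # and 
print(list(nodes))  # ['and ', '4:01 p.m.', '4:01 p.m']
```

The catch is that you have to keep using the returned iterator, not the original one, so this only helps where you control the variable being iterated.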