googlefonts / shaperglot

Test font files for language support
Apache License 2.0

Variation in GFLang orthography definitions #7

Open NeilSureshPatel opened 1 year ago

NeilSureshPatel commented 1 year ago

@simoncozens @moyogo I have been reviewing the GFLang dataset and I am seeing a mix in the way character sets are catalogued.

Here are two examples:

bas_Latn

exemplar_chars {
  base: "a á à â ǎ ā {a᷆}{a᷇} b ɓ c d e é è ê ě ē {e᷆}{e᷇} ɛ {ɛ́} {ɛ̀} {ɛ̂} {ɛ̌} {ɛ̄} {ɛ᷆}{ɛ᷇} f g h i í ì î ǐ ī {i᷆}{i᷇} j k l m n ń ǹ ŋ o ó ò ô ǒ ō {o᷆}{o᷇} ɔ {ɔ́} {ɔ̀} {ɔ̂} {ɔ̌} {ɔ̄} {ɔ᷆}{ɔ᷇} p r s t u ú ù û ǔ ū {u᷆}{u᷇} v w y z {a᷆} {a᷇} {e᷆} {e᷇} {ɛ́} {ɛ̀} {ɛ̂} {ɛ̌} {ɛ̄} {ɛ᷆} {ɛ᷇} {i᷆} {i᷇} {o᷆} {o᷇} {ɔ́} {ɔ̀} {ɔ̂} {ɔ̌} {ɔ̄} {ɔ᷆} {ɔ᷇} {u᷆} {u᷇}"
  auxiliary: "q x"
  marks: "◌̀ ◌́ ◌̂ ◌̄ ◌̌ ◌᷆ ◌᷇"
  numerals: "  - ‑ , % ‰ + 0 1 2 3 4 5 6 7 8 9"
  index: "A B Ɓ C D E Ɛ F G H I J K L M N Ŋ O Ɔ P R S T U V W Y Z"
}

bin_Latn

exemplar_chars {
  base: "A B D E F G H I K L M N O P R S T U V W Y Z Á É È Ẹ Í Ó Ò Ọ Ú a b d e f g h i k l m n o p r s t u v w y z á é è ẹ í ó ò ọ ú \'"
  marks: "◌̀ ◌́ ◌̣"
}

bas_Latn has base and auxiliary characters broken out, whereas bin_Latn does not. What is also interesting is that bas_Latn maps out the base/mark pairs that are not precomposed. I really like having that data right in GFLang. As we update GFLang, we could make this a consistent practice. Then we could run the no_orphaned_marks check by default and not have to add it manually to a shaperglot profile.

Secondly, the orthographies check should probably also look for an auxiliary category and test those glyphs as well.

NeilSureshPatel commented 1 year ago

It looks like these non-precomposed pairs in the ortho list cause the following error when running shaperglot.


  File "/home/neilspatel/.local/bin/shaperglot", line 8, in <module>
    sys.exit(main())
  File "/home/neilspatel/.local/lib/python3.9/site-packages/shaperglot/cli.py", line 97, in main
    options.func(options)
  File "/home/neilspatel/.local/lib/python3.9/site-packages/shaperglot/cli.py", line 47, in check
    results = checker.check(langs[lang])
  File "/home/neilspatel/.local/lib/python3.9/site-packages/shaperglot/checker.py", line 32, in check
    check_object.execute(self)
  File "/home/neilspatel/.local/lib/python3.9/site-packages/shaperglot/checks/orthographies.py", line 30, in execute
    missing = [x for x in self.bases if ord(x) not in checker.cmap]
  File "/home/neilspatel/.local/lib/python3.9/site-packages/shaperglot/checks/orthographies.py", line 30, in <listcomp>
    missing = [x for x in self.bases if ord(x) not in checker.cmap]
TypeError: ord() expected a character, but string of length 4 found

If we decide to include non-precomposed base/mark pairs in the ortho list, we need to filter them out for the orthographies check.
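A minimal sketch of such a filter (illustrative only; the variable names follow the list comprehension in the traceback, but this is not shaperglot's actual code):

```python
# Skip multi-character (decomposed base+mark) entries before calling ord(),
# which only accepts a single character. Illustrative sketch, not
# shaperglot's actual fix.
bases = ["a", "\u00e1", "a\u1dc6", "\u025b\u1dc7"]  # single chars plus base/mark strings
cmap = {ord("a")}  # pretend the font only maps "a"
missing = [x for x in bases if len(x) == 1 and ord(x) not in cmap]
print(missing)  # ['\u00e1'] -- the decomposed pairs are skipped, not crashed on
```

The decomposed pairs would then need their own check (e.g. shaping-based, as discussed below in the thread).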
simoncozens commented 1 year ago

OK, while we are waiting for gflang to be better, I'll improve the parsing of decomposed glyphs.

simoncozens commented 1 year ago

I've changed things so that we can test whether either precomposed or decomposed glyphs are shapable by the font. To do this, I run the glyph-or-glyphs through HarfBuzz and look for any .notdefs - this is actually a better test than going through the cmap table, so it's a nice improvement.
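The core of that .notdef test can be sketched as follows (a minimal illustration, not shaperglot's actual implementation; the uharfbuzz calls in the comment are the assumed way to obtain shaped glyph ids):

```python
# Sketch of the ".notdef" shapability test: after shaping, glyph id 0 is
# .notdef by convention, so any 0 in the buffer means the font could not
# shape the input. Illustrative only, not shaperglot's actual code.
def all_glyphs_shaped(glyph_ids):
    """True if no .notdef (glyph id 0) appears among the shaped glyph ids."""
    return all(gid != 0 for gid in glyph_ids)

# With uharfbuzz, the shaped ids for some `font` and `text` would come from:
#   buf = hb.Buffer()
#   buf.add_str(text)
#   buf.guess_segment_properties()
#   hb.shape(font, buf)
#   glyph_ids = [info.codepoint for info in buf.glyph_infos]

print(all_glyphs_shaped([37, 112]))  # True: everything mapped to a real glyph
print(all_glyphs_shaped([37, 0]))    # False: a .notdef appeared
```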

NeilSureshPatel commented 1 year ago

Thanks, I'll give it a try. That does sound more robust. I guess it's more like how you do it in the orphaned marks test.

moyogo commented 1 year ago

@simoncozens Does this glyph-or-glyphs do the same as https://github.com/enabling-languages/python-i18n/wiki/icu.CanonicalIterator ?

NeilSureshPatel commented 1 year ago

I don't believe it checks all permutations, but that would be nicer than having to include all permutations in the exemplar characters.

NeilSureshPatel commented 1 year ago

I guess we don't really need anything as complicated as icu.CanonicalIterator, since HarfBuzz is going to collapse what it can into precomposed forms. All that is required is getting all permutations of the mark sequence following the base. I just built something in the script I am using to generate the no_orphaned_marks tests that seems to work.

    import itertools

    for basemark in basemarks:
        if len(basemark) > 2:  # a base followed by two or more marks
            base_only = basemark[0]
            marks_only = basemark[1:]
            for perm in itertools.permutations(marks_only):
                # base first, then each ordering of the marks
                # (base_only.join(perm) would wrongly interleave the base)
                new_basemarks.append(base_only + "".join(perm))
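A self-contained version of that expansion, with a worked example (my sketch of the approach described above, not the exact script):

```python
import itertools

def expand_mark_permutations(basemark):
    """Return the base followed by every ordering of its trailing marks.

    Sketch of the mark-permutation idea discussed in this thread;
    the function name is my own, not from the actual script.
    """
    base, marks = basemark[0], basemark[1:]
    return [base + "".join(p) for p in itertools.permutations(marks)]

# e + combining acute + combining grave yields both mark orderings:
print(expand_mark_permutations("e\u0301\u0300"))
# ['e\u0301\u0300', 'e\u0300\u0301']
```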

In any case, I don't think we want all the permutations in gflang. We already have the subsets of all multiple-mark combinations in the exemplar_chars list. We should be able to handle all the combinations either directly in the shaperglot check or enumerate them in the shaperglot language profile.

simoncozens commented 1 year ago

Yeah, I'm being silly. We definitely don't want all permutations in gflang, because if we want decomposition we can handle it ourselves.

I don't even know if it's worthwhile to test all decomposition permutations, given that the first thing Harfbuzz is going to do when it sees the text is normalize it.

NeilSureshPatel commented 1 year ago

I think I need to do a couple of tests to be sure. If I recall correctly, for Yoruba e\u0323\u0300 and e\u0300\u0323 have different cluster behavior, which can result in one sequence having an orphaned mark and the other not. Let me confirm this.

simoncozens commented 1 year ago

That sort of thing is certainly true for syllabic scripts like Myanmar, which is precisely why we don’t want to be throwing every permutation at the shaper - not all of them will be orthographically correct.

NeilSureshPatel commented 1 year ago

That probably means permutations are best handled in the shaperglot language profile when they are orthographically appropriate, rather than automatically within the shaperglot check.

NeilSureshPatel commented 1 year ago

@simoncozens, correct me if I am wrong, but the way shaperglot instantiates HarfBuzz there is no font fallback, so we are only looking at the specific fonts being tested.

NeilSureshPatel commented 1 year ago

> I think I need to do a couple tests to be sure. If I recall correctly, for Yoruba e\u0323\u0300 and e\u0300\u0323 has different cluster behavior, which can result in one sequence having an orphaned mark and the other not. Let me confirm this.

OK, I ran this test through shaperglot and printed out the buffers. Test 1 is e\u0323\u0301 against e\u0301\u0323, and Test 2 is e\u0323\u0301 against é\u0323.

- check: shaping_differs
  inputs:
    - text: "ẹ́"
    - text: "ẹ́"
      language: "ro"
  differs:
    - cluster: 0
      glyph: 0
    - cluster: 0
      glyph: 0
  rationale: "in Yoruba"
- check: shaping_differs
  inputs:
    - text: "ẹ́"
    - text: "ẹ́"
      language: "ro"
  differs:
    - cluster: 0
      glyph: 0
    - cluster: 0
      glyph: 0
  rationale: "in Yoruba"

The output is definitely normalized.

Test1
uni1EB9=0+506|acutecomb=0@-317,0+0 uni1EB9=0+506|acutecomb=0@-317,0+0
uni1EB9=0+506|acutecomb=0@-317,0+0 uni1EB9=0+506|acutecomb=0@-317,0+0

Test2
uni1EB9=0+506|acutecomb=0@-317,0+0 uni1EB9=0+506|acutecomb=0@-317,0+0
uni1EB9=0+506|acutecomb=0@-317,0+0 uni1EB9=0+506|acutecomb=0@-317,0+0

I think that settles the question about permutations.
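The same equivalence can be demonstrated outside the shaper with Python's unicodedata (a quick illustration of why the buffers above match: HarfBuzz normalizes the input before shaping):

```python
import unicodedata

# Both mark orders for Yoruba e + dot-below + acute normalize to the same
# NFC string: precomposed U+1EB9 (e with dot below) followed by U+0301,
# since no fully precomposed "e with dot below and acute" exists in Unicode.
a = unicodedata.normalize("NFC", "e\u0323\u0301")
b = unicodedata.normalize("NFC", "e\u0301\u0323")
print(a == b)               # True
print(a == "\u1eb9\u0301")  # True
```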