rendello / layout

Add tests for UTF-8 edge-cases in upper- and lowercasing #8

Open rendello opened 1 week ago

rendello commented 1 week ago

#7 was closed by 2cdbd24, but I'm uncomfortable that the supposed bug I found while changing how things worked never reared its head in the previous property tests. I'm also not confident that the new code will always do what I want, since some of the logic is now more complex.

I'd like to use Python to generate a few big tables of UTF-8 characters with the following characteristics:

Unicode defines casing properties, see here. These can be used to create the first "table" without having to write it to code. I imagine the second two tables will be pretty small.
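A rough sketch of how such a table could be derived in Python from the interpreter's built-in Unicode data rather than written out by hand. The three bucket names (`no_change`, `same_width`, `width_change`) are hypothetical placeholders, not necessarily the categories intended in this issue, and `str.lower()` is only assumed to match the lowercasing used by the real code.

```python
import sys


def classify(cp: int) -> str:
    """Bucket a code point by what lowercasing does to its UTF-8 encoding."""
    ch = chr(cp)
    low = ch.lower()
    if low == ch:
        return "no_change"
    if len(low.encode("utf-8")) == len(ch.encode("utf-8")):
        return "same_width"
    return "width_change"


def build_tables() -> dict[str, list[int]]:
    """Sort every encodable code point into one of the three buckets."""
    tables: dict[str, list[int]] = {
        "no_change": [],
        "same_width": [],
        "width_change": [],
    }
    for cp in range(sys.maxunicode + 1):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates cannot be encoded as UTF-8
        tables[classify(cp)].append(cp)
    return tables


tables = build_tables()
```

The `no_change` bucket is by far the largest, which is why deriving it from the casing properties is preferable to storing it.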

rendello commented 1 week ago

The "no case change" characters should be run through with must_normalize set to both true and false with the returns expected to be the same.
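As a sketch of that property in Python: `process` below is a hypothetical stand-in for whatever function takes the text and the must_normalize flag, and `normalize_mismatches` is an illustrative name, not something from the repository.

```python
from typing import Callable, Iterable


def normalize_mismatches(
    process: Callable[[str, bool], object],
    chars: Iterable[str],
) -> list[str]:
    """Return the characters for which must_normalize changes the result."""
    return [ch for ch in chars if process(ch, True) != process(ch, False)]


def stand_in(text: str, must_normalize: bool) -> str:
    """Placeholder for the real function: lowercases only when asked to."""
    return text.lower() if must_normalize else text


# Characters without a case change should never produce a mismatch.
caseless_failures = normalize_mismatches(stand_in, ["a", "1", "\u1403"])
```

An empty list means the property holds for every character checked. Returning the offending characters instead of asserting inside the loop makes a failure easier to diagnose.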

rendello commented 1 week ago

I'm having difficulty with characters like ﬀ (U+FB00), which can decompose to multiple characters, which would ruin my indexing when the test is written. I think I should finish the tests, but then replace the lowercasing call with a custom one built from generate.py, which only normalizes characters in table.tsv. In fact, if I do it this way, I can probably replace this pop_syllabic_unit() function again, using a normalization table that stores a byte-offset change per character. For example, Ɫ (3 bytes in UTF-8) to ɫ (2 bytes) is -1 byte, so a single occurrence shifts the offset by 1 * -1 = -1 from the original.
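To make the offset idea concrete, here is a minimal Python sketch. `byte_delta` and `offset_shift` are hypothetical names, `str.lower()` stands in for the table-driven lowering built from table.tsv, and the example character U+2C62 is chosen only because its lowercase form is one byte shorter in UTF-8.

```python
def byte_delta(ch: str) -> int:
    """Change in UTF-8 length when ch is lowercased (negative means shorter)."""
    return len(ch.lower().encode("utf-8")) - len(ch.encode("utf-8"))


def offset_shift(consumed: str) -> int:
    """Total byte shift accumulated over the characters consumed so far."""
    return sum(byte_delta(ch) for ch in consumed)


# U+2C62 takes 3 bytes; its lowercase U+026B takes 2, so the delta is -1.
consumed = "\u2C62a"
lowered_len = len(consumed.lower().encode("utf-8"))

# Subtracting the shift maps a lowered byte offset back to the original buffer.
original_len = lowered_len - offset_shift(consumed)
```

With one occurrence of a -1 character the shift is -1, with two it is -2, and so on, so the table only needs one signed number per character.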

rendello commented 6 days ago

Data files added in 2a1283e & 1e69ae8.

rendello commented 6 days ago

Related Gist.

rendello commented 6 days ago

Test cases added in 24bc640. The commit message says it all, however:

No bugs found, despite a seemingly obvious one in pop_syllabic_unit, it's clear that the test properties will have to be more granular somehow.

rendello commented 6 days ago

When viewing the characters in a web browser with the extension active, you can observe only one bug:

Screenshot 2024-09-16 at 4 20 57 PM

Here, "İ" is rendered in red as an Inuktitut word (consisting of a single syllabic unit). What I think is happening is that the single char İ is lowered to i̇, which has two codepoints (i followed by U+0307 COMBINING DOT ABOVE). The first codepoint is a regular ASCII i, which is considered a valid Latin SyllabicUnit. The index advances one character, which would leave the combining dot in the lowercase version, but in the original buffer, one character is the entire string. So, the lowercase of İ is treated as i.
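The mismatch can be reproduced in a few lines of Python, assuming `str.lower()` applies the same full Unicode case mapping as the lowercasing in the extension. The variable names are illustrative only.

```python
original = "\u0130"         # İ, LATIN CAPITAL LETTER I WITH DOT ABOVE
lowered = original.lower()  # "i" followed by U+0307 COMBINING DOT ABOVE

codepoints = [f"U+{ord(c):04X}" for c in lowered]
original_bytes = len(original.encode("utf-8"))
lowered_bytes = len(lowered.encode("utf-8"))

# Consuming one character ("i") from the lowered text leaves the combining
# dot behind, while consuming one character from the original leaves nothing.
rest_of_lowered = lowered[1:]
rest_of_original = original[1:]
```

After one step the two buffers disagree about what remains, which is consistent with the İ being accepted as a plain i.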

If I generated my tables correctly, this character would be the only one this would happen to, so it's probably not a huge deal, but should still be dealt with.
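One way to double-check that claim independently of the tables is a brute-force scan in Python. This is a sketch that assumes `str.lower()` matches the lowercasing used here, and `expanding_lowercase` is a hypothetical helper name. The result depends on the Unicode version bundled with the interpreter.

```python
import sys


def expanding_lowercase() -> list[int]:
    """Code points whose lowercase form is more than one code point long."""
    return [
        cp
        for cp in range(sys.maxunicode + 1)
        if len(chr(cp).lower()) > 1
    ]


expanding = expanding_lowercase()
```

If the tables are right, `expanding` should contain U+0130 and nothing else. Any other entry would need the same handling.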