starwing / luautf8

a utf-8 support module for Lua and LuaJIT.
MIT License
412 stars 68 forks source link

Implementing grapheme cluster segmentation support #46

Closed alexdowad closed 1 year ago

alexdowad commented 1 year ago

Hey, @starwing... I am thinking of implementing grapheme cluster support for this library. I might do it very soon, or maybe later, but first it will be good to hear from you: Do you have any ideas what the API should look like?

starwing commented 1 year ago

Unfortunately I have no idea about this :( for glyph clusters, we need a double layer of iteration. But it's not comfortable for current Lua for-based iterations. And just return tables of tables seems too heavy. Maybe it's better to just implement a routine that accepts a char/byte offset and returns a iterator that iterates the current glyph cluster?

alexdowad commented 1 year ago

@starwing OK.

What if I provide multiple iterators (all of them compatible with for)? We could have one iterator which returns each grapheme cluster as a string, then a different one which returns each cluster as a list of numeric codepoints.

starwing commented 1 year ago

We could have one iterator which returns each grapheme cluster as a string

It's the only route we need to implement, because we have utf8.codes() already.

But that would make allocations of strings in each iteration.

alexdowad commented 1 year ago

It's the only route we need to implement, because we have utf8.codes() already.

But that would make allocations of strings in each iteration.

Exactly. We already have to decode the UTF-8 code units to segment grapheme clusters; so if we provide an iterator which returns codepoints rather than substrings, it could potentially increase performance. On the other hand, it means your library would then have one more API function.

Would you prefer to keep the API minimal (but also limit performance), or expand the API a bit more (but provide higher-performance options for those who can use them)?

starwing commented 1 year ago

Would you prefer to keep the API minimal (but also limit performance), or expand the API a bit more (but provide higher-performance options for those who can use them)?

That is not the real issue. You cannot make a single iterator for glyph clusters, because a cluster has more than one code point. So either you make a iterator that accepts an offset, and yields only code points in one cluster, or makes a iterator that yields iterators that yields code points in each cluster, which is difficult to use.

alexdowad commented 1 year ago

That is not the real issue. You cannot make a single iterator for glyph clusters, because a cluster has more than one code point. So either you make a iterator that accepts an offset, and yields only code points in one cluster, or makes a iterator that yields iterators that yields code points in each cluster, which is difficult to use.

Well, you can easily make an iterator which yields substrings (each substring containing one grapheme cluster), and you can easily make an iterator which yields tables (each table containing a list of codepoints comprising one grapheme cluster).

Another option: since most grapheme clusters just contain a single codepoint, return an integer if the cluster only has one codepoint, or a table if it has more than one. Then the caller has to branch off type(cluster) to know what to do with the yielded value.

Another option: an iterator which just yields byte offset and byte length for each grapheme cluster. Then the caller can either extract a substring if that's what they want, or use a different function (maybe utf8.next) to iterate over the codepoints in the cluster.

That is 4 different options. My inclination right now is to implement option 3 and option 4.

Erutuon commented 1 year ago

The byte offset and length option (option 4), or a similar option of two byte positions to be used in string.sub, seems the least expensive in terms of memory and the most straightforward and versatile. It would easily allow you to get the first 5 graphemes from a string, whereas the other options would make that operation more complicated. With the substring iterator option, you'd have to concatenate strings (creates several intermediate strings), and with the code points option, you'd have to calculate the UTF-8-encoded length of all the code points up to the grapheme you're looking at and add them all up to get the byte positions. Code points and substrings can be easily gotten from byte positions (something like below, with the two byte positions version), but converting code points and substrings to byte positions is more complicated and error-prone.

local s = "ā́ḉẹ̆ĩou"
local i = 0
for s, e in utf8.grapheme_indices(s) do
  i = i + 1
  if i == 5 then
    print("First five graphemes", s:sub(1, e))
    print("Code points in first five graphemes", utf8.codepoint(s:sub(1, e)))
    break
  end
end

What I remember doing with graphemes (in Rust, not Lua) is 1. limiting the visible length of a printed string and 2. listing visual characters in Latin script with combining diacritics. Both could be done with the byte offset option.

I think it would be convenient for the iterator to yield two byte positions, like in my example, because then you can directly plug them into string.sub or utf8.codepoint to slice or get code points of the original string, which I imagine would be common. With byte offset and length you'd need to always remember the - 1 in string.sub(s, start, start + length - 1 (or the + 1 in string.sub(s, start + 1, start + length) ifstart` is a 0-based offset like in C).

alexdowad commented 1 year ago

@Erutuon I like the way you are thinking.

I would just add that it will still be good if grapheme_indices can optionally accept a byte offset to start iterating from, and a byte offset to end at. (By default, start will be the beginning of the input string, and stop will be the end.)

This adds more versatility with zero extra implementation effort.

alexdowad commented 1 year ago

Hmm, question... Is it better if grapheme_indices optionally takes byte offsets, or codepoint offsets?

alexdowad commented 1 year ago

Hmm. I have seen that the Lua "stateless iterator" pattern may not work well here... since it expects the first value returned by the iterator function to be the input which gets passed back to the iterator function for the next iteration.

I would like to return start, end byte offsets, but want to use end as the iteration state variable which gets passed back to the iterator function for the next iteration.

alexdowad commented 1 year ago

Hmm. I have seen that the Lua "stateless iterator" pattern may not work well here... since it expects the first value returned by the iterator function to be the input which gets passed back to the iterator function for the next iteration.

OK, I've overcome this issue.

starwing commented 1 year ago

OK, I've overcome this issue.

Just return a c closure to handle this.

alexdowad commented 1 year ago

My code is passing all 1187 official test cases for grapheme segmentation now.

Still needs to be fuzzed for 1) differences in behavior from ICU and 2) memory errors.

alexdowad commented 1 year ago

@starwing, I have found an issue here...

luautf8 has tables to identify codepoints within certain categories such as "control", "alphanumeric", etc. The table for "control" characters includes Unicode categories Cc (good!), Cf (also good!), and Co (very questionable!).

"Co" is private-use codepoints, like those which are higher than U+F0000. I find it quite strange that these are counted as "control" codepoints by luautf8.

I checked where luautf8 uses this table, and it's only used in one place: to handle matches for %c in utf8.match.

Do you really want %c to match private-use codepoints?? The only justification which I can imagine for this is that the Unicode category starts with "C". Otherwise it just seems really strange.

The reason why I discovered this is because I am using the existing cntrl_table to implement the Unicode grapheme segmentation rules, and putting private-use codepoints in that table breaks the correctness of segmentation. I am getting different results from ICU when there is a private-use codepoint followed by a combining mark.

(Thank you, thank you, thank you to the LLVM guys for libfuzzer! I would not have found this without fuzzing!)

starwing commented 1 year ago

Good catch! I choose the category from Python? perl? Vim? or iconv? at past. So it could be an error. It's okay to be fixed at this time. And that makes sense for fuzzy test! Maybe I should pay more attention to add GitHub actions for automated testing for tests and fuzzy tests.

alexdowad commented 1 year ago

@starwing, about fuzzing... since this library is designed to be compatible with Lua 5.3's utf8 module, fuzzing the common functions to see if there are any cases where their behavior differs from Lua 5.3 might be a very good idea.

alexdowad commented 1 year ago

@starwing, after adjusting cntrl_table to exclude private use codepoints, I just ran my fuzzer for 15 minutes without finding any other bugs. I don't know how many millions of test cases were checked by the fuzzer, but it is enough to give me significant confidence that the code is good.

Both ASan and UBSan were enabled when fuzzing and no memory bugs were found.

starwing commented 1 year ago

Great job! Pull request is welcome.

SlySven commented 11 months ago

:thinking: Just a thought - and without looking at the code - how does the code in the PR #47 handle Cn category code-points, particularly those that are "Other, not assigned" but which are permissible - I refer to the sub-category of "Non-characters", i.e. those that can have application specific meaning and which are suppose to be preserved in transfer across systems but which never have a visible representation.

alexdowad commented 11 months ago

🤔 Just a thought - and without looking at the code - how does the code in the PR #47 handle Cn category code-points, particularly those that are "Other, not assigned" but which are permissible

@SlySven, thanks for the inquiry. It sounds like the category you are referring to must be the same as the Private Use Areas.

I don't know how my code handles those codepoints without checking, and I don't have time to check; probably it treats each one as a separate grapheme. What I do know is that all the test cases provided by the Unicode Consortium pass. Further, I used a fuzzer to check many millions of random test cases to see if there was any difference in behavior from ICU, and the fuzzer could not find any.

So it's very, very likely that what lua-utf8 does is the same as what ICU does.

SlySven commented 10 months ago

@ SlySven, thanks for the inquiry. It sounds like the category you are referring to must be the same as the Private Use Areas.

Well, I'd expect PUA graphemes to be handled individually - as there is no way to know if they cluster with anything else (a long time ago I did some font glyph designs for some that would actually be paired together - but that was for a very specific application - the TinTin++ MUD {Multi-User Dungeon) client so that it could draw "maps" where each "room" and it's exits were shown by two such glyphs shown side by side - and I patched the application to do the work so clustering was not relevant/required). OTOH "non-characters" are not intended to be seen so clustering is just "not applicable". :grinning:

They do need to be special cased in some other situations though - as I pointed out in another project: https://github.com/ridiculousfish/widecharwidth/issues/10 .