latex3 / lua-uni-algos

Unicode algorithms for use by LuaLaTeX packages

Feature request: Unicode word boundaries #2

Closed zepinglee closed 1 year ago

zepinglee commented 1 year ago

Hi! I'm working on citeproc-lua, a Unicode-based engine that needs to convert titles between sentence case and title case. At the moment I use a naive word splitter for this, but it can't handle complex cases like “Foo—‘bar’ baz”. It would be a great help if this library provided the Unicode word boundary rules from section 4 of UAX #29.
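To illustrate the limitation, here is a minimal sketch of a whitespace-based splitter. It is a hypothetical stand-in, not the actual citeproc-lua code, and the name `naive_split` is made up for this example. It keeps the dash and the quotation marks glued to the neighbouring words, whereas UAX #29 places boundaries around them.

```lua
-- Hypothetical naive splitter: every run of non-space bytes counts as a word.
local function naive_split(s)
  local words = {}
  for word in s:gmatch("%S+") do
    words[#words + 1] = word
  end
  return words
end

-- The dash and both quotation marks stay attached to "Foo" and "bar",
-- so only two "words" come back.
local parts = naive_split("Foo—‘bar’ baz")
for i, word in ipairs(parts) do
  print(i, word)
end
```

A title-casing routine built on this would see “Foo—‘bar’” as a single word and would have no way to capitalise “bar” on its own.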

Related projects in other languages:

zauguin commented 1 year ago

I added the underlying segmentation logic in 23660d9e18a3dcd1e20a6206723b873d68951cdc, but I'm not quite sure what the best interface for it would be. The underlying issue is that word boundaries require arbitrary look-ahead, so the grapheme boundary interface, which only looks one character ahead, doesn't work. The right choice depends a bit on what kind of sequence the codepoints come from; an iterator adapter is probably the most generic solution.
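One possible shape for such an iterator adapter is sketched below. This is only an illustration under assumptions: the names `with_lookahead`, `peek`, `next`, and `codepoint_source` are hypothetical and not part of lua-uni-algos, and the string source relies on the `utf8` library from Lua 5.3 or later. The adapter wraps any function that returns one codepoint per call and buffers as many codepoints as the caller asks to inspect, which is what rules such as WB6 and WB7 need when deciding whether mid-word punctuation is followed by another letter.

```lua
-- Hypothetical adapter: wraps a codepoint source (a function returning one
-- codepoint per call and nil at the end) and allows peeking arbitrarily far.
local function with_lookahead(source)
  local buffer = {}        -- codepoints pulled from the source, not yet consumed
  local head = 1           -- index of the next unconsumed buffer entry
  local tail = 0           -- index of the last filled buffer entry
  local exhausted = false  -- true once the source has returned nil

  -- Pull from the source until buffer[upto] is filled or the input ends.
  local function fill(upto)
    while tail < upto and not exhausted do
      local cp = source()
      if cp == nil then
        exhausted = true
      else
        tail = tail + 1
        buffer[tail] = cp
      end
    end
  end

  local adapter = {}

  -- Return the n-th upcoming codepoint (n = 1 is the next one), or nil
  -- past the end of input. Nothing is consumed.
  function adapter.peek(n)
    n = n or 1
    fill(head + n - 1)
    return buffer[head + n - 1]
  end

  -- Consume and return the next codepoint, or nil at the end of input.
  function adapter.next()
    fill(head)
    local cp = buffer[head]
    if cp ~= nil then
      buffer[head] = nil
      head = head + 1
    end
    return cp
  end

  return adapter
end

-- Example source: the codepoints of a UTF-8 string, one per call.
local function codepoint_source(s)
  local pos = 1
  return function()
    if pos > #s then return nil end
    local cp = utf8.codepoint(s, pos)
    pos = utf8.offset(s, 2, pos)  -- byte position of the following character
    return cp
  end
end

local cps = with_lookahead(codepoint_source("a.b c"))
print(cps.peek(3))  -- inspects the third upcoming codepoint without consuming
print(cps.next())   -- consumes the first codepoint
```

Because the adapter only requires a function that yields codepoints, the same wrapper would work for strings, node lists, or any other sequence, at the cost of keeping a small buffer.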

@zepinglee Do you only need this for strings or for something else?

zepinglee commented 1 year ago

Thanks! Word boundary segmentation is only needed for strings. I don't have much experience with Unicode-related methods, so I can't judge which interface is best. Both UBreakIterator from ICU4C and split_word_bounds() are fine with me.

zauguin commented 1 year ago

I have now added an interface based on the existing grapheme cluster interface. See the documentation for details, but basically you can do:

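-- Initialise kpathsea so that require can find the module when run with texlua.
kpse.set_program_name'texlua'
local words = require'lua-uni-words'
-- Each iteration yields the end offset, the start offset, and the text of one segment.
for end_offset, start_offset, substring in words.word_boundaries"This will be split into UAX #29 word segments!" do
  print(substring)
end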

zepinglee commented 1 year ago

Great! It works fine for me.