unicode-org / unicodetools

Test consistency of segmentation with canonical equivalence #522

Open · eggrobin opened this issue 1 year ago

eggrobin commented 1 year ago

We promise that our segmentation algorithms are consistent with NFD:

To maintain canonical equivalence, all of the following specifications are defined on text normalized in form NFD, as defined in Unicode Standard Annex #15, “Unicode Normalization Forms” [UAX15]. A boundary exists in text not normalized in form NFD if and only if it would occur at the corresponding position in NFD text. However, the default rules have been written to provide equivalent results for non-NFD text and can be applied directly.

However, we do not test that here (otherwise we would have spotted the Kirat Rai issue in #445).

It should not be too difficult to write an ICU monkey test for that (I am working on a similar one to investigate the ancient LB-GCB inconsistency AIs), but that would not be enough: we only run the ICU tests with new rules and data relatively late in the beta, whereas the Kirat Rai issue was spotted pre-alpha and put encoding-model questions on the table (we kept the encoding as previously decided in that case, but it is not implausible that we could decide differently in another case).

Could we have segmentation monkeys in unicodetools?
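
For illustration, here is a rough sketch of what such a monkey test could look like, using the ICU4J grapheme cluster BreakIterator and Normalizer2. The class name, the random-string generator, and the position-correspondence check are all made up for this sketch; in particular, it only checks positions whose NFD image is unambiguous, and a real test would bias the alphabet towards combining marks and decomposable characters.

```java
import com.ibm.icu.text.BreakIterator;
import com.ibm.icu.text.Normalizer2;
import java.util.Random;
import java.util.TreeSet;

public class SegmentationNfdMonkey {
  private static final Normalizer2 NFD = Normalizer2.getNFDInstance();

  public static void main(String[] args) {
    Random random = new Random(2023);
    BreakIterator graphemes = BreakIterator.getCharacterInstance();
    for (int trial = 0; trial < 100_000; ++trial) {
      String text = randomString(random, 8);
      String nfd = NFD.normalize(text);
      TreeSet<Integer> textBoundaries = boundaries(graphemes, text);
      TreeSet<Integer> nfdBoundaries = boundaries(graphemes, nfd);
      for (int p = 0; p <= text.length(); ++p) {
        // Never test a position inside a surrogate pair.
        if (p < text.length() && Character.isLowSurrogate(text.charAt(p))) {
          continue;
        }
        String prefix = text.substring(0, p);
        String suffix = text.substring(p);
        // Only check positions whose corresponding NFD position is
        // unambiguous, i.e., where normalization does not reorder characters
        // across p.
        if (!(NFD.normalize(prefix) + NFD.normalize(suffix)).equals(nfd)) {
          continue;
        }
        int correspondingPosition = NFD.normalize(prefix).length();
        if (textBoundaries.contains(p)
            != nfdBoundaries.contains(correspondingPosition)) {
          throw new AssertionError(
              String.format(
                  "Boundary mismatch at %d in <%s> vs. %d in its NFD form",
                  p, text, correspondingPosition));
        }
      }
    }
  }

  private static TreeSet<Integer> boundaries(BreakIterator it, String s) {
    TreeSet<Integer> result = new TreeSet<>();
    it.setText(s);
    for (int b = it.first(); b != BreakIterator.DONE; b = it.next()) {
      result.add(b);
    }
    return result;
  }

  private static String randomString(Random random, int codePoints) {
    // A real monkey should bias the alphabet towards combining marks and
    // characters with canonical decompositions; this one merely avoids
    // surrogate code points.
    StringBuilder result = new StringBuilder();
    for (int i = 0; i < codePoints; ++i) {
      int cp;
      do {
        cp = random.nextInt(0x110000);
      } while (0xD800 <= cp && cp <= 0xDFFF);
      result.appendCodePoint(cp);
    }
    return result.toString();
  }
}
```

The same loop could be run with BreakIterator.getWordInstance(), getSentenceInstance(), and getLineInstance() to cover the other segmentations.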

eggrobin commented 1 year ago

(Alternatively: Is this doable as a proper monkey-free invariant test? The LB-GCB inconsistencies are a mess because both algorithms are quite complicated; but normalization is comparatively simple, maybe we can actually prove things.)
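
Not a proof, but perhaps a step in that direction: since the question is whether decomposition can change boundaries, one could exhaustively enumerate every code point whose NFD form differs from itself and compare the segmentation of the composed and decomposed forms in a small set of contexts. Whether a fixed set of contexts suffices is exactly the part that would need an argument, and comparing segment counts is coarser than comparing corresponding positions, so the sketch below (names and context set invented here) is only meant to illustrate the shape of such a test.

```java
import com.ibm.icu.text.BreakIterator;
import com.ibm.icu.text.Normalizer2;

public class NfdDecompositionBoundaryCheck {
  public static void main(String[] args) {
    Normalizer2 nfd = Normalizer2.getNFDInstance();
    BreakIterator graphemes = BreakIterator.getCharacterInstance();
    // An illustrative (and certainly incomplete) set of context strings.
    String[] contexts = {
      "", "A", "\u0301" /* COMBINING ACUTE ACCENT */, "\u200D" /* ZWJ */
    };
    for (int cp = 0; cp <= 0x10FFFF; ++cp) {
      if (0xD800 <= cp && cp <= 0xDFFF) {
        continue;
      }
      String composed = new String(Character.toChars(cp));
      String decomposed = nfd.normalize(composed);
      if (decomposed.equals(composed)) {
        continue;  // Nothing to compare for NFD-inert code points.
      }
      for (String before : contexts) {
        for (String after : contexts) {
          int composedSegments = countSegments(graphemes, before + composed + after);
          int decomposedSegments = countSegments(graphemes, before + decomposed + after);
          if (composedSegments != decomposedSegments) {
            System.out.printf(
                "U+%04X: %d segments composed vs. %d decomposed%n",
                cp, composedSegments, decomposedSegments);
          }
        }
      }
    }
  }

  private static int countSegments(BreakIterator it, String s) {
    it.setText(s);
    int segments = 0;
    it.first();
    while (it.next() != BreakIterator.DONE) {
      ++segments;
    }
    return segments;
  }
}
```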

macchiati commented 1 year ago

Both tests would be good to add to unicodetools... (for the GCB-versus-anything-else consistency, monkey tests would be simplest).
