roc-lang / unicode

Universal Permissive License v1.0

Grapheme text segmentation and test suite #3

Closed lukewilliamboswell closed 8 months ago

lukewilliamboswell commented 10 months ago

This PR;

NOTE: A full implementation of Extended Grapheme Clusters requires rules GB9a, GB9b, and GB9c, which are left for a future PR.
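For context on what the deferred rules involve, here is a minimal sketch in Python (not Roc, and not the code in this PR) of a pairwise boundary check covering GB3 to GB5, GB9, GB9a, and GB9b from UAX #29. The function name `is_break` and the use of plain strings for Grapheme_Cluster_Break property values are assumptions made for illustration. GB9c, the Hangul rules, and the emoji and regional-indicator rules need more than two adjacent code points or extra properties, so they are omitted here.

```python
# Illustrative sketch only: inputs are Grapheme_Cluster_Break property
# values (as named in UAX #29) for two adjacent code points.
CONTROLS = {"CR", "LF", "Control"}


def is_break(prev: str, nxt: str) -> bool:
    """Return True if a grapheme boundary falls between prev and nxt."""
    if prev == "CR" and nxt == "LF":
        return False  # GB3: CR x LF
    if prev in CONTROLS or nxt in CONTROLS:
        return True  # GB4, GB5: always break around controls
    if nxt in {"Extend", "ZWJ"}:
        return False  # GB9: do not break before Extend or ZWJ
    if nxt == "SpacingMark":
        return False  # GB9a: do not break before SpacingMark
    if prev == "Prepend":
        return False  # GB9b: do not break after Prepend
    return True  # GB999: otherwise, break everywhere
```

Because the rules are checked in order, a Prepend character followed by a control still produces a break: GB5 is tested before GB9b.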

Run Generation Scripts

To regenerate the generated files, run bash rebuild.sh

Tests

To run the Grapheme test suite, use roc test package/GraphemeTest.roc

[Screenshot 2023-12-17 at 20 33 25]

Examples

I tried to include an additional example that used Grapheme.split, but significant compiler bugs prevented me from including it with this PR.

Here is a demo from the tests showing the function in use.

[Screenshot 2023-12-17 at 20 34 01]

rtfeldman commented 10 months ago

@lukewilliamboswell Just checking - should I hold off on review until the tests are passing? (I saw in the description you mentioned the TODOs, but I wanted to check!)

lukewilliamboswell commented 10 months ago

Thank you for clarifying. I think those changes are better suited to another PR. I suspect they will be a challenge: at the very least I need to learn a lot more about emoji first, and we may need to change the approach or algorithm. If you have feedback on these changes, it would be most appreciated, thank you.

lukewilliamboswell commented 9 months ago

Update on this PR: I've rewritten the script that generates the test suite, currently called GraphemeTestGen2.roc. I can now include or exclude tests based on the rules (or capabilities) they exercise. This is a significant improvement, because I can see where the gaps in the implementation are and progressively improve support for the text segmentation rules.
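As a rough illustration of rule-based filtering, the sketch below uses Python rather than Roc and is not the code in GraphemeTestGen2.roc. It assumes the layout of the Unicode GraphemeBreakTest.txt data file, where each line lists code points separated by ÷ (break) or × (no break) and the trailing comment tags each position with a rule number such as [9.1]. The helper names `parse_test_line` and `filter_tests`, and the reading of [9.1], [9.2], and [9.3] as GB9a, GB9b, and GB9c, are assumptions.

```python
import re

# Matches rule tags such as [0.2], [9.1], or [999.0] in the line comment.
RULE_RE = re.compile(r"\[(\d+\.\d+)\]")


def parse_test_line(line: str):
    """Split one test line into (code_points, boundaries, rules)."""
    data, _, comment = line.partition("#")
    tokens = data.split()
    # Tokens alternate: marker, code point, marker, ..., marker.
    boundaries = [tok == "÷" for tok in tokens[0::2]]
    code_points = [int(tok, 16) for tok in tokens[1::2]]
    rules = set(RULE_RE.findall(comment))
    return code_points, boundaries, rules


def filter_tests(lines, exclude_rules):
    """Keep only test lines that exercise none of the excluded rules."""
    kept = []
    for line in lines:
        # Skip blank lines and whole-line comments.
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        _, _, rules = parse_test_line(line)
        if rules.isdisjoint(exclude_rules):
            kept.append(line)
    return kept
```

Calling `filter_tests(lines, {"9.1", "9.2", "9.3"})` would then drop every case that depends on the rules deferred in this PR, under the tag mapping assumed above.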

I've also started on a new implementation of the text segmentation algorithm, currently called Grapheme2.roc.