unicode-rs / unicode-segmentation

Grapheme Cluster and Word boundaries according to UAX#29 rules
https://unicode-rs.github.io/unicode-segmentation

`GraphemeCursor::next_boundary()` returns incorrect boundary #115

Open noib3 opened 1 year ago

noib3 commented 1 year ago

The grapheme boundaries of "🇷🇸🇮🇴" should be at byte offsets 8 and 16, but when I feed `GraphemeCursor` the individual regional indicator symbol (RIS) code points one chunk at a time, I get 8 and 12. Am I using the API incorrectly, or is this a bug?
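For reference, the expected offsets follow from the UTF-8 encoding: each regional indicator symbol takes 4 bytes, so each two-symbol flag takes 8. A quick standard-library check (not part of the original report; `char_starts` is a hypothetical helper name used only for illustration):

```rust
/// Byte offset at which each code point of `s` starts in its UTF-8 encoding.
fn char_starts(s: &str) -> Vec<usize> {
    s.char_indices().map(|(i, _)| i).collect()
}

fn main() {
    let s = "🇷🇸🇮🇴";
    // Four code points of 4 bytes each, so the two flags end at 8 and 16.
    println!("{:?} len={}", char_starts(s), s.len()); // prints "[0, 4, 8, 12] len=16"
}
```

The reproduction from the original report follows.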

use unicode_segmentation::{GraphemeCursor, GraphemeIncomplete};

fn main() {
    let s = "🇷🇸🇮🇴";

    let mut cursor = GraphemeCursor::new(0, s.len(), true);

    // 🇷🇸

    match cursor.next_boundary("🇷", 0) {
        Err(GraphemeIncomplete::NextChunk) => {}
        _ => unreachable!(),
    }

    match cursor.next_boundary("🇸", 4) {
        Err(GraphemeIncomplete::PreContext(4)) => {
            cursor.provide_context("🇷", 0);
        }
        _ => unreachable!(),
    }

    match cursor.next_boundary("🇸", 4) {
        Err(GraphemeIncomplete::NextChunk) => {}
        _ => unreachable!(),
    }

    match cursor.next_boundary("🇮", 8) {
        Err(GraphemeIncomplete::PreContext(8)) => {
            cursor.provide_context("🇸", 4);
        }
        _ => unreachable!(),
    }

    match cursor.next_boundary("🇮", 8) {
        Err(GraphemeIncomplete::PreContext(4)) => {
            cursor.provide_context("🇷", 0);
        }
        _ => unreachable!(),
    }

    match cursor.next_boundary("🇮", 8) {
        Ok(Some(8)) => {}
        _ => unreachable!(),
    }

    // 🇮🇴

    match cursor.next_boundary("🇮", 8) {
        Err(GraphemeIncomplete::NextChunk) => {}
        _ => unreachable!(),
    }

    match cursor.next_boundary("🇴", 12) {
        Err(GraphemeIncomplete::PreContext(12)) => {
            cursor.provide_context("🇮", 8);
        }
        _ => unreachable!(),
    }

    match cursor.next_boundary("🇴", 12) {
        Err(GraphemeIncomplete::PreContext(8)) => {
            cursor.provide_context("🇸", 4);
        }
        _ => unreachable!(),
    }

    match cursor.next_boundary("🇴", 12) {
        Err(GraphemeIncomplete::PreContext(4)) => {
            cursor.provide_context("🇷", 0);
        }
        _ => unreachable!(),
    }

    match cursor.next_boundary("🇴", 12) {
        Ok(Some(16)) => {}
        Ok(Some(12)) => panic!("this should be 16"),
        _ => unreachable!(),
    }
}

Manishearth commented 1 year ago

Seems like the regional indicator state isn't getting properly reset by the cursor.
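For context, the "regional indicator state" amounts to the parity of the current run of regional indicators: UAX #29 rules GB12 and GB13 forbid a break between two regional indicators only when an odd number of them precede the break point. The sketch below illustrates just that pairing rule using the standard library. It is not the crate's implementation, it treats every other character as its own cluster, and `is_regional_indicator` and `ri_boundaries` are hypothetical names.

```rust
/// True if `c` is a regional indicator symbol (U+1F1E6..=U+1F1FF).
fn is_regional_indicator(c: char) -> bool {
    ('\u{1F1E6}'..='\u{1F1FF}').contains(&c)
}

/// Byte offsets at which a cluster ends, applying ONLY the regional
/// indicator pairing rules (GB12/GB13). Assumption for this sketch:
/// every non-RI character forms a cluster by itself.
fn ri_boundaries(s: &str) -> Vec<usize> {
    let mut boundaries = Vec::new();
    let mut ri_run = 0usize; // regional indicators seen in the current run
    let mut chars = s.char_indices().peekable();
    while let Some((i, c)) = chars.next() {
        let end = i + c.len_utf8();
        // The run count resets whenever a non-RI character is seen.
        ri_run = if is_regional_indicator(c) { ri_run + 1 } else { 0 };
        let next_is_ri = chars
            .peek()
            .map_or(false, |&(_, n)| is_regional_indicator(n));
        // An odd count means the current RI opens a pair, so it must
        // stay attached to the RI that follows it.
        let joins_next = ri_run % 2 == 1 && next_is_ri;
        if !joins_next {
            boundaries.push(end);
        }
    }
    boundaries
}

fn main() {
    println!("{:?}", ri_boundaries("🇷🇸🇮🇴")); // prints "[8, 16]"
}
```

A boundary at 12 is what results if the third symbol is not counted as opening a new pair, which is consistent with the count not being tracked correctly across chunks.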

Also I don't think the cursor should have to ask for precontext if you've been feeding it stuff from the beginning.

I don't fully understand how the cursor works and don't have time to dig into it right now. @raphlinus, would you be able to take a look?

Manishearth commented 1 year ago

Potentially related: https://github.com/unicode-rs/unicode-segmentation/issues/118