unicode-rs / unicode-segmentation

Grapheme Cluster and Word boundaries according to UAX#29 rules
https://unicode-rs.github.io/unicode-segmentation
Other
571 stars 56 forks source link

panic when index inside unicode sequence - Questions about scope of this crate #120

Closed KMikeeU closed 1 year ago

KMikeeU commented 1 year ago

I stumbled upon an issue, where calling functions like GraphemeCursor::is_boundary would panic whenever the index is in the middle of a unicode sequence (ie the index is not at a 'char' boundary). For example:

#[test]
fn test_grapheme_cursor_inside_unicode() {
    // 0xf0 0x9f 0x98 0x82
    let s = "😂";
    // Start the cursor at the second byte, which is inside the unicode sequence for the emoji
    let mut c = GraphemeCursor::new(1, s.len(), true);

    assert_eq!(
        c.is_boundary(s, 0),
        Ok(false)
    )
}

This panics at is_boundary:

---- grapheme::test_grapheme_cursor_inside_unicode stdout ----
thread 'grapheme::test_grapheme_cursor_inside_unicode' panicked at 'byte index 1 is not a char boundary; it is inside '😂' (bytes 0..4) of `😂`', src/grapheme.rs:553:22
stack backtrace:
   0: rust_begin_unwind
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/std/src/panicking.rs:575:5
   1: core::panicking::panic_fmt
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/panicking.rs:64:14
   2: core::str::slice_error_fail_rt
   3: core::str::slice_error_fail
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/str/mod.rs:86:9
   4: core::str::traits::<impl core::slice::index::SliceIndex<str> for core::ops::range::RangeFrom<usize>>::index
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/str/traits.rs:370:21
   5: core::str::traits::<impl core::ops::index::Index<I> for str>::index
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/str/traits.rs:65:9
   6: unicode_segmentation::grapheme::GraphemeCursor::is_boundary
             at ./src/grapheme.rs:553:22
   7: unicode_segmentation::grapheme::test_grapheme_cursor_inside_unicode
             at ./src/grapheme.rs:809:9
   8: unicode_segmentation::grapheme::test_grapheme_cursor_inside_unicode::{{closure}}
             at ./src/grapheme.rs:804:42
   9: core::ops::function::FnOnce::call_once
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/ops/function.rs:250:5
  10: core::ops::function::FnOnce::call_once
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/ops/function.rs:250:5

Is this crate supposed to handle such issues? Or are they supposed to be prevented by checking these conditions before calling functions that can panic?

Manishearth commented 1 year ago

Yeah, in the current implementation is panicky here. I think yes, you should check the conditions before calling these functions. We could probably in the long run make them more robus by making them fallible.

In general the cursor APIs have some bugs and need more polish, unfortunately. I haven't had the time to look into that.