unicode-rs / unicode-segmentation

Grapheme Cluster and Word boundaries according to UAX#29 rules
https://unicode-rs.github.io/unicode-segmentation

`UWordBoundIndices` doesn't expose the indices #35

Open wez opened 6 years ago

wez commented 6 years ago

As far as I can tell, UWordBoundIndices is just a wrapper around UWordBounds with an identical interface.

In my use case, I have a line of text and an index into the .chars() of that string (from a mouse double click), and I need to obtain the indices of the start and end of the word that encloses that index.

It seemed to me that UWordBoundIndices is what I'd want here, but I don't see how to use it for this purpose. Is this an oversight, or is there a better way to get the result I'd like?

wez commented 6 years ago

Oh, is it just that the docs at https://unicode-rs.github.io/unicode-segmentation/unicode_segmentation/struct.UWordBoundIndices.html are stale?

tapeinosyne commented 6 years ago

The UWordBoundIndices iterator definitely yields word indices, and the docs don't appear stale. However, you are right in that the current interface isn't suitable for identifying word boundaries from random access.

Graphemes suffered the same issue prior to the introduction of a cursor API in #21, and I suppose that word segmentation could be similarly updated.

wez commented 6 years ago

The problem I had was that the critical portion of the docs on that page:

type Item = (usize, &'a str)

is buried a bit further down in the page (that's just how they render), so I was left to fixate on the as_str() method. Would you mind expanding the doc comment to something like this to make it a little clearer?

External iterator for word boundaries and byte offsets. Yields (usize, &str), the byte offset and string slice for each word.
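
Since the offsets yielded are byte offsets rather than char indices, an index into .chars() (such as one derived from a mouse click) would need converting before it can be compared against them. A minimal sketch using only the standard library; the helper name `char_to_byte_offset` is hypothetical and not part of unicode-segmentation:

```rust
/// Convert a char index (a position in `s.chars()`) into a byte offset in `s`.
/// An index equal to the number of chars maps to `s.len()`;
/// anything beyond that yields `None`.
/// NOTE: hypothetical helper for illustration, not part of any crate.
fn char_to_byte_offset(s: &str, char_idx: usize) -> Option<usize> {
    s.char_indices()
        // Keep only the byte offset at which each char starts.
        .map(|(byte_idx, _)| byte_idx)
        // Allow the one-past-the-end position as well.
        .chain(std::iter::once(s.len()))
        .nth(char_idx)
}
```

For example, in "naïve café" the 'ï' occupies two bytes, so `char_to_byte_offset("naïve café", 6)` gives `Some(7)` rather than `Some(6)`.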

I would love to have an API directed at random access! I have this somewhat clunky solution for the moment:

  for (x, word) in line.split_word_bound_indices() {
      if event.x < x {
          break;
      }
      if event.x <= x + word.len() {
          // this is the matching word
          return;
      }
  }
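
The same linear scan could be wrapped in a small reusable function. The following is only a sketch, under these assumptions: the index is already a byte offset, the pairs arrive in ascending order, and an index sitting exactly on a boundary belongs to the segment that starts there (so the comparison is `<` rather than the `<=` used above). The name `segment_at` is hypothetical.

```rust
/// Find the segment that encloses `byte_idx`, given `(byte offset, slice)`
/// pairs in ascending offset order. Returns the half-open byte range
/// `(start, end)` of that segment, or `None` if the index is out of range.
/// NOTE: hypothetical helper for illustration, not part of any crate.
fn segment_at<'a, I>(segments: I, byte_idx: usize) -> Option<(usize, usize)>
where
    I: IntoIterator<Item = (usize, &'a str)>,
{
    for (start, word) in segments {
        if byte_idx < start {
            // Offsets ascend, so no later segment can contain the index.
            break;
        }
        let end = start + word.len();
        if byte_idx < end {
            return Some((start, end));
        }
    }
    None
}
```

With the crate's UnicodeSegmentation trait in scope, this would presumably be called as `segment_at(line.split_word_bound_indices(), idx)`. It is still O(n) per lookup, so it does not replace a proper random-access API.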

tapeinosyne commented 6 years ago

that critical portion of the docs […] is buried a bit further down in the page (that's just how they render), so I was left to fixate on the as_str() method. Would you mind expanding the doc comment to something like this to make it a little clearer?

Yep, it can be pretty easy to miss things. Trait impls often look a bit lost in the rendered page, and the convention established by the standard library is that the behavior of iterators is documented on their builder method rather than the struct itself. I'll add a comment.

I would love to have an API directed at random access! I have this somewhat clunky solution for the moment:

I'd be happy to work on it, but before that I wouldn't mind seeing some consolidation between the unicode-rs organization and the recent, seemingly more active unic. That's a conversation that should be started, although not here. @Manishearth, could I maybe ping you on IRC to get a sense of where we stand, or would you rather I opened an issue/forum thread directly?