unicode-rs / unicode-segmentation

Grapheme Cluster and Word boundaries according to UAX#29 rules
https://unicode-rs.github.io/unicode-segmentation

Implement UnicodeSegmentation for Iterator<Item = char> #28

Open cbarrick opened 7 years ago

cbarrick commented 7 years ago

It would be nice to segment character iterators, especially for interoperability with the unicode-normalization crate. This could provide a solution to #7 when/if io::Chars stabilizes. In particular, I'd like to write a tokenizer like this:

let input = my_input(); // some reader implementing BufRead
let tokens = input.chars().nfkc().split_word_bounds();

One issue I see is that most of the public structs provide an as_str method that returns "the underlying data (the part yet to be iterated) as a slice of the original string". This obviously won't work with streaming types.
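For illustration, here is that as_str method in action on the Graphemes iterator (a small sketch; the sample string is arbitrary), which shows why it presupposes that the whole text is already in memory:

use unicode_segmentation::UnicodeSegmentation;

fn main() {
    // Three multi-char grapheme clusters built from combining marks.
    let text = "a\u{310}e\u{301}o\u{308}\u{332}";
    let mut graphemes = text.graphemes(true);
    let first = graphemes.next();
    println!("first grapheme: {:?}", first);
    // as_str hands back the not-yet-iterated remainder as a slice of the
    // original string, which only exists because the full text is in memory.
    println!("rest: {}", graphemes.as_str());
}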

ghost commented 7 years ago

I will work on this.

ghost commented 7 years ago

@cryze can you explain your CharOrBoundary idea?

CryZe commented 7 years ago

The problem is that unicode-segmentation is built around borrowing str slices from some owned data, like a String. That doesn't work well with char iterators, because their contents aren't stored anywhere. So to support this at all, you would need to introduce a separate streaming API that provides Iterator<Item = CharOrBoundary> iterators, with CharOrBoundary being:

pub enum CharOrBoundary {
    Char(char),
    Boundary,
}

You can then iterate over the characters, and the stream tells you whenever you hit a boundary. You could also provide helper functions that collect the chars between boundaries into buffers, roughly like the sketch below.
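A hypothetical segment-collecting helper on top of the CharOrBoundary enum above (not an existing API in this crate) might look like this:

fn collect_segments(stream: impl Iterator<Item = CharOrBoundary>) -> Vec<String> {
    let mut segments = Vec::new();
    let mut current = String::new();
    for item in stream {
        match item {
            // Accumulate chars into the segment currently being built.
            CharOrBoundary::Char(c) => current.push(c),
            // A boundary closes the current segment, if it is non-empty.
            CharOrBoundary::Boundary => {
                if !current.is_empty() {
                    segments.push(std::mem::take(&mut current));
                }
            }
        }
    }
    // Flush a trailing segment that was not followed by an explicit boundary.
    if !current.is_empty() {
        segments.push(current);
    }
    segments
}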

HadrienG2 commented 5 years ago

+1 to this idea; it would also be useful when working with non-UTF-8 strings in legacy APIs.

HadrienG2 commented 5 years ago

To clarify, I think that providing the full UnicodeSegmentation API on top of Iterator<Item=char> is hopeless, because the existing API heavily assumes access to an underlying &str all over the place, and with an Iterator<Item=char> there may not be one.

What we could provide, however, is something that turns an Iterator<Item=char> into an Iterator<Item=Iterator<Item=char>> of sorts that represents graphemes or words (may need to be a streaming iterator if we don't want to impose a Clone bound on the underlying Iterator, or if we want to avoid parsing the text twice).
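As a rough sketch of the shape such an adapter could take (the names are illustrative only, not an existing API; buffering each cluster into a Vec sidesteps the streaming-iterator question at the cost of one allocation per grapheme):

struct StreamingGraphemes<I: Iterator<Item = char>> {
    input: I,
    // UAX #29 boundary-detection state would live here.
}

impl<I: Iterator<Item = char>> Iterator for StreamingGraphemes<I> {
    type Item = Vec<char>;

    fn next(&mut self) -> Option<Vec<char>> {
        // Sketch only: pull chars from self.input, run the grapheme
        // boundary rules, and return one cluster at a time.
        unimplemented!()
    }
}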

Since GraphemeCursor::next_boundary() already works on top of a char iterator, it might be possible to rewrite it in terms of this API in order to avoid code duplication. For words, it's less clear how to proceed, as the implementation makes even more UTF-8 string assumptions, such as manipulating string indices under the hood.

HadrienG2 commented 5 years ago

I looked into it further and tried to adapt parts of the GraphemeCursor implementation to streaming use cases. From this experiment, it seems to me that it is impossible to provide both of the following API properties at the same time while keeping the implementation sane:

  1. Ability to work with an incomplete view of the input string and add more of it as needed.
  2. Ability to work with streams of char.

The reason is that in the current API, extra input is "patched together" with the existing input using UTF-8 byte indices as a unifying abstraction, and AFAIK there is no nice equivalent in an Iterator<Item=char> world. An incomplete replacement would be the ability to attach an extra iterator at the end of the existing one using Iterator::chain(), but that would not cover the full generality of the current API, which can also work with overlapping chunks of UTF-8 (though whether one would ever want to use them is up for debate).
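For reference, this is roughly how the existing chunked API stitches input together through byte indices (a minimal GraphemeCursor sketch; a full consumer would also have to handle the other GraphemeIncomplete variants, such as PreContext, which are omitted here):

use unicode_segmentation::{GraphemeCursor, GraphemeIncomplete};

fn main() {
    // Conceptually one text, fed to the cursor in two UTF-8 chunks.
    let chunk1 = "a\u{301}"; // 'a' followed by a combining acute accent
    let chunk2 = "bc";
    let total_len = chunk1.len() + chunk2.len();

    let mut cursor = GraphemeCursor::new(0, total_len, true);
    match cursor.next_boundary(chunk1, 0) {
        Ok(Some(offset)) => println!("boundary at byte {}", offset),
        // At the end of chunk1 the cursor cannot tell whether the next
        // char extends the cluster, so it asks for more input. The new
        // chunk is identified by its byte offset within the overall text.
        Err(GraphemeIncomplete::NextChunk) => {
            let next = cursor.next_boundary(chunk2, chunk1.len());
            println!("boundary after more input: {:?}", next);
        }
        other => println!("unexpected result: {:?}", other),
    }
}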

So unless I'm missing something obvious, it seems to me that the least bad option is to stick with the existing code: collect the iterator of chars into a (possibly truncated) UTF-8 string, and use unicode_segmentation on that.
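A minimal sketch of that workaround, assuming the chars originally came from some streaming source:

use unicode_segmentation::UnicodeSegmentation;

fn main() {
    // Stand-in for an Iterator<Item = char> coming from a decoder or a
    // unicode-normalization adapter.
    let chars = "he\u{301}llo world".chars();

    // Buffer the stream into a String, then segment it as usual.
    let buffer: String = chars.collect();
    let words: Vec<&str> = buffer.split_word_bounds().collect();
    println!("{:?}", words);
}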