Open cbarrick opened 7 years ago
I will work on this.
@cryze can you explain your CharOrBoundary idea?
The problem is that unicode-segmentation is built around borrowing str slices from some owned data, like a String. That doesn't work well with char iterators, as those aren't stored anywhere. So to support this at all, you would need to introduce a separate streaming API that provides Iterator<Item = CharOrBoundary> iterators, with CharOrBoundary being:
pub enum CharOrBoundary {
    Char(char),
    Boundary,
}
So you can then iterate over the characters and it'll tell you whether you hit a boundary or not. You could then have other helper functions that help you collect char segments between the boundaries into buffers.
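To make the idea concrete, here is a rough sketch of what such a streaming adapter and a buffer-collecting helper could look like. Everything except the enum is hypothetical: WithBoundaries, collect_segments and toy_boundary are names invented for illustration, and the boundary rule is a deliberately tiny stand-in (never break before a combining mark in U+0300..=U+036F), not the real UAX #29 grapheme rules.

```rust
/// The enum proposed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CharOrBoundary {
    Char(char),
    Boundary,
}

/// Hypothetical adapter: wraps any char iterator and emits a `Boundary`
/// between two chars whenever `is_boundary(prev, next)` returns true.
pub struct WithBoundaries<I, F> {
    inner: I,
    is_boundary: F,
    prev: Option<char>,    // last char handed out
    pending: Option<char>, // char held back while a Boundary is emitted
}

impl<I, F> WithBoundaries<I, F>
where
    I: Iterator<Item = char>,
    F: FnMut(char, char) -> bool,
{
    pub fn new(inner: I, is_boundary: F) -> Self {
        Self { inner, is_boundary, prev: None, pending: None }
    }
}

impl<I, F> Iterator for WithBoundaries<I, F>
where
    I: Iterator<Item = char>,
    F: FnMut(char, char) -> bool,
{
    type Item = CharOrBoundary;

    fn next(&mut self) -> Option<CharOrBoundary> {
        // Emit the char that was held back behind a Boundary, if any.
        if let Some(c) = self.pending.take() {
            self.prev = Some(c);
            return Some(CharOrBoundary::Char(c));
        }
        let c = self.inner.next()?;
        if let Some(p) = self.prev {
            if (self.is_boundary)(p, c) {
                self.pending = Some(c);
                return Some(CharOrBoundary::Boundary);
            }
        }
        self.prev = Some(c);
        Some(CharOrBoundary::Char(c))
    }
}

/// Toy rule (an assumption, NOT UAX #29): break everywhere except
/// before a combining diacritical mark.
pub fn toy_boundary(_prev: char, next: char) -> bool {
    !('\u{0300}'..='\u{036F}').contains(&next)
}

/// Hypothetical helper: collects the chars between boundaries into buffers.
pub fn collect_segments(items: impl Iterator<Item = CharOrBoundary>) -> Vec<String> {
    let mut out = Vec::new();
    let mut buf = String::new();
    for item in items {
        match item {
            CharOrBoundary::Char(c) => buf.push(c),
            CharOrBoundary::Boundary => out.push(std::mem::take(&mut buf)),
        }
    }
    if !buf.is_empty() {
        out.push(buf);
    }
    out
}
```

With this sketch, WithBoundaries::new("e\u{0301}a".chars(), toy_boundary) would yield Char('e'), Char('\u{0301}'), Boundary, Char('a'). A real implementation would need the full grapheme state machine in place of the two-char predicate, since some rules look further back than one char.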
+1 to this idea, would also be useful when working with non-UTF8 strings in legacy APIs.
To clarify, I think that providing the full UnicodeSegmentation on top of Iterator<Item=char> is hopeless because the existing UnicodeSegmentation API heavily assumes access to an underlying &str all over the place, and in the case of an Iterator<Item=char> there may not be one.
What we could provide, however, is something that turns an Iterator<Item=char> into an Iterator<Item=Iterator<Item=char>> of sorts that represents graphemes or words (may need to be a streaming iterator if we don't want to impose a Clone bound on the underlying Iterator, or if we want to avoid parsing the text twice).
Since GraphemeCursor::next_boundary() already works on top of a char iterator, it might be possible to rewrite it in terms of this API in order to avoid code duplication. For words, it's less clear how to proceed, as the implementation makes even more UTF-8 string assumptions, such as manipulating string indices under the hood.
I looked into it further and tried to adapt parts of the GraphemeCursor implementation to streaming use cases. From this experiment, it seems to me that it is impossible to provide both of the following API properties at the same time while keeping the implementation sane:
The reason is that in the current API, extra input is "patched together" with the existing one using UTF-8 indices as a unifying abstraction, and AFAIK there is no nice equivalent in an Iterator<Item=char> world. An incomplete replacement would be some ability to attach an extra iterator at the end of the existing one using Iterator::chain(), but that would not address the full generality of the current API, which can also work with overlapping chunks of UTF-8 (though whether one would ever want to use them is up for debate).
So unless I'm missing something obvious, it seems to me that the least bad option is to stick with the existing code, collect the iterator of chars into a (possibly truncated) UTF-8 string and use unicode_segmentation on that.
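For reference, the collecting step could be as small as the sketch below. collect_prefix is a hypothetical helper name, and the char limit is an assumption about how one might bound memory use. It only uses the standard library.

```rust
/// Hypothetical helper: drains at most `max_chars` chars from `chars`
/// into an owned String, leaving the rest of the iterator untouched.
pub fn collect_prefix<I>(chars: &mut I, max_chars: usize) -> String
where
    I: Iterator<Item = char>,
{
    // `by_ref` borrows the iterator so the caller can keep pulling from it
    // after the prefix has been taken.
    chars.by_ref().take(max_chars).collect()
}
```

The resulting String can then be handed to the existing API, e.g. UnicodeSegmentation::graphemes(buf.as_str(), true). One caveat with truncation: the cut may land in the middle of a grapheme or word, so the last segment of a truncated buffer should not be trusted unless the iterator was exhausted.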
It would be nice to segment character iterators, especially for interoperability with the unicode-normalization crate. This could provide a solution to #7 when/if io::Chars stabilizes. In particular, I'd like to write a tokenizer like this:

One issue I see is that most of the public structs provide an as_str method that returns "the underlying data (the part yet to be iterated) as a slice of the original string". This obviously won't work with streaming types.