unicode-rs / unicode-segmentation

Grapheme Cluster and Word boundaries according to UAX#29 rules
https://unicode-rs.github.io/unicode-segmentation
Other
565 stars 57 forks source link

Feature for implementation on str, and specific Grapheme type? #105

Closed th317erd closed 2 years ago

th317erd commented 2 years ago

Hello! I am fairly new to Rust, and in the process of learning am creating a tokenizer crate (for string parsing). In my work I needed to match on full graphemes, and so stumbled upon your library, and started trying to use it. I ran into a few issues when trying to use your library (may be my lack of understanding), and these issues I ran into eventually all just turned into questions. So, here is a rough overview of my journey:

1) In the tokenizer crate I am slowly working on as I learn Rust, I need to be able to know what type of data I am working with. Knowing I have a string slice, a char or char slice, or a grapheme is important. In order to accomplish this, I aliased the &str type as follows: pub type Grapheme<'a> = &'a str;. I then modified your library to return Grapheme everywhere it used to return a &str. 2) This is when I started having issues... because I simply wanted to modify/add to the interface of your crate to wrap graphemes (for example) to return an iterator over Grapheme instead of &str. The issue I ran into when I did this is that this change I made required all calling code to use turbo-fish syntax... not ideal... 3) This got me thinking "Wouldn't it be nice if this crate had a feature where I could opt-out of the impl UnicodeSegmentation for str?

So, to sum up, my questions are as follows (remember I am a newb... please inform/teach me if this doesn't make sense):

1) Wouldn't it make sense to have a specific Grapheme type instead of &str, that can then implement its own traits and functionality for Grapheme? 2) Could we please add a feature to the crate to prevent implementing the UnicodeSegmentation trait on str? This way the library could be used without polluting str. The feature config could be something like:

[features]
default = [ "impl-str" ]
impl-str = []

Such that a default-features = false could then be used to disable the interface on str?

If any of this is a good idea, I would be happy to create a PR. Please let me know your thoughts. Thank you!

th317erd commented 2 years ago

Oh, also, I forgot a question. I also added the following methods myself, but they seem like they should just be part of this crate. Thoughts? Would this be a good addition to the crate?

pub type GraphemeRange = Range<usize>;

...

#[must_use]
  fn grapheme_range_to_byte_range(&self, range: GraphemeRange) -> Range<usize> {
    if range.end <= range.start {
      return Range { start: 0, end: 0 };
    }

    let mut indices = unicode_segmentation::grapheme_indices(self, true);

    let obtain_index = |(index, _char)| index;
    let str_len = self.len();

    let start_offset = indices.nth(range.start).map_or(str_len, &obtain_index);
    let end_offset = indices
      .nth(range.end - range.start - 1)
      .map_or(str_len, &obtain_index);

    Range {
      start: start_offset,
      end: end_offset,
    }
  }

  #[must_use]
  fn byte_index_to_grapheme_index(&self, index: usize) -> Option<usize> {
    if index >= self.len() {
      return None;
    }

    let indices = unicode_segmentation::grapheme_indices(self, true);
    let mut grapheme_index: usize = 0;

    for iter in indices.skip(1) {
      if index < iter.0 {
        return Some(grapheme_index);
      }

      grapheme_index += 1;
    }

    // We already checked that index is within the boundary of the string,
    // so if we got here we know that this is the last grapheme index
    Some(grapheme_index)
  }

  // Return size of string in graphemes
  #[must_use]
  fn glen(&self) -> usize {
    unicode_segmentation::graphemes(self, true).count()
  }

...

Again, I figured GraphemeRange should be its own type, to differentiate (for example) from a byte range. Again, this also allows implementing traits on this type. The byte_index_to_grapheme_index also made sense to me, to be able to find a grapheme from a byte index.

Thoughts?

This would enable user to do things like (using the nightly Pattern API):

let grapheme_index = unicode_segmentation::byte_index_to_grapheme_index(str.find("pattern").unwrap());
Manishearth commented 2 years ago

I don't think such a feature would be that generally useful: I'd recommend using the iterator with a .map() wrapping it with whatever type you need. I think the issues you're having are because you're trying to wrap the crate more invasively than necessary.

Manishearth commented 2 years ago

As for grapheme_range_to_byte_range, I don't really like methods that do a nonobvious expensive iteration, that would be necessary here. The typical tips for working with strings are to iterate and maintain appropriate state during iteration, not jump around and recalculate each time.

th317erd commented 2 years ago

I feel like your explanation here is very lacking. If I wanted to for example, write a "substring" method that would work off graphemes instead of chars or bytes, then grapheme_range_to_byte_range makes absolute sense. How else would someone write a string.get(???) without such a method?

The byte_index_to_grapheme_index also makes a lot of sense to me, as one can then use the Pattern methods for a string slice.

Third, you never even mentioned glen, which is also quite useful.

Could you please take a moment to explain how and why you shot this down so quickly? As I said, I am trying to learn, and trying to help here. So a "no, I don't think so, because I don't think so" isn't very helpful to me. What is your reasoning behind your not liking any of this?

You also never commented on the feature to stop pollution of the str type... how would this not be a useful feature for the library?

Manishearth commented 2 years ago

Gonna say off the bat that I do not appreciate your tone here: I am a volunteer maintainer and I am not obliged to spend time walking you through things, especially when your response seems so aggressive.

Nevertheless, I will try to address your questions.

If I wanted to for example, write a "substring" method that would work off graphemes instead of chars or bytes, then grapheme_range_to_byte_range makes absolute sense

There's a reason why there's not a similar code point indexing method on strings: the char_indices() and grapheme_indices() APIs are both designed to encourage people to handle the iteration themselves and store any information they want the way they need, which is the most efficient way of doing things. grapheme_range_to_byte_range is an expensive method and can easily lead to accidentally quadratic code. These APIs are designed so that it's up to the user of this crate to figure out a bookkeeping method for indices that works for their use case.

This is true for the other method too: both are somewhat costly. Avoiding this kind of thing is somewhat standard in Rust API design, you don't want costly iteration methods to look like they are just indexing.

Third, you never even mentioned glen, which is also quite useful.

You asked me like four questions at once, I missed one. Chill out.

.glen() is useful but I'll again make a comparison with chars() and char_indices() and point out that Rust does not have .chars_len() by default because it wants you to acknowledge the cost. This is another instance of the same Rust API design standard.

As I said, I am trying to learn, and trying to help here. So a "no, I don't think so, because I don't think so" isn't very helpful to me.

You are welcome to ask me to teach you, but I have no obligation to do so, it depends on how much time I have.

You also never commented on the feature to stop pollution of the str type... how would this not be a useful feature for the library?

I have literally never seen a Rust crate do this -- features are used for disabling impls of foreign traits to avoid the dependency, and for actual features, but I've never seen Rust crates use them to turn off native impls. It's a trait, a major benefit of traits is that it only pollutes the type if you import it.

th317erd commented 2 years ago

I am not sure why you feel I am being "so aggressive", so I am sorry that you feel that way. I was just confused, and your response didn't clarify your position. I am as chill as an ice cube--I am hopeful you knowing so will help you feel less attacked (???).

I appreciate you taking the extra time to clarify all these points.

If you are willing, and have time, could you please follow up with answers to these questions?

1) Grapheme handling is always expensive, is it not? Isn't this implied in the process itself? 2) Is there another way that similar methods/needs could be implemented? I am a little confused on this. I understand what you are telling me about expensive methods, but I am confused about how one would actually use your library without 1) such methods, or 2) continually writing wet code that uses iterators everywhere (and I am not sure how this helps with processing cost?). From my perspective, it sounds like you are basically saying "use graphemes all by themselves, or don't", but graphemes are a standard part of unicode strings, and for something like a parsing library, graphemes may need to be used exclusively, so how could this expense be bypassed, or implemented differently? I am confused with what you are saying, because from my perspective it seems that you need some way to convert from grapheme indecies to byte indecies if you want any hope of (usefully, and dryly) interfacing with the standard str primitive methods in Rust. I am sorry if I myself am sounding thick headed... it may very well be my lack of experience with Rust, it is just from my perspective this all seems like "You can use our library, but only in one limited way. To use it any other way, you will need to implement common methods on your own traits, or in public methods". It seems to me that these methods fit right into your library, and that any user of such methods would inherently know that parsing in such a way is expensive. 🤷🏻‍♂️ 3) Does Rust have a standard convention (naming or otherwise) for "this is expensive, use at your own risk"? 4) Coming from other languages, I am quite surprised with primitive pollution... is this standard practice in Rust? Now I understand this isn't exactly "pollution" in the sense that a trait is being used, and nothing itself about the standard str primitive is being modified, but there is also no way to "undo" the methods that are added to the str primitive through this trait (hence why I use the term "pollution"). Are you telling me that no one who does so (implementing on primitives) with crates, has a way to opt out of it? This seems strange to me at a language/practice level.

I am not being aggressive. I am trying to understand your perspective (and the perspective of the Rust community at large) so I can be a better developer. I appreciate any of your time that you choose to give me.

Manishearth commented 2 years ago

Grapheme handling is always expensive, is it not? Isn't this implied in the process itself?

Not everyone understands the cost, and the more you abstract it away the less clear it is. It is always expensive, but the current API exposes iterators so it's quite clear that it's an iterative API. Exposing it as an indexy API would hide that cost.

continually writing wet code that uses iterators everywhere

This is indeed the way to do it, and Rust code is typically iterator heavy. But I disagree that it's "wet", because for each user the precise way of doing this might be different. The crate exposes the tools for iteration, and it's up to the user to do the bookkeeping the way they need. It really depends on the use case as to how that is done specifically.

Now I understand this isn't exactly "pollution" in the sense that a trait is being used, and nothing itself about the standard str primitive is being modified, but there is also no way to "undo" the methods that are added to the str primitive through this trait (hence why I use the term "pollution"). Are you telling me that no one who does so (implementing on primitives) with crates, has a way to opt out of it? This seems strange to me at a language/practice level.

The methods only show up when you import the trait. So you actually have to take active steps to pollute it locally, by default there's nothing to undo.

There's also no any issue with clashes: Rust allows one to reuse the same name between trait impls and there are ways of disambiguating when necessary.

This is different than other languages because other languages don't typically have the ability to opt in to further methods by importing traits, or they don't have ways of resolving conflicts.

Manishearth commented 2 years ago

I am not sure why you feel I am being "so aggressive", so I am sorry that you feel that way. I was just confused, and your response didn't clarify your position. I am as chill as an ice cube--I am hopeful you knowing so will help you feel less attacked (???).

Also, "i'm sorry that you feel this way" is not an apology. Not that I'm asking for one, but just to be clear.

If you're interacting with open source maintainers and don't want to come off as aggressive, you have to be less demanding. Your response came off as sounding super entitled to my time, relying on some kind of obligation I do not have here. It's fine to make requests, but you gotta make them without implying some kind of obligation.

th317erd commented 2 years ago

I greatly appreciate your responses. I have learned some things from this conversation. I didn't know implementations were added only on import of the trait. I thought they were added as soon as you started using the crate. But now that I am thinking of it, this actually makes sense.

I also greatly appreciate that comment about "obligation". I fairly regularly get the "why so aggressive!?" response when interacting with nearly anyone. I never could understand why people always think I am being "aggressive". I just talk straight, engineer to engineer, assuming positive intent, and that others are assuming positive intent as well. This little hint helps me understand my problem interacting with other developers, and is greatly appreciated. ❤️