golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License

proposal: runes: create new package analogous to bytes, for rune slices #34313

Closed srinathh closed 4 years ago

srinathh commented 5 years ago

Working with and manipulating non-English data requires us to use rune slices. If we want to do operations like comparing two rune slices, replacing, indexing, etc., we have to convert to string, do those operations, and convert back, or write custom functions.

I would therefore like to propose creating a package runes, mirroring the package bytes, with functionality to work directly with rune slices rather than bytes, to support international language use cases.
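For illustration, the conversion round trip described above might look like the following minimal sketch. The helper names `indexRunes` and `replaceRunes` are hypothetical and exist only for this example; they are not part of any standard package.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// indexRunes (hypothetical helper) returns the rune index of the first
// occurrence of sep in s, or -1 if sep is absent. It converts to string,
// searches, and converts the byte offset back to a rune offset.
// Note: string(s) replaces invalid code points with U+FFFD.
func indexRunes(s, sep []rune) int {
	str := string(s)
	i := strings.Index(str, string(sep))
	if i < 0 {
		return -1
	}
	// strings.Index reports a byte offset; count the runes before it.
	return utf8.RuneCountInString(str[:i])
}

// replaceRunes (hypothetical helper) replaces every occurrence of old
// with new by round-tripping through string.
func replaceRunes(s, old, new []rune) []rune {
	return []rune(strings.ReplaceAll(string(s), string(old), string(new)))
}

func main() {
	text := []rune("h\u00e9llo w\u00f6rld")
	fmt.Println(indexRunes(text, []rune("w\u00f6rld")))
	fmt.Println(string(replaceRunes(text, []rune("l"), []rune("L"))))
}
```

Each call allocates at least one intermediate string, which is the overhead the proposal is concerned about.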

bserdar commented 5 years ago

This would not be necessary once (if) generics are implemented.

lootch commented 5 years ago

> This would not be necessary once (if) generics are implemented.

You could also claim (I do) that if something like this came along, there would be less justification for "generics".

Frankly, generics require a much shallower boundary between intrinsic and user-defined objects or, perhaps more usefully, but much more difficult to do right, a much richer "type" mechanism with open-ended attributes.

Go with generics then becomes either a beautiful academic artifact or a Frankenstein monster of a language. Guess which is more likely to happen first.

Incidentally, even knowing that the Go Team's efforts put the integrity of the language very high on the list of objectives, it is still quite revealing that there is no "Go with Generics" in the wild, whether to be disparaged or to be revered.

Lucio.

bserdar commented 5 years ago

> You could also claim (I do) that if something like this came along, there would be less justification for "generics".
>
> Frankly, generics require a much shallower boundary between intrinsic and user-defined objects or, perhaps more usefully, but much more difficult to do right, a much richer "type" mechanism with open-ended attributes.
>
> Go with generics then becomes either a beautiful academic artifact or a Frankenstein monster of a language. Guess which is more likely to happen first.

I disagree. I think the latest generics proposal has a chance to be useful without becoming a monster. The idea that in order to implement generics you have to define the semantics of the generic types precisely is what created C++/Java generics. Defining generics in terms of existing types has a better chance of being used correctly because it demands less from the author and from the reader.

> Incidentally, even knowing that the Go Team's efforts put the integrity of the language very high on the list of objectives, it is still quite revealing that there is no "Go with Generics" in the wild, whether to be disparaged or to be revered.

I think the reason for this is the experience with C++/Java generics: despite all the efforts, many counter-proposals ended up offering similar solutions.

robpike commented 5 years ago

I have trouble with your opening sentence: "Working with and manipulating non-English data requires us to use rune slices." That is presented as a fact but is an opinion, and one I just don't think is true.

I speak only English but I have spent a lot of time working with text that is not ASCII and, although it can be attractive to work with rune slices, they are not really a good solution. In fact, I think they are a trap: they don't answer most of the questions that persist with multilingual text because, despite what many want to believe, a rune is not a character. (See blog.golang.org/strings for an explanation of this.)

I would therefore prefer not to add such a package as it would promote bad practice.

srinathh commented 5 years ago

@robpike I hear you, but now I'm really puzzled. My takeaway from your blog post (which I have revisited many times over the years, including just before making this proposal today) is that runes are a better way than bytes to deal with non-English characters, smileys, and whatnot. Ranging over a string gives runes.

Now, I do recall from the article linked in your blog post that some Unicode code points are modifiers, and that some characters can be built from several different combinations of code points, which can mess things up. But what is a better way to deal with mutable collections of Unicode code points than the slice of runes that Go makes available?

robpike commented 5 years ago

Runes are code points, from which characters are made. Bytes are also things from which characters are made. Why use both?

Sometimes we need the code points themselves, but providing a package that handles slices of them will encourage the poor practice of converting back and forth between rune slices and byte slices/strings rather than the more efficient method of just iterating the bytes appropriately.
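As a rough illustration of iterating the bytes directly, the sketch below walks UTF-8 data with `range` on a string and with `utf8.DecodeRune` on a byte slice, without ever building a `[]rune`. The function name `countRune` is hypothetical and chosen only for this example.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// countRune (hypothetical helper) counts occurrences of target in the
// UTF-8 encoded slice b, decoding one code point at a time in place.
func countRune(b []byte, target rune) int {
	n := 0
	for len(b) > 0 {
		// DecodeRune returns the code point and its width in bytes;
		// invalid encodings come back as (utf8.RuneError, 1).
		r, size := utf8.DecodeRune(b)
		if r == target {
			n++
		}
		b = b[size:]
	}
	return n
}

func main() {
	s := "caf\u00e9 \u00e9t\u00e9"

	// Ranging over a string yields byte offsets and decoded runes.
	for i, r := range s {
		fmt.Printf("byte offset %d: %q\n", i, r)
	}

	fmt.Println(countRune([]byte(s), '\u00e9'))
}
```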

srinathh commented 5 years ago

May I share an example use case? Suppose we're building a simple text editor. When people enter text, they enter Unicode code points to make characters. If we use rune slices, we can simply insert the required rune at the right position.

If we are using byte slices, then for each insertion or deletion we would have to run the slice through a function that decodes the Unicode, find the right position to insert or delete, and make the change. Since this iteration can throw an error, we'd have to check for errors. If we are using strings, we'd have to reallocate for every single insertion or deletion and then run the iterations again.

Essentially, if we want to work with mutable sets of Unicode characters, then neither the bytes solution nor the strings solution seems efficient.
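To make the byte-slice case concrete, here is a minimal sketch of inserting a code point at a given rune position in UTF-8 data. The helpers `byteOffset` and `insertRune` are hypothetical names for this example. One detail worth noting: `utf8.DecodeRune` does not return an error value; invalid bytes decode as `utf8.RuneError` with width 1, so the scan itself needs no error check.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteOffset (hypothetical helper) returns the byte offset of the n-th
// rune in b (0-based), or len(b) if b holds fewer than n runes.
func byteOffset(b []byte, n int) int {
	off := 0
	for ; n > 0 && off < len(b); n-- {
		_, size := utf8.DecodeRune(b[off:])
		off += size
	}
	return off
}

// insertRune (hypothetical helper) inserts r before the runeIdx-th rune
// of b and returns the updated slice.
func insertRune(b []byte, runeIdx int, r rune) []byte {
	off := byteOffset(b, runeIdx)

	var enc [utf8.UTFMax]byte
	n := utf8.EncodeRune(enc[:], r)

	b = append(b, enc[:n]...)        // grow the slice by n bytes
	copy(b[off+n:], b[off:len(b)-n]) // shift the tail right by n
	copy(b[off:], enc[:n])           // write the encoded rune
	return b
}

func main() {
	buf := []byte("h\u00e9llo")
	buf = insertRune(buf, 2, 'X')
	fmt.Println(string(buf))
}
```

The scan to find the offset is linear in the length of the text, which is the cost the comment above refers to; the insertion itself shifts the tail just as it would in a rune slice.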

ghost commented 4 years ago

off topic, but I thought to mention Perl6 here

https://www.evanmiller.org/a-review-of-perl-6.html

cf: Strings and Regexes

caveat, see footnote 2

a contributor to Perl6

https://perlgeek.de/

also wrote this module

https://metacpan.org/pod/Perl6::Str

ghost commented 4 years ago

the idea of using rope data structures in an editor intrigued me at one point

but I've never taken the time to look into it

robpike commented 4 years ago

> Essentially, if we want to work with mutable sets of Unicode characters, then neither the bytes solution nor the strings solution seems efficient.

And the runes solution is misleading and leads to incorrect thinking. Text is hard, and rune slices solve almost none of what makes text hard.

ghost commented 4 years ago

on a side note

A Philosophy of Software Design

by J. Ousterhout

The book includes commentary on a student project of writing a text editor.

rsc commented 4 years ago

Using runes in a text editor seems like a good idea at first, but it fails badly once you get to Unicode combining sequences, like e + combining acute vs é. The former is two runes while the latter is one. And for some sequences there's not even a single-rune equivalent. In general, Unicode text processing requires considering largish sequences of input, not just a single byte and not just a single rune either. There's little benefit to []rune as the representation, and there are real drawbacks to having two representations. So Go has standardized on []byte/string and UTF-8.
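The e + combining acute case can be checked with a short sketch. The helper `runeLen` is a hypothetical name used only here.

```go
package main

import "fmt"

// runeLen (hypothetical helper) reports how many code points s holds.
func runeLen(s string) int {
	return len([]rune(s))
}

func main() {
	precomposed := "\u00e9" // é as a single code point
	decomposed := "e\u0301" // e followed by U+0301 COMBINING ACUTE ACCENT

	fmt.Println(runeLen(precomposed))      // 1
	fmt.Println(runeLen(decomposed))       // 2
	fmt.Println(precomposed == decomposed) // false
}
```

Both strings render as the same character, yet they differ in rune count and compare unequal, so indexing a `[]rune` by position does not index characters. Normalizing the text first (for example with the `golang.org/x/text/unicode/norm` package, which lives outside the standard library) is one way to reduce such differences, though it does not help for sequences that have no precomposed form.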

If you find that []rune works really well for your editor somehow (maybe you ignore all the multirune characters), that's fine. A "runes" library forked from "bytes" could easily be maintained as a go get-able package outside the standard library.

Note that generics are not going to help here, because the encoding stored in the underlying data is different between []byte and []rune.
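A small sketch of this point, written with the type-parameter syntax that later shipped in Go 1.18 (an assumption relative to this thread, which predates it). The generic `index` function is hypothetical. It treats elements as opaque values, so on a `[]byte` it knows nothing about UTF-8, whereas functions such as `bytes.IndexRune` must decode the encoding.

```go
package main

import (
	"bytes"
	"fmt"
)

// index (hypothetical helper) returns the element index of the first
// occurrence of sep in s, or -1. It compares elements one by one and
// has no notion of any encoding stored in them.
func index[T comparable](s, sep []T) int {
	for i := 0; i+len(sep) <= len(s); i++ {
		match := true
		for j := range sep {
			if s[i+j] != sep[j] {
				match = false
				break
			}
		}
		if match {
			return i
		}
	}
	return -1
}

func main() {
	text := "\u65e5\u672c\u8a9e" // three code points, three bytes each

	fmt.Println(index([]rune(text), []rune("\u8a9e"))) // index in code points
	fmt.Println(index([]byte(text), []byte("\u8a9e"))) // index in bytes

	// bytes.IndexRune takes a code point and decodes UTF-8 to find it;
	// a []rune equivalent would be a plain element comparison instead.
	fmt.Println(bytes.IndexRune([]byte(text), '\u8a9e'))
}
```

The same generic body gives results in different units for the two slice types, and the UTF-8-aware parts of the bytes package have no counterpart that a shared generic implementation could supply.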

This is a likely decline. Leaving open for a week for final comments.

ghost commented 4 years ago

Hopefully my comment won't be interpreted as cultural bias.

I'm opposed to this for linguistic reasons.

Rune is used in Plan 9, and also appears in Golang.

The suggested use diverges excessively from the original North Germanic languages' use of the word.

D. Mendeleev used एक (eka) and द्वि (dvi) for certain postulated elements.

экаалюминій (eka-aluminium), экаборъ (eka-boron), экасилицій (eka-silicon), двимарганец (dvi-manganese)

rsc commented 4 years ago

There have been no comments objecting to declining this issue. Declined.