CommonPrefixLen is harder to use and less intuitive. I hear your argument against CommonPrefix, but there is an easy rebuttal, as suggested by yourself and by this proposed doc comment:

CommonPrefix returns the longest prefix shared by the two arguments. The return value is sliced from the first argument.

If you don't like that way around, flip the arguments.

Both functions would be useful for Go developers. If this is accepted, I suggest both be added to the standard library.
I'd expect a function in the bytes package to operate on bytes. But several functions in the strings package operate specifically on runes. Would CommonPrefix{,Len} split UTF-8 runes?
https://go.dev/play/p/DfUyiNfhvmI
"¼" ¼ 0xc2bc [U+00BC]
"½" ½ 0xc2bd [U+00BD]
"\xc2" � 0xc2 [U+FFFD]
@rhysh Very good question.
[rob] CommonPrefixLen is harder to use and less intuitive.
In the linked examples, the specific callers want the Len variant in a slim majority of cases. I'd be ok with just CommonPrefix though.
[rhys] But several functions in the strings package operate specifically on runes. Would CommonPrefix{,Len} split UTF-8 runes?
Yes. Though this would make it the first function in the strings package that, when given UTF-8 arguments, does not always return a UTF-8 result.
This proposal has been added to the active column of the proposals project and will now be reviewed at the weekly proposal review meetings. — rsc for the proposal review group
Appears that there are 5 possible variations here:

strings.CommonPrefixLen: operates on bytes, returns # of bytes
strings.CommonPrefixLen: operates on runes, returns # of bytes
strings.CommonPrefixLen: operates on runes, returns # of runes (probably useless)
strings.CommonPrefix: operates on bytes, returns slice of first arg.
strings.CommonPrefix: operates on runes, returns slice of first arg.

Yes. Though this would make it the first function in the strings package that, when given UTF-8 arguments, does not always return a UTF-8 result.
Would the UTF-8 aware versions of these functions be compatible with the examples you presented?
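To make the variations concrete, here is a worked example of my own: with x = "αβ" and y = "αγ" the inputs share three bytes but only one rune, so the byte-based and rune-based variants disagree.

package main

import "fmt"

func main() {
	x, y := "αβ", "αγ" // 0xce 0xb1 0xce 0xb2 vs. 0xce 0xb1 0xce 0xb3

	// Byte-based: the common prefix is 3 bytes, "α" plus a dangling 0xce.
	n := 0
	for n < len(x) && n < len(y) && x[n] == y[n] {
		n++
	}
	fmt.Printf("%d %q\n", n, x[:n]) // 3 "α\xce"

	// Rune-based: the common prefix would be just "α" (2 bytes, 1 rune).
}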
If we add to strings, we should add to bytes, which says 'Package bytes implements functions for the manipulation of byte slices. It is analogous to the facilities of the strings package.' I also expect that strings.CommonPrefixLen(string(x), string(y)) == bytes.CommonPrefixLen(x, y). I should never be surprised by changing code that works on strings to work on []byte by s/strings./bytes./g.
Of course, once you get to combining characters, even UTF-8 boundaries are not always right. So maybe it should be bytes always.
Looking at the examples, it does seem that many of them want the length anyway, so returning the length and avoiding the "which one gets sliced" might be best.
That gets us to "bytes only - no UTF-8 worries - and returning the length".
func CommonPrefixLen(x, y string) int {
n := 0
for n < len(x) && n < len(y) && x[n] == y[n] {
n++
}
return n
}
And same for []byte. What do people think?
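For completeness, the []byte analogue mentioned above would be the same loop with a different element type (a sketch, not a committed API):

func CommonPrefixLen(x, y []byte) int {
	n := 0
	for n < len(x) && n < len(y) && x[n] == y[n] {
		n++
	}
	return n
}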
I think it's more helpful to slice the result than return its length, only to slice it again, so maybe it should go in anyway. It just seems more natural as an API. If you want both, though, that's fine. And those examples don't feel representative to me, but then I don't write the kind of code others do.
Which will you slice? As I said, do and document the first argument.
The arguments in favor of CommonPrefixLen are that (1) it avoids the "which is sliced" question, and (2) in the examples we looked at, the code was already using the length more often than the slice itself.
If the examples we looked at are not representative, does anyone want to contribute more examples of existing code that would use CommonPrefix or CommonPrefixLen?
Another potential reason for CommonPrefixLen is that this length might fall in the middle of a UTF-8 encoding or a grapheme cluster or any of those things. Making the user do the slice might make them think a bit about whether it's a sensible place to slice.
@rsc Your last paragraph doesn't seem compelling. In writing
prefix := s[:strings.CommonPrefixLen(s, t)]
at what moment do I ask where I'm slicing that I wouldn't ask with
prefix := strings.CommonPrefix(s, t)
I'm not arguing against CommonPrefixLen, I'm just arguing that it feels sound to me to offer CommonPrefix as well. And for the third time, the question of which to slice is trivial to answer and understand. It's not an issue at all.
@robpike, do you agree with the semantics that CommonPrefixLen("α", "β") = 1, and so CommonPrefix("α", "β") = "\xce" ? I don't mind returning "\xce", it just seems more error-prone than the 1.
If we add to strings, we should add to bytes, which says 'Package bytes implements functions for the manipulation of byte slices. It is analogous to the facilities of the strings package.' I also expect that strings.CommonPrefixLen(string(x), string(y)) == bytes.CommonPrefixLen(x, y). I should never be surprised by changing code that works on strings to work on []byte by s/strings./bytes./g.
There's value in uniformity between strings & bytes, but having CommonPrefix("βα", "ββ") return "β\xce" feels like a footgun.
Currently, where the strings package has to split user-provided strings, it does so using runes/Unicode code points: ContainsAny, IndexAny, LastIndexAny, Map, Trim, TrimLeft, TrimRight. The exceptions are functions that have a Byte suffix.
I think whatever is added to strings should keep operating on runes/code points unless otherwise explicitly labeled.
@robpike, do you agree with the semantics that CommonPrefixLen("α", "β") = 1, and so CommonPrefix("α", "β") = "\xce" ? I don't mind returning "\xce", it just seems more error-prone than the 1.
Are we talking byte slices or strings? For strings, I would expect the answer to be 0. For byte slices, I think the jury is still out.
For byte slices, the correct answer is (I assert) obviously 1, but for strings things are less clear: consistency across the two packages would demand that the answer also be 1, even though that means slicing a rune in half. I can live with that (given a suitable warning in the doc comment), but if there is no clear consensus that this is the right behavior, then that's (unfortunately) an argument against adding the function at all.
@adonovan I see it exactly the opposite, that the value for strings is rune-based, and the value for bytes unclear, with consistency being a consideration in the decision. The strings package shouldn't slice a rune in half.
Ignoring memory aliasing issues, I expect it to be safe to swap all usages of the bytes package with the strings package and vice versa with no change in semantics. Are there existing cases where this property does not hold?
Change https://go.dev/cl/413716 mentions this issue: bytes: add fuzz test to ensure compatibility with strings
Are there existing cases where this property does not hold?
I wrote a fuzz test (https://go.dev/cl/413716) verifying whether bytes and strings are identical for all common functions, and it seems (after a few hours of fuzzing) the answer is yes (except for a minor bug #53511).
As such, I would expect the behavior of bytes.CommonPrefix and strings.CommonPrefix to be identical (whether it be byte-based or rune-based).
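For illustration, an equivalence check in that spirit could look like the following native fuzz test (a sketch of my own, not the code from CL 413716, and covering only Index where the real test covers many functions):

// In some package's _test.go file.
package p

import (
	"bytes"
	"strings"
	"testing"
)

// FuzzIndexEquivalence checks one instance of the bytes/strings
// equivalence property discussed above.
func FuzzIndexEquivalence(f *testing.F) {
	f.Add("αβ", "β")
	f.Fuzz(func(t *testing.T, s, sep string) {
		got := bytes.Index([]byte(s), []byte(sep))
		want := strings.Index(s, sep)
		if got != want {
			t.Errorf("bytes.Index(%q, %q) = %d, strings.Index = %d", s, sep, got, want)
		}
	})
}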
I see two possible solutions: PrefixRunes and PrefixBytes (to both the strings and bytes packages).

Hmm, I think I have been laboring under the misapprehension that the bytes package has a different viewpoint of the contents of byte strings than the strings package does about the content of strings (arbitrary binary vs. UTF-8 encoded text). But I see that is not the case, and Rob's argument that both versions should respect rune boundaries now makes a lot more sense. All of the specific cases I encountered would work with a rune-boundary-respecting implementation. Therefore I propose this new API:
package strings
// CommonPrefixLen returns the length of the common prefix of two strings containing UTF-8 text.
// The resulting value is the index of a rune boundary.
func CommonPrefixLen(x, y string) int
package bytes
s/string/[]byte/
s/strings/byte slices/
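A minimal sketch of what these rune-boundary semantics could look like (my own illustration, not a proposed implementation; it uses unicode/utf8.RuneStart and assumes both inputs are valid UTF-8):

// commonPrefixLen compares bytes and then retreats to a rune boundary, so the
// returned length never splits a multi-byte UTF-8 encoding.
func commonPrefixLen(x, y string) int {
	n := 0
	for n < len(x) && n < len(y) && x[n] == y[n] {
		n++
	}
	// Back up while x[n] is a UTF-8 continuation byte, so the cut never
	// lands inside an encoding. (If n == len(x), x itself ends on a boundary.)
	for n > 0 && n < len(x) && !utf8.RuneStart(x[n]) {
		n--
	}
	return n
}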
Is it worth stating explicitly that the function compares encodings, not decodings? That is, two different encodings of U+FFFD cannot be part of the common prefix?
[Russ raises a good point of objection below.]
Suppose we do the code point semantics. Then
CommonPrefix("👨\u200d🔧", "👨\u200d🏭") = "👨\u200d" (Man, ZWJ, Wrench and Man, ZWJ, Factory produces Man, ZWJ)
There's going to be weird boundary cases no matter what.
It would also be very surprising for strings and bytes to differ here. I don't think there are any functions they have in common today that change semantics based on whether your UTF-8 is in []byte or string.
There's also an option to add this to the generic slices package; it has well-defined semantics for any slice with comparable elements. That makes the semantics clear--indexing yields bytes, so it is byte-by-byte. The generic code could do a type assertion and dispatch to the faster code for specific types.
To solve the "but it slices runes / grapheme clusters in half" problem, we could have a function that, given a string/byte slice, returns the prefix that end at a rune/cluster boundary. Then if you care about runes, you first find the shared prefix and then take the valid prefix of that. (Given that std is currently completely agnostic to grapheme clusters, it'd probably be UTF8 rune prefix, not valid cluster prefix. It'd be nice to have good grapheme cluster support in std, but that's an entirely different proposal.) utf8.DecodeLastRune[InString] is close to this but requires a little bit of work to hook up correctly.
That's a generally useful API, and it composes well. By operating on a single string/byte slice, it doesn't raise any questions about which input it aliases.
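One way to hook up utf8.DecodeLastRuneInString for this is sketched below (the name validRunePrefix is made up; this is only an illustration of the composable helper being described):

// validRunePrefix returns the longest prefix of s that ends on a UTF-8 rune
// boundary, dropping any trailing bytes that do not form a complete rune.
func validRunePrefix(s string) string {
	for len(s) > 0 {
		r, size := utf8.DecodeLastRuneInString(s)
		if r != utf8.RuneError || size > 1 {
			// The string ends with a complete rune (possibly a literal U+FFFD).
			return s
		}
		s = s[:len(s)-1] // drop one byte of the incomplete tail and retry
	}
	return s
}

A caller that wants rune-respecting behavior could then write something like validRunePrefix(CommonPrefix(x, y)).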
There's also an option to add this to the generic slices package
One of the major advantages of having it in strings and bytes is that we can optimize for the fact that this is just a byte sequence of some kind, so that we can utilize multi-byte comparisons. In @adonovan's original post, he reported a 6x improvement over a naive element-to-element comparison.
That said, I guess we can have a switch statement in the generic implementation of slices.CommonPrefix to optimize for strings and []byte.
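A rough sketch of the generic shape being discussed (no such function exists in the slices package today): an implementation could type-assert via any(x).([]byte) and dispatch to an optimized byte-wise routine, with the portable loop below as the fallback.

package slices

// CommonPrefixLen is a hypothetical generic version for slices of any
// comparable element type; it compares element by element.
func CommonPrefixLen[E comparable](x, y []E) int {
	n := 0
	for n < len(x) && n < len(y) && x[n] == y[n] {
		n++
	}
	return n
}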
It seems like we are getting close to bike-shedding this. To try to break the logjam, I suggest we add:
package strings
// CommonPrefix returns the longest prefix p of x such that p == y[:len(p)].
// It considers the inputs as sequences of bytes, without any consideration of UTF-8.
func CommonPrefix(x, y string) string
package bytes
// CommonPrefix returns the longest prefix p of x such that p == y[:len(p)].
// It considers the inputs as sequences of bytes, without any consideration of UTF-8.
func CommonPrefix(x, y []byte) []byte
Does anyone object to this?
I think it should slice the first argument instead of the second one.
@icholy The comment above says that the CommonPrefix functions return a slice of x, which is the first argument. So I think what you say is already the case. If not, could you show some code to explain what you mean? Thanks.
I think it's a serious mistake to have CommonPrefix, at least for strings, slice into a UTF-8 encoding. Very serious. I would never use it if it did, and would recommend that others avoid it as well. Not just out of safety, although that's paramount, I can't think why I'd want it.
In the following example I expect it to print true or nothing.
if strings.HasPrefix(x, p) {
	print(strings.CommonPrefix(x, p) == p)
}
It can print false if CommonPrefix considers the strings to be UTF-8 encoded.
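To make this concrete, here is my own instantiation of the example (strings.CommonPrefix does not exist, so its two possible results are only described in comments): take x = "αβ" and a prefix p that cuts the first rune in half.

package main

import (
	"fmt"
	"strings"
)

func main() {
	x := "αβ"  // 0xce 0xb1 0xce 0xb2
	p := x[:1] // "\xce": a byte prefix that ends mid-rune

	fmt.Println(strings.HasPrefix(x, p)) // true: HasPrefix compares bytes

	// A byte-wise CommonPrefix(x, p) would return "\xce", equal to p (true),
	// while a rune-boundary-respecting one would return "", so the comparison
	// in the example above would print false.
}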
@gazerro Another interesting example, and I agree that's a desirable consistency property.
It appears there is no good solution to all these constraints. Here's an alternative approach:
We add a bytes.CommonPrefixLen function that works on arbitrary bytes without regard to UTF-8. But we don't add the analogous function to the strings package, and we document the reason why. Users that want the efficient implementation on strings interpreted as raw bytes can call bytes.CommonPrefixLen([]byte(str1), []byte(str2)) and the compiler will ensure that these conversions do not escape.
I disagree with the notion that strings and bytes differ based on their handling of UTF-8. As things stand today, bytes already has some understanding of UTF-8 in some functions, but not in others. Thus, whether UTF-8 is considered is not package-specific, but rather function-specific (and appropriately documented in each case). The distinction between the two packages is just that bytes deals with []byte and strings deals with string.
Another point of view is to see how CommonPrefix would be used in the examples in the standard library.

In the following examples, at least one of the two strings can only contain ASCII characters, so the returned slice always contains only ASCII characters, no matter whether CommonPrefix works on arbitrary bytes or UTF-8:

In the following example, CommonPrefix must work on arbitrary bytes in order to be used:

In the following example, it is preferable for CommonPrefix to work on arbitrary bytes, but it is not necessary:

For the following, however, CommonPrefix cannot be used because they need a particular function:
There's clearly no consensus here, and we've made it 10+ years without this function. It sounds like we should probably decline the proposal.
I think this should be re-opened as an addition to exp/slices.
Based on the discussion above, this proposal seems like a likely decline. — rsc for the proposal review group
It looks like the UTF-8 vs. bytes discussion missed the main point for adding the CommonPrefixLen() function to the bytes package: performance for text processing and compression algorithms. It is trivial to write the function in Go, but it isn't trivial to optimize it for performance, especially across multiple GOARCHs. So it would be great to have the optimized function in the standard library, in the same way as bytes.Equal() is optimized now.
As for the UTF-8 vs. bytes semantics, it is better not to implement this function in the strings package, because performance usually isn't critical for UTF-8 processing. So it is better to write a trivial function every time the common prefix of UTF-8-encoded strings is needed.
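One portable way to go faster than the byte-at-a-time loop is to compare eight bytes at a time using encoding/binary and finish byte by byte near the point of divergence. This is a sketch of my own, not the code from CL 408116, and not necessarily what an optimized standard-library version would do:

// commonPrefixLen compares 8-byte words first, then finishes byte by byte.
// When two words are equal, all eight bytes match, so it is safe to skip them.
func commonPrefixLen(x, y []byte) int {
	n := 0
	for n+8 <= len(x) && n+8 <= len(y) &&
		binary.LittleEndian.Uint64(x[n:]) == binary.LittleEndian.Uint64(y[n:]) {
		n += 8
	}
	for n < len(x) && n < len(y) && x[n] == y[n] {
		n++
	}
	return n
}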
A common recurring operation on strings is to remove or advance past the common prefix of two strings. Here are some examples from the main Go repository and the golang.org/x/tools repo:
https://github.com/golang/tools/blob/master/go/internal/gcimporter/bexport.go#L264
https://github.com/golang/tools/blob/master/internal/lsp/source/comment.go#L229
https://github.com/golang/go/blob/master/src/cmd/compile/internal/base/timings.go#L122
https://github.com/golang/go/blob/master/src/cmd/compile/internal/ir/dump.go#L129
https://github.com/golang/go/blob/master/src/strings/replace.go#L170
https://github.com/golang/go/blob/master/src/net/addrselect.go#L199
https://github.com/golang/go/blob/master/src/go/printer/printer.go#L485
https://github.com/golang/go/blob/master/src/go/doc/comment/parse.go#L397
This function is easy enough to implement by hand:
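For example, a naive version (the same loop shown elsewhere in this discussion) is:

func CommonPrefixLen(x, y string) int {
	n := 0
	for n < len(x) && n < len(y) && x[n] == y[n] {
		n++
	}
	return n
}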
but it's possible to make it significantly faster than the naive version. This CL contains a version that runs 6x faster yet is still safe and portable. The code describes ways that it might be made much faster still at the cost of unsafe or assembly: https://go-review.googlesource.com/c/go/+/408116

While benchmarking and optimizing a diff algorithm this week, I noticed that one of the main profiling hotspots was, in essence, a CommonPrefixLen operation: https://go-review.googlesource.com/c/tools/+/396855/11/internal/lsp/diff/lcs/old.go#219 (It's a little hard to see in this form, but in a private branch I refactored the code so that the dynamic dispatch of eq(x, y byte) was moved outside the loop. In essence, the primitive operation over which this algorithm is abstracted is CommonPrefixLen.)

I propose that we add

func CommonPrefixLen(x, y string) int

to the standard strings package. An analogous function should be added to the bytes package too.

We could also add a CommonPrefix(x, y string) string function, which actually strips off the prefix rather than merely returning its length, but I don't think this is necessary as it is trivial to implement. Also, whereas CommonPrefixLen makes sense for bytes, it's less clear what bytes.CommonPrefix should do, since its result must alias one of the arguments. The first one presumably? Better to avoid the question.