golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License
124k stars 17.67k forks source link

proposal: strings: add func CommonPrefixLen(x, y string) int #53120

Closed adonovan closed 2 years ago

adonovan commented 2 years ago

A common recurring operation on strings is to remove or advance past the common prefix of two strings. Here are some examples from the main Go repository and the golang.org/x/tools repo:

https://github.com/golang/tools/blob/master/go/internal/gcimporter/bexport.go#L264 https://github.com/golang/tools/blob/master/internal/lsp/source/comment.go#L229 https://github.com/golang/go/blob/master/src/cmd/compile/internal/base/timings.go#L122 https://github.com/golang/go/blob/master/src/cmd/compile/internal/ir/dump.go#L129 https://github.com/golang/go/blob/master/src/strings/replace.go#L170 https://github.com/golang/go/blob/master/src/net/addrselect.go#L199 https://github.com/golang/go/blob/master/src/go/printer/printer.go#L485 https://github.com/golang/go/blob/master/src/go/doc/comment/parse.go#L397

This function is easy enough to implement by hand:

commonLen :=  min(len(a), len(b))
for i = 0; i < commonLen && a[i] == b[i]; i++ {}

but it's possible to make it significantly faster than the naive version. This CL contains a version that runs 6x faster yet is still safe and portable. The code describes ways that it might be made much faster still at the cost of unsafe or assembly: https://go-review.googlesource.com/c/go/+/408116

While benchmarking and optimizing a diff algorithm this week, I noticed that one of the main profiling hotspots was, in essence, a CommonPrefixLen operation: https://go-review.googlesource.com/c/tools/+/396855/11/internal/lsp/diff/lcs/old.go#219 (It's a little hard to see in this form, but in a private branch I refactored the code so that the dynamic dispatch of eq(x, y byte) was moved outside the loop. In essence the primitive operation over which this algorithm is abstracted is CommonPrefixLen.)

I propose that we add func CommonPrefixLen(x, y string) int to the standard strings package. An analogous function should be added to the bytes package too.

We could also add a CommonPrefix(x, y string) string function, which actually strips off the prefix rather than merely returning its length, but I don't think this is necessary as it is trivial to implement. Also, whereas CommonPrefixLen makes sense for bytes, it's less clear what bytes.CommonPrefix should do, since its result must alias one of the arguments. The first one presumably? Better to avoid the question.

robpike commented 2 years ago

CommonPrefixLen is harder to use and less intuitive. I hear your argument against CommonPrefix but there is an easy rebuttal, as suggested by yourself and by this proposed doc comment.

CommonPrefix returns the longest prefix shared by the two arguments. The return value is sliced from the first argument.

If you don't like that way around, flip the arguments.

SeanTolstoyevski commented 2 years ago

Both functions will be useful for Golang developers. If this is accepted, I suggest both added to the standard library.

rhysh commented 2 years ago

I'd expect a function in the bytes package would operate on bytes. But several functions in the strings package operate specifically on runes. Would CommonPrefix{,Len} split UTF-8 runes?

https://go.dev/play/p/DfUyiNfhvmI

"¼" ¼ 0xc2bc [U+00BC]
"½" ½ 0xc2bd [U+00BD]
"\xc2" � 0xc2 [U+FFFD]
robpike commented 2 years ago

@rhysh Very good question.

adonovan commented 2 years ago

[rob] CommonPrefixLen is harder to use and less intuitive.

In the linked examples, the specific callers want the Len variant in a slim majority of cases. I'd be ok with just CommonPrefix though.

[rhys] But several functions in the strings package operate specifically on runes. Would CommonPrefix{,Len} split UTF-8 runes?

Yes. Though this would make it the first function in the strings package that, when given UTF-8 arguments, does not always return a UTF-8 result.

rsc commented 2 years ago

This proposal has been added to the active column of the proposals project and will now be reviewed at the weekly proposal review meetings. — rsc for the proposal review group

icholy commented 2 years ago

Appears that there are 5 possible variations here:

  1. strings.CommonPrefixLen operates on bytes, returns # of bytes
  2. strings.CommonPrefixLen operates on runes, returns # of bytes
  3. strings.CommonPrefixLen operates on runes, returns # of runes (probably useless)
  4. strings.CommonPrefix operates on bytes, returns slice of first arg.
  5. strings.CommonPrefix operates on runes, returns slice of first arg.

Yes. Though this would make it the first function in the strings package that, when given UTF-8 arguments, does not always return a UTF-8 result.

Would the UTF-8 aware versions of these functions be compatible with the examples you presented?

rsc commented 2 years ago

If we add to strings, we should add to bytes, which says 'Package bytes implements functions for the manipulation of byte slices. It is analogous to the facilities of the strings package.' I also expect that strings.CommonPrefixLen(string(x), string(y)) == bytes.CommonPrefixLen(x, y). I should never be surprised by changing code that works on strings to work on []byte by s/strings./bytes./g.

Of course, once you get to combining characters, even UTF-8 boundaries are not always right. So maybe it should be bytes always.

Looking at the examples, it does seem that many of them want the length anyway, so returning the length and avoiding the "which one gets sliced" might be best.

That gets us to "bytes only - no UTF-8 worries - and returning the length".

func CommonPrefixLen(x, y string) int {
    n := 0
    for n < len(x) && n < len(y) && x[n] == y[n] {
        n++
    }
    return n
}

And same for []byte. What do people think?

robpike commented 2 years ago

I think it's more helpful to slice the result than return its length, only to slice it again, so maybe it should go in anyway. It just seems more natural as an API. If you want both, though, that's fine. And those examples don't feel representative to me, but then I don't write the kind of code others do.

Which will you slice? As I said, do and document the first argument.

rsc commented 2 years ago

The arguments in favor of CommonPrefixLen are that (1) it avoids the "which is sliced" question, and (2) in the examples we looked at, the code was already using the length more often than the slice itself.

If the examples we looked at are not representative, does anyone want to contribute more examples of existing code that would use CommonPrefix or CommonPrefixLen?

Another potential reason for CommonPrefixLen is that this length might be in the middle of a UTF-8 boundary or a grapheme cluster or any of those things. Making the user do the slice might make them think a bit about whether it's a sensible place to slice.

robpike commented 2 years ago

@rsc Your last paragraph doesn't seem compelling. In writing

prefix := s[:strings.CommonPrefixLen(s, t)]

at what moment do I ask where I'm slicing that I wouldn't ask with

prefix := strings.CommonPrefix(s, t)

I'm not arguing against CommonPrefixLen, I'm just arguing that it feels sound to me to offer CommonPrefix as well. And for the third time, the question of which to slice is trivial to answer and understand. It's not an issue at all.

rsc commented 2 years ago

@robpike, do you agree with the semantics that CommonPrefixLen("α", "β") = 1, and so CommonPrefix("α", "β") = "\xce" ? I don't mind returning "\xce", it just seems more error-prone than the 1.

icholy commented 2 years ago

If we add to strings, we should add to bytes, which says 'Package bytes implements functions for the manipulation of byte slices. It is analogous to the facilities of the strings package.' I also expect that strings.CommonPrefixLen(string(x), string(y)) == bytes.CommonPrefixLen(x, y). I should never be surprised by changing code that works on strings to work on []byte by s/strings./bytes./g.

There's value in uniformity between strings & bytes, but having CommonPrefix("βα", "ββ") return "β\xce" feels like a footgun.

seankhliao commented 2 years ago

Currently where the strings package has to split user provided strings, it does so using runes/Unicode code points: ContainsAny, IndexAny, LastIndexAny, Map, Trim, TrimLeft, TrimRight. The exceptions are functions that have a Byte suffix. I think whatever is added to strings should keep operating on runes/codepoints unless otherwise explicitly labeled.

robpike commented 2 years ago

@robpike, do you agree with the semantics that CommonPrefixLen("α", "β") = 1, and so CommonPrefix("α", "β") = "\xce" ? I don't mind returning "\xce", it just seems more error-prone than the 1.

Are we talking byte slices or strings? For strings, I would expect the answer to be 0. For byte slices, I think the jury is still out.

adonovan commented 2 years ago

For byte slices, the correct answer is (I assert) obviously 1, but for strings things are less clear: consistency across the two packages would demand that the answer also be 1, even though that means slicing a rune in half. I can live with that (given a suitable warning in the doc comment), but if there is no clear consensus that this is the right behavior, then that's (unfortunately) an argument against adding the function at all.

robpike commented 2 years ago

@adonovan I see it exactly the opposite, that the value for strings is rune-based, and the value for bytes unclear, with consistency being a consideration in the decision. The strings package shouldn't slice a rune in half.

dsnet commented 2 years ago

Ignoring memory aliasing issues, I expect it to be safe to swap all usages of the bytes package with the strings package and vice-versa with no change in semantics. Are there existing cases where this property does not hold?

gopherbot commented 2 years ago

Change https://go.dev/cl/413716 mentions this issue: bytes: add fuzz test to ensure compatibility with strings

dsnet commented 2 years ago

Are there existing cases where this property does not hold?

I wrote a fuzz test (https://go.dev/cl/413716) verifying whether bytes and strings are identical for all common functions, and it seems (after a few hours of fuzzing) the answer is yes (except for a minor bug #53511).

As such, I would expect the behavior of bytes.CommonPrefix and strings.CommonPrefix to be identical (whether it be byte-based or rune-based).

mibk commented 2 years ago

I see two possible solutions:

adonovan commented 2 years ago

Hmm, I think I have been laboring under the misapprehension that the bytes package has a different viewpoint of the contents of byte strings than the strings package does about the content of strings (arbitrary binary vs UTF-8 encoded text). But I see that is not the case, and Rob's argument that both versions should respect rune boundaries now makes a lot more sense. All of the specific cases I encountered would work with a rune-boundary respecting implementation. Therefore I propose this new API:

package strings

// CommonPrefixLen returns the length of the common prefix of two strings containing UTF-8 text.
// The resulting value is the index of a rune boundary.
func CommonPrefixLen(x, y string) int

package bytes

s/string/[]byte/
s/strings/byte slices/

It is worth stating explicitly that the function compares encodings, not decodings? That is, two different encodings of U+FFFD cannot be part of the common prefix?

[Russ raises a good point of objection below.]

rsc commented 2 years ago

Suppose we do the code point semantics. Then

CommonPrefix("🧑‍🔧", "👨‍🏭") = "🧑‍\u200d" (Man, ZWJ, Wrench and Man, ZWJ, Factory produces Man, ZWJ)

There's going to be weird boundary cases no matter what.

It would also be very surprising for strings and bytes to differ here. I don't think there are any functions they have in common today that change semantics based on whether your UTF-8 is in []byte or string.

josharian commented 2 years ago

There's also an option to add this to the generic slices package; it has well-defined semantics for any slice with comparable elements. That makes the semantics clear--indexing yields bytes, so it is byte-by-byte. The generic code could do a type assertion and dispatch to the faster code for specific types.

josharian commented 2 years ago

To solve the "but it slices runes / grapheme clusters in half" problem, we could have a function that, given a string/byte slice, returns the prefix that end at a rune/cluster boundary. Then if you care about runes, you first find the shared prefix and then take the valid prefix of that. (Given that std is currently completely agnostic to grapheme clusters, it'd probably be UTF8 rune prefix, not valid cluster prefix. It'd be nice to have good grapheme cluster support in std, but that's an entirely different proposal.) utf8.DecodeLastRune[InString] is close to this but requires a little bit of work to hook up correctly.

That's a generally useful API, and it composes well. By operating on a single string/byte slice, it doesn't raise any questions about which input it aliases.

dsnet commented 2 years ago

There's also an option to add this to the generic slices package

One of the major advantages of having it in strings and bytes is that we can optimize for the fact that this is a just a byte sequence of some kind so that we can utilize multi-byte comparisons. In @adonovan original post, he reported a 6x improvement over a naive element-to-element comparison.

That said, I guess we can have a switch statement in the generic implementation of slices.CommonPrefix to optimize for strings and []byte.

rsc commented 2 years ago

It seems like we are getting close to bike-shedding this. To try to break the logjam, I suggest we add:

package strings

// CommonPrefix returns the longest prefix p of x such that p == y[:len(p)].
// It considers the inputs as sequences of bytes, without any consideration of UTF-8.
func CommonPrefix(x, y string) string

package bytes

// CommonPrefix returns the longest prefix p of x such that p == y[:len(p)].
// It considers the inputs as sequences of bytes, without any consideration of UTF-8.
func CommonPrefix(x, y []byte) []byte

Does anyone object to this?

icholy commented 2 years ago

I think it should slice the first argument instead of the second one.

ianlancetaylor commented 2 years ago

@icholy The comment above says that the CommonPrefix functions return a slice of x, which is the first argument. So I think what you say is already the case. If not, could you show some code to explain what you mean? Thanks.

robpike commented 2 years ago

I think it's a serious mistake to have CommonPrefix, at least for strings, slice into a UTF-8 encoding. Very serious. I would never use it if it did, and would recommend that others avoid it as well. Not just out of safety, although that's paramount, I can't think why I'd want it.

gazerro commented 2 years ago

In the following example I expect it to print true or nothing.

if strings.HasPrefix(x, p) {
     print(strings.CommonPrefix(x, p) == p)
}

It can print false if CommonPrefix considers the strings to be UTF-8 encoded.

adonovan commented 2 years ago

@gazerro Another interesting example, and I agree that's a desirable consistency property.

It appears there is no good solution to all these constraints. Here's an alternative approach:

We add a bytes.CommonPrefixLen function that works on arbitrary bytes without regard to UTF-8. But we don't add the analogous function to the strings package, and we document the reason why. Users that want the efficient implementation on strings interpreted as raw bytes can call bytes.CommonPrefixLen([]byte(str1), []byte(str2)) and the compiler will ensure that these conversions do not escape.

dsnet commented 2 years ago

I disagree with the notion that string and bytes differ based on their handling of UTF-8. As things stand today, bytes already has some understanding of UTF-8 in some functions, but not in others. Thus, whether UTF-8 is considered is not package-specific, but rather function-specific (and appropriately documented in each case). The distinction between the two packages is just that bytes deals with []byte and strings deals with string.

gazerro commented 2 years ago

Another point of view is to see how CommonPrefix would be used in the examples in the standard library.

In the following examples, at least one of the two strings can only contain ASCII characters. So the returned slice always contains only ASCII characters, no matter if CommonPrefix works on arbitrary bytes or UTF-8:

In the following example, CommonPrefix must work on arbitrary bytes in order to be used:

In the following example, CommonPrefix is preferable to work on arbitrary bytes, but is not necessary:

For the following, however, CommonPrefix cannot be used because they need a particular function:

rsc commented 2 years ago

There's clearly no consensus here, and we've made it 10+ years without this function. It sounds like we should probably decline the proposal.

icholy commented 2 years ago

I think this should be re-opened as an addition to exp/slices.

rsc commented 2 years ago

Based on the discussion above, this proposal seems like a likely decline. — rsc for the proposal review group

rsc commented 2 years ago

No change in consensus, so declined. — rsc for the proposal review group

valyala commented 1 year ago

It looks like the UTF-8 vs bytes discussion missed the main point for adding the CommonPrefixLen() function to bytes package - performance for text processing and compression algorithms. It is trivial to write the function in Go, but it isn't trivial to optimize it for performance, especially on multiple GOARCHs. So it would be great to have the optimized function in standard library in the same way as bytes.Equal() is optimized now.

As for the utf-8 vs bytes semantics, it is better not implementing this function in strings package because the performance isn't usually critical for utf-8 processing. So it is better to write a trivial function every time the common prefix for utf-8-encoded strings is needed.