I think this is a rarely occurring optimization opportunity. That said, the cost of detecting it, paid in all the other, more common cases, is possibly high.
Any real world examples which would be helped by this?
I believe that the cost of runtime.memmove() detecting that the source and destination addresses are the same is minimal (one or two instructions): barely noticeable, and only for extremely small slices. Then again, you are right, it is rarely occurring.
I doubt

```go
cap(a) >= len(a)+len(b) && &a[:len(a)+1][len(a)] == &b[0]
```

can compile to one or two instructions.
I agree that this seems super rare and not worth the code to check for it. If you know you're in this situation (or might be), you can check yourself and reslice to avoid the copy. Closing.
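A user-level version of that check-and-reslice can be packaged as a helper; this is essentially the expression above wrapped in a function (the name appendAdjacent is hypothetical):

```go
// appendAdjacent is a hypothetical helper: if b directly follows a
// in the same backing array and a has room, extend a by reslicing
// instead of copying; otherwise fall back to a normal append.
func appendAdjacent(a, b []byte) []byte {
	if len(b) > 0 && cap(a)-len(a) >= len(b) &&
		&a[:len(a)+1][len(a)] == &b[0] {
		return a[:len(a)+len(b)] // same bytes, no copy
	}
	return append(a, b...)
}
```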
@cznic: I don't think you have to do exactly this. That expression is purely for demonstration purposes (the best I could do with "user code"). All you'd actually have to do is add a `CMPQ DI, SI; JEQ move_0` (or something similar; I'm not familiar with the Go assembler) to the runtime.memmove() implementation. If I'm not mistaken.
Right, memmove could do a pretty simple check. But we're not going to add that absent some real code which cares.
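For illustration, that simple check amounts to an early return when the two pointers coincide. A minimal sketch only: the real runtime.memmove is written in assembly, and this toy version just delegates to copy (unsafe.Slice, from later Go versions, is used purely for brevity):

```go
package main

import "unsafe"

// memmoveSketch is a toy stand-in for runtime.memmove, showing where
// the proposed self-move check would sit.
func memmoveSketch(dst, src unsafe.Pointer, n uintptr) {
	if dst == src {
		return // source and destination coincide: nothing to move
	}
	// copy already handles overlapping regions correctly.
	copy(unsafe.Slice((*byte)(dst), n), unsafe.Slice((*byte)(src), n))
}
```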
I was thinking of code along the lines of this:
```go
// keep and drop are assumed helpers: keep reports how many leading
// bytes of src to retain, drop reports how many to skip.
parsed := src[0:0]
for len(src) > 0 {
	n := keep(src)
	parsed = append(parsed, src[:n]...)
	src = src[n:]
	n = drop(src)
	src = src[n:]
}
```
Arguably this may not be the best way to code something like this, but it is not totally unnatural. In this case, if drop() returns 0, you should not have to pay for the extra copy... Anyway, as you said, it can be avoided.
I just thought I'd point-out the issue (and the fact that it can easily be fixed). Whether something should be done about it, is a judgement call, and is up to you...
Fixed a couple of typos in the example code of my previous comment
You could write it like this:
```go
// r and w are read and write cursors for src.
r, w := 0, 0
for r < len(src) {
	n := keep(src[r:])
	if r != w {
		copy(src[w:], src[r:r+n])
	}
	r, w = r+n, w+n
	r += drop(src[r:])
}
parsed := src[:w]
```
@nigeltao Yes, of course I could. And there are probably other ways as well. The thing is, the code I posted was the first thing that came to my mind (instinctively). And I immediately thought: "I guess append() will be clever enough to avoid the copy and just do a re-slice (extend)." I checked, and it turns out it isn't. Which was a bit surprising, especially considering that the cost of detecting the (uncommon) special case turns out to be negligible. I think (and this is a judgement call) that doing "the right thing" in this case justifies the (almost negligible) cost.

@nigeltao Also, in my original case, parsed was passed as an argument, and it could, or could not, "coincide" with src. Yes, you could also handle this with some extra code, but that is beside the point... (it's the "asymmetry" that matters).
So redundant compiler/runtime checks are fine, but this extra check is a concern? Poppycock!
Whether the cost is negligible or not is not the only deciding factor for adding a special case.
If the case only happens one in a million times, then no matter how cheap the optimization is, we shouldn't add it.
And the optimization cost is non-negligible anyway, as it will cost at least one more entry in the branch-predictor tables.
That being said, I instrumented the memmove routine and found that runtime.typedmemmove actually does trigger runtime.memmove with src == dest. Perhaps it's worthwhile to investigate whether we can eliminate such unnecessary memmove calls.
Self-memmove is not rare during all.bash, but the size is almost never larger than 0x20 (32) bytes.
Most of the cases come from math/big, where the API always takes a result parameter, and in a lot of cases, the final step is to copy the result into the result parameter.
We will need a more extensive benchmark analysis to see whether memmove should special case this.
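To make that math/big pattern concrete (an illustration, not code from this thread): the receiver doubles as the result parameter, so when it aliases an operand, the final copy-result-into-receiver step can plausibly become a self-move:

```go
package main

import (
	"fmt"
	"math/big"
)

func main() {
	x := big.NewInt(3)
	z := new(big.Int)
	z.Add(x, x) // z = 6; the result is copied into the receiver z
	z.Add(z, z) // receiver aliases both operands: the internal
	//             "copy result into receiver" step may move bytes
	//             onto themselves
	fmt.Println(z) // 12
}
```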
Ah, I see! The same memmove is used in many cases other than slice-copy, slice-append, and the like. A lot of them may be very small moves, where the cost of adding the check could be considerable (or maybe not). We could consider adding the check only for slice appends, which is more complicated than adding a simple compare to memmove. In that case, though, it would mean adding a couple of extra instructions at every append call site... not an obvious gain.
What if we add the test to memmove for all but the smallest sizes? Just a thought.
Adding the test above a size threshold would certainly bound the overhead of the check.
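Building on the earlier sketch, the threshold idea might look like this (the cutoff value is an assumption; picking it is exactly what the benchmark analysis would have to decide):

```go
// Hypothetical: only pay for the self-move check above a size
// threshold, so tiny moves keep their current fast path.
const selfMoveThreshold = 64 // bytes; assumed value, needs tuning

func memmoveSketch(dst, src unsafe.Pointer, n uintptr) {
	if n >= selfMoveThreshold && dst == src {
		return // large self-move: skip the copy entirely
	}
	copy(unsafe.Slice((*byte)(dst), n), unsafe.Slice((*byte)(src), n))
}
```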
What version of Go are you using (`go version`)?

go version go1.7.3 linux/amd64
What operating system and processor architecture are you using (`go env`)?

```
GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/home/npat"
GORACE=""
GOROOT="/usr/local/go"
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
CC="gcc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build296609692=/tmp/go-build -gno-record-gcc-switches"
CXX="g++"
CGO_ENABLED="1"
```
What did you do?

It seems that the built-in append does not check for a situation like the following (appending consecutive slices with the same underlying array), and needlessly copies data when it could avoid it. If I'm following the breadcrumbs correctly, they lead to the fact that runtime.memmove() does not short-circuit when the source and destination addresses are the same (I may be wrong, though). Running the simple benchmarks at https://gist.github.com/npat-efault/f055654633f43d0ef771d38657fd499e for any value of N demonstrates this.
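(For readers without the gist at hand, here is a minimal reproduction along the same lines, not the gist's exact code: b occupies the second half of a's backing array, so append could in principle just reslice, yet its running time grows with N.)

```go
package selfappend

import "testing"

const N = 1 << 16 // size of the appended slice; vary to see the effect

func BenchmarkSelfAppend(bm *testing.B) {
	buf := make([]byte, 2*N)
	for i := 0; i < bm.N; i++ {
		a := buf[:N] // first half; cap reaches to the end of buf
		b := buf[N:] // second half, directly following a
		_ = append(a, b...) // copies b onto itself, byte for byte
	}
}
```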
What did you expect to see?

Built-in append detecting the trivial case and behaving (performance-wise) like appendBytes above, with running time independent of slice b's size.

What did you see instead?

Running time increases with slice b's size.