golang / go

bytes: IndexAny is slower than a naive implementation in some cases #60550

Closed. kpym closed this issue 1 year ago.

kpym commented 1 year ago

What version of Go are you using (go version)?

$ go version
go version go1.20.3 windows/amd64

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
set GO111MODULE=
set GOARCH=amd64
set GOBIN=
set GOCACHE=C:\Users\Kroum\AppData\Local\go-build
set GOENV=C:\Users\Kroum\AppData\Roaming\go\env
set GOEXE=.exe
set GOEXPERIMENT=
set GOFLAGS=
set GOHOSTARCH=amd64
set GOHOSTOS=windows
set GOINSECURE=
set GOMODCACHE=D:\MyPrograms\golibs\pkg\mod
set GONOPROXY=
set GONOSUMDB=
set GOOS=windows
set GOPATH=D:\MyPrograms\golibs
set GOPRIVATE=
set GOPROXY=https://proxy.golang.org,direct
set GOROOT=C:\Program Files\Go
set GOSUMDB=sum.golang.org
set GOTMPDIR=
set GOTOOLDIR=C:\Program Files\Go\pkg\tool\windows_amd64
set GOVCS=
set GOVERSION=go1.20.3
set GCCGO=gccgo
set GOAMD64=v1
set AR=ar
set CC=gcc
set CXX=g++
set CGO_ENABLED=1
set GOMOD=D:\MyDocuments\Programming\go\test\bench\go.mod
set GOWORK=
set CGO_CFLAGS=-O2 -g
set CGO_CPPFLAGS=
set CGO_CXXFLAGS=-O2 -g
set CGO_FFLAGS=-O2 -g
set CGO_LDFLAGS=-O2 -g
set PKG_CONFIG=pkg-config
set GOGCCFLAGS=-m64 -mthreads -Wl,--no-gc-sections -fmessage-length=0 -fdebug-prefix-map=D:\Temp\System\Kroum\go-build2837998563=/tmp/go-build -gno-record-gcc-switches

What did you do?

Consider the following naive implementation of bytes.IndexAny:

func IndexAny(s []byte, chars string) int {
    min := -1
    for _, c := range chars {
        // The uint conversion makes -1 (not found / no match yet) compare as
        // the largest possible value, so min is only replaced by a smaller,
        // valid index.
        if i := bytes.IndexRune(s, c); uint(i) < uint(min) {
            min = i
        }
    }
    return min
}

I benchmark it with 10 delimiters on data where the first delimiter appears at position 1000, like this:

package indexany

import (
    "bytes"
    "strings"
    "testing"
)

func IndexAny(s []byte, chars string) int {
    min := -1
    for _, c := range chars {
        if i := bytes.IndexRune(s, c); uint(i) < uint(min) {
            min = i
        }
    }
    return min
}

const (
    t             = "🚆"
    firstchar     = 1000
    numdelimiters = 10
)

var (
    data       []byte
    separators string
)

// create the data and the separators for the benchmark
func init() {
    // data = "xxx...🚆"
    data = bytes.Repeat([]byte("x"), firstchar)
    data = append(data, t...)
    // separators = "yyy...🚆"
    separators = strings.Repeat("y", numdelimiters)
    separators = separators + t
}

func BenchmarkNaiveIndexAny(b *testing.B) {
    for i := 0; i < b.N; i++ {
        IndexAny(data, separators)
    }
}

func BenchmarkBytesIndexAny(b *testing.B) {
    for i := 0; i < b.N; i++ {
        bytes.IndexAny(data, separators)
    }
}

What did you expect to see?

That bytes.IndexAny is faster than this "naive" IndexAny.

What did you see instead?

$ go test -benchmem -bench=. .
goos: windows
goarch: amd64
pkg: kpym.github.com/gobench/indexany
cpu: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
BenchmarkNaiveIndexAny-8         3033126               522.9 ns/op             0 B/op          0 allocs/op
BenchmarkBytesIndexAny-8          223227              7233 ns/op               0 B/op          0 allocs/op
PASS
ok      kpym.github.com/gobench/indexany        4.325s

So this "naive" implementation is more than 10 times faster than the standard bytes.IndexAny.

Am I doing something wrong? Is this relevant?

Remark: if we use a lot of delimiters (1000), the standard function is about 30% better (on my computer).

mknyszek commented 1 year ago

CC @ianlancetaylor via https://dev.golang.org/owners

ianlancetaylor commented 1 year ago

Any algorithm is going to have cases where it is better or worse. The question is how to get good average performance for typical cases.

How does your implementation fare on the benchmarks that are already in the bytes package?

The current bytes.IndexAny algorithm is somewhat optimized for the case where the characters being searched for (chars) are all ASCII. It may be that we can do better for cases where that is not true.
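
For reference, that ASCII fast path is essentially a single pass over s against a 256-bit set, roughly along these lines (a simplified sketch, not the actual standard-library code; the names here are just illustrative):

package main

import "fmt"

// asciiSet is a 256-bit bitmap; bit c is set if byte value c is in the set.
type asciiSet [8]uint32

// makeASCIISet builds the set from chars and reports ok == false if chars
// contains a non-ASCII byte.
func makeASCIISet(chars string) (as asciiSet, ok bool) {
    for i := 0; i < len(chars); i++ {
        c := chars[i]
        if c >= 0x80 {
            return as, false
        }
        as[c/32] |= 1 << (c % 32)
    }
    return as, true
}

// contains reports whether byte c is in the set.
func (as *asciiSet) contains(c byte) bool {
    return (as[c/32]>>(c%32))&1 != 0
}

// indexAnyASCII scans s exactly once, testing each byte against the set.
func indexAnyASCII(s []byte, as *asciiSet) int {
    for i, c := range s {
        if as.contains(c) {
            return i
        }
    }
    return -1
}

func main() {
    if as, ok := makeASCIISet(",;\n"); ok {
        fmt.Println(indexAnyASCII([]byte("abc;def"), &as)) // prints 3
    }
}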

kpym commented 1 year ago

@ianlancetaylor

Any algorithm is going to have cases where it is better or worse. The question is how to get good average performance for typical cases.

I agree. Searching for bytes.IndexAny on GitHub and in the standard library, it looks like that is how it is used in almost all cases:

How does your implementation fare on the benchmarks that are already in the bytes package?

I didn't find any benchmarks for indexAny in the standard library. Am I missing something?

But in all benchmarks with a small number of separators, even if they are all ASCII, the naive version is (much) faster. The exception is when the first occurrence is really close to the beginning (at a position less than about 4), and in that case the two are very similar. And as the position of the first occurrence grows, so does the advantage of the naive version.

The current bytes.IndexAny algorithm is somewhat optimized for the case where the characters being searched for (chars) are all ASCII. It may be that we can do better for cases where that is not true.

I'm really surprised by what you say, because I discovered this naive version precisely because I wanted to handle only the case of ASCII separators, and I was surprised that in bytes.IndexAny the separator set is a string and not a []byte. And an IndexAnyByte (or IndexByteAny) is missing!

If we only want to handle ASCII delimiters, the following []byte version is even faster:

func IndexAnyByte(s, chars []byte) int {
    min := -1
    for _, c := range chars {
        if i := bytes.IndexByte(s, c); uint(i) < uint(min) {
            min = i
        }
    }
    return min
}

I really don't understand the choice of the string parameter in bytes.IndexAny (vs. strings.IndexAny). Why bother with all the Unicode complexity in this case? Anyway, that's not the point of this issue, and we can't change it now (for backward-compatibility reasons).

ianlancetaylor commented 1 year ago

I didn't find any benchmarks for indexAny in the standard library. Am I missing something?

go test -test.bench=BenchmarkIndexAny bytes

ianlancetaylor commented 1 year ago

But in all benchmarks with a small number of separators, even if they are all ASCII, the naive version is (much) faster.

This surprises me. The current algorithm only looks through the string once. Do you have an explanation for why it is faster to look through it multiple times? Is this a memory caching effect?

mknyszek commented 1 year ago

I got curious so I'm running the benchmarks in the bytes package with both implementations. I'll post the results back here.

mknyszek commented 1 year ago

Here's what I got:

bytes $ benchstat baseline.txt naive.txt
name                    old time/op  new time/op    delta
IndexAnyASCII/1:1-8     6.04ns ± 1%    8.20ns ± 1%    +35.81%  (p=0.008 n=5+5)
IndexAnyASCII/1:2-8     6.03ns ± 1%   13.19ns ± 0%   +118.85%  (p=0.008 n=5+5)
IndexAnyASCII/1:4-8     6.02ns ± 1%   23.26ns ± 1%   +286.17%  (p=0.008 n=5+5)
IndexAnyASCII/1:8-8     6.02ns ± 0%   43.77ns ± 1%   +627.49%  (p=0.008 n=5+5)
IndexAnyASCII/1:16-8    6.29ns ± 1%   84.67ns ± 1%  +1245.89%  (p=0.008 n=5+5)
IndexAnyASCII/1:32-8    7.56ns ± 2%  177.14ns ± 0%  +2243.25%  (p=0.008 n=5+5)
IndexAnyASCII/1:64-8    6.00ns ± 1%  338.48ns ± 1%  +5544.53%  (p=0.008 n=5+5)
IndexAnyASCII/16:1-8    7.58ns ± 1%    8.47ns ± 1%    +11.83%  (p=0.008 n=5+5)
IndexAnyASCII/16:2-8    19.2ns ± 1%    13.8ns ± 0%    -28.51%  (p=0.008 n=5+5)
IndexAnyASCII/16:4-8    22.6ns ± 1%    24.4ns ± 0%     +8.18%  (p=0.008 n=5+5)
IndexAnyASCII/16:8-8    28.4ns ± 1%    45.7ns ± 1%    +61.00%  (p=0.008 n=5+5)
IndexAnyASCII/16:16-8   41.8ns ± 0%    88.6ns ± 0%   +111.92%  (p=0.008 n=5+5)
IndexAnyASCII/16:32-8   57.6ns ± 0%   175.7ns ± 2%   +204.86%  (p=0.008 n=5+5)
IndexAnyASCII/16:64-8    145ns ± 1%     353ns ± 0%   +143.91%  (p=0.008 n=5+5)
IndexAnyASCII/256:1-8   9.64ns ± 1%   10.44ns ± 1%     +8.39%  (p=0.008 n=5+5)
IndexAnyASCII/256:2-8    182ns ± 1%      18ns ± 1%    -90.27%  (p=0.008 n=5+5)
IndexAnyASCII/256:4-8    188ns ± 1%      32ns ± 1%    -82.74%  (p=0.008 n=5+5)
IndexAnyASCII/256:8-8    198ns ± 0%      62ns ± 2%    -68.74%  (p=0.008 n=5+5)
IndexAnyASCII/256:16-8   207ns ± 4%     130ns ± 1%    -37.37%  (p=0.008 n=5+5)
IndexAnyASCII/256:32-8   216ns ± 1%     249ns ± 1%    +15.24%  (p=0.008 n=5+5)
IndexAnyASCII/256:64-8   306ns ± 2%     483ns ± 2%    +57.72%  (p=0.008 n=5+5)
IndexAnyUTF8/1:1-8      5.67ns ± 1%   12.37ns ± 2%   +118.07%  (p=0.008 n=5+5)
IndexAnyUTF8/1:2-8      5.69ns ± 0%   21.32ns ± 0%   +274.51%  (p=0.008 n=5+5)
IndexAnyUTF8/1:4-8      5.67ns ± 0%   23.92ns ± 1%   +321.79%  (p=0.008 n=5+5)
IndexAnyUTF8/1:8-8      5.66ns ± 1%   44.28ns ± 1%   +682.79%  (p=0.008 n=5+5)
IndexAnyUTF8/1:16-8     5.94ns ± 0%   69.25ns ± 1%  +1065.47%  (p=0.008 n=5+5)
IndexAnyUTF8/1:32-8     7.19ns ± 0%  155.06ns ± 1%  +2055.29%  (p=0.008 n=5+5)
IndexAnyUTF8/1:64-8     5.68ns ± 0%  300.56ns ± 1%  +5195.46%  (p=0.008 n=5+5)
IndexAnyUTF8/16:1-8     57.1ns ± 1%    61.3ns ± 1%     +7.43%  (p=0.008 n=5+5)
IndexAnyUTF8/16:2-8     65.0ns ± 0%   119.9ns ± 1%    +84.53%  (p=0.008 n=5+5)
IndexAnyUTF8/16:4-8     65.1ns ± 1%    76.4ns ± 0%    +17.40%  (p=0.008 n=5+5)
IndexAnyUTF8/16:8-8     65.3ns ± 1%   148.9ns ± 1%   +128.23%  (p=0.008 n=5+5)
IndexAnyUTF8/16:16-8    64.0ns ± 0%    91.8ns ± 0%    +43.45%  (p=0.008 n=5+5)
IndexAnyUTF8/16:32-8    83.8ns ± 0%   277.9ns ± 1%   +231.56%  (p=0.008 n=5+5)
IndexAnyUTF8/16:64-8    66.2ns ± 0%   414.2ns ± 1%   +525.28%  (p=0.008 n=5+5)
IndexAnyUTF8/256:1-8     838ns ± 1%     838ns ± 2%       ~     (p=1.000 n=5+5)
IndexAnyUTF8/256:2-8     973ns ± 1%    1704ns ± 2%    +75.16%  (p=0.008 n=5+5)
IndexAnyUTF8/256:4-8     973ns ± 1%     861ns ± 1%    -11.53%  (p=0.008 n=5+5)
IndexAnyUTF8/256:8-8     971ns ± 0%    1726ns ± 1%    +77.80%  (p=0.008 n=5+5)
IndexAnyUTF8/256:16-8    970ns ± 0%     109ns ± 1%    -88.77%  (p=0.008 n=5+5)
IndexAnyUTF8/256:32-8   1.29µs ± 0%    1.90µs ± 1%    +47.48%  (p=0.008 n=5+5)
IndexAnyUTF8/256:64-8   1.01µs ± 1%    1.29µs ± 1%    +28.61%  (p=0.008 n=5+5)

This is pretty interesting, assuming it's reproducible. Largely the implementation in the bytes package is faster (sometimes dramatically so). But there are a number of cases where the naive version does better. Picking them out:

IndexAnyASCII/16:2-8    19.2ns ± 1%    13.8ns ± 0%    -28.51%  (p=0.008 n=5+5)
IndexAnyASCII/256:2-8    182ns ± 1%      18ns ± 1%    -90.27%  (p=0.008 n=5+5)
IndexAnyASCII/256:4-8    188ns ± 1%      32ns ± 1%    -82.74%  (p=0.008 n=5+5)
IndexAnyASCII/256:8-8    198ns ± 0%      62ns ± 2%    -68.74%  (p=0.008 n=5+5)
IndexAnyASCII/256:16-8   207ns ± 4%     130ns ± 1%    -37.37%  (p=0.008 n=5+5)
IndexAnyUTF8/256:4-8     973ns ± 1%     861ns ± 1%    -11.53%  (p=0.008 n=5+5)
IndexAnyUTF8/256:16-8    970ns ± 0%     109ns ± 1%    -88.77%  (p=0.008 n=5+5)

I don't know what's particularly interesting about these cases.

Benchmark platform:

goos: linux
goarch: amd64
cpu: AMD EPYC 7B12
kpym commented 1 year ago

@mknyszek I have the impression that the cases where bytes.IndexAny is faster are the "exceptional" cases for which the naive version is not optimized: when the data is really short (shorter than, or comparable to, the number of separators).

The cases where the naive version wins, it seems to me, are those where the data size is much larger than the number of separators. And this feels closer to actual use.

kpym commented 1 year ago

@ianlancetaylor

This surprises me. The current algorithm only looks through the string once. Do you have an explanation for why it is faster to look through it multiple times? Is this a memory caching effect?

In any case we have a double loop, over the separators and over the data. Let n be the number of separators and m be the position of the first occurrence of a separator in the data.

Theoretically, without taking optimizations into account, and especially if n < m, it's in our interest to loop over the delimiters first: if the matching delimiter is on average at position n/2 in the delimiter list, we avoid half of the m*n/2 comparisons, while in the other direction we avoid only n/2 of the n*m comparisons.

I don't think this explains the dominance of the naive version in my benchmarks, since I put the "found" separator in the last position.

A possible explanation for the advantage of the naive version, in situations where the number of separators is smaller than the position of the first occurrence, is that bytes.IndexByte (and bytes.IndexRune) are highly optimized. The "naive" version takes advantage of this by keeping the "short" loop (over the separators) in Go and delegating the "long" loop (over the data) to these optimized routines.
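
For example, the gap between the two could be measured with a micro-benchmark along these lines (just an untested sketch with made-up names, comparing the assembly-backed search against a plain Go loop over the same data):

package indexany

import (
    "bytes"
    "testing"
)

var (
    // 4096 bytes of 'x' followed by the byte we search for.
    haystack = append(bytes.Repeat([]byte("x"), 4096), 'y')
    // sink prevents the compiler from discarding the result.
    sink int
)

// The optimized, assembly-backed search.
func BenchmarkIndexByteOptimized(b *testing.B) {
    for i := 0; i < b.N; i++ {
        sink = bytes.IndexByte(haystack, 'y')
    }
}

// The same search written as a plain Go byte loop.
func BenchmarkIndexByteLoop(b *testing.B) {
    for i := 0; i < b.N; i++ {
        r := -1
        for j, c := range haystack {
            if c == 'y' {
                r = j
                break
            }
        }
        sink = r
    }
}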


kpym commented 1 year ago

I just realized last night that I was wrong in my previous message and that the benchmarks, both mine and the two in the standard library, are not good. They all suffer from the same problem: it's not realistic to consider only cases where the separator is absent from the data or appears in the last position.

Let's suppose that the separator at position i < n in the separator list appears at position j < m in the data, and that the other separators are not present in the data. If we loop over the separators first, we'll have to make about i*m comparisons, whereas if we loop over the data first, we can stop after about n*j comparisons.
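
To put some purely illustrative numbers on this (not measured, just the formula above): with n = 4 separators, data of length m = 1000, and the matching separator at index i = 3 in the separator list and at position j = 10 in the data, looping over the separators first costs roughly i*m = 3000 comparisons, while looping over the data first stops after roughly n*j = 40.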

The situation is actually much more complicated, as the positions of all the separators come into play. We could try to make theoretical predictions by approximating the behavior of j with a Poisson distribution and that of i with a uniform distribution, for example, but it seems to me much simpler to design realistic benchmarks than to theorize about the question.

I think a realistic benchmark could be the following: in all cases, we limit ourselves to 2, 3 or 4 separators (which seem to be the use cases on GitHub), and the data is scanned to find all the separators one after the other. I'll try to design a more realistic benchmark along these lines and see how it goes.

kpym commented 1 year ago

Using the following benchmark

func BenchmarkIndexAnyCSV(b *testing.B) {
    // A CSV dataset with 7 columns (of different lengths) and 10 rows.
    // Some of the fields (gender) are empty.
    // We will check with this data repeated 1, 10 and 100 times.
    // This data is fake and was generated with https://www.convertcsv.com/generate-test-data.htm.
    var csvdata = []byte(`seq,name,adress,email,birthdate,gender,phone
    1,"May Moody",adress,gofpuwvi@wolamme.gp,08/03/2014,F,(313) 321-8728
    2,"Adele Martinez",adress,nohvahac@du.ai,02/09/1993,,(785) 834-8091
    3,"Zachary Brown",adress,bo@covih.yt,07/17/1901,M,(308) 643-7802
    4,"Lucy Lindsey",adress,sa@oh.fi,04/22/2023,F,(375) 395-3248
    5,"Lucy Oliver",adress,gemjivefa@ha.bd,07/21/1995,M,(422) 714-5921
    6,"Hilda Mullins",adress,kiujma@bok.kh,05/13/1922,,(716) 645-7631
    7,"Mollie Parker",adress,kir@kahamrew.io,03/18/1960,M,(534) 818-5916
    8,"Katie Adkins",adress,konocizah@jibfu.et,02/01/1965,F,(828) 544-2687
    9,"Maurice Carroll",adress,fosmutlo@erpewi.dm,05/01/1958,M,(478) 249-3000
    10,"Lettie Harvey",adress,lircuged@pe.bs,07/04/2008,,(571) 628-2216`)
    // We will check with the first 2, 3 and 4 delimiters of the following list.
    // The 4th is not present in the data.
    var delimiters = ",\n\";"

    for _, k := range []int{1, 10, 100} {
        for _, j := range []int{2, 3, 4} {
            b.Run(fmt.Sprintf("%d:%d", k, j), func(b *testing.B) {
                data := bytes.Repeat(csvdata, k)
                separators := delimiters[:j]
                for i := 0; i < b.N; i++ {
                    p, q := 0, 0
                    for {
                        if q = IndexAny(data[p:], separators); q == -1 {
                            break
                        }
                        p += q + 1
                    }
                }
            })
        }
    }
}

I get the following results

$ benchstat standard.txt naive.txt
name                 old time/op  new time/op   delta
IndexAnyCSV/1:2-8    1.42µs ± 2%   1.30µs ± 1%     -8.29%  (p=0.029 n=4+4)
IndexAnyCSV/1:3-8    1.68µs ± 2%   2.34µs ± 1%    +38.98%  (p=0.029 n=4+4)
IndexAnyCSV/1:4-8    1.79µs ± 2%   4.13µs ± 2%   +130.71%  (p=0.029 n=4+4)
IndexAnyCSV/10:2-8   13.9µs ± 4%   14.0µs ± 4%       ~     (p=0.886 n=4+4)
IndexAnyCSV/10:3-8   16.4µs ± 2%   24.1µs ± 1%    +46.48%  (p=0.029 n=4+4)
IndexAnyCSV/10:4-8   17.8µs ± 2%  119.0µs ± 1%   +566.80%  (p=0.029 n=4+4)
IndexAnyCSV/100:2-8   139µs ± 4%    138µs ± 5%       ~     (p=0.686 n=4+4)
IndexAnyCSV/100:3-8   164µs ± 2%    243µs ± 2%    +48.53%  (p=0.029 n=4+4)
IndexAnyCSV/100:4-8   179µs ± 2%  11008µs ± 3%  +6036.16%  (p=0.029 n=4+4)

My conclusion is that the current version is better than the naive version in real-life situations. So there's no need to look any further.

On the other hand, I think we need to add a more realistic benchmark (like this one, for example) to the standard library (for both bytes and strings).

mknyszek commented 1 year ago

FTR, we try to aggregate more realistic benchmarks in golang.org/x/benchmarks. These run continuously in CI (https://perf.golang.org/dashboard). If you write a benchmark that is executable with go get, we can consider referencing it in cmd/bent (https://cs.opensource.google/go/x/benchmarks/+/master:cmd/bent/configs/suites.toml). If you have an even larger, more complex text-processing benchmark, it might be worth adding it to sweet.

It sounds like there might not be anything left to do in this issue? Closing for now, please let me know if you disagree!

ianlancetaylor commented 1 year ago

@kpym Thanks for the investigation.

mknyszek commented 1 year ago

+1, thanks for the investigation!

kpym commented 1 year ago

Just for the sake of completeness, here is a better "naive" version, which is still slower than the standard bytes.IndexAny on the CSV benchmark in general, but seems to be faster for two delimiters. This version avoids searching beyond the current minimum position.

func IndexAny(s []byte, chars string) int {
    min, cut := -1, len(s)
    for _, c := range chars {
        // Only search up to the best position found so far: a later
        // separator can only improve the result if it occurs earlier.
        if i := bytes.IndexRune(s[:cut], c); uint(i) < uint(min) {
            min = i
            cut = i
        }
    }
    return min
}

The CSV benchmark results:

name                 old time/op  new time/op  delta
IndexAnyCSV/1:2-8    1.46µs ± 9%  1.32µs ± 2%   -9.37%  (p=0.029 n=4+4)
IndexAnyCSV/1:3-8    1.66µs ± 6%  2.17µs ± 2%  +30.53%  (p=0.029 n=4+4)
IndexAnyCSV/1:4-8    1.74µs ± 2%  2.75µs ± 3%  +58.16%  (p=0.029 n=4+4)
IndexAnyCSV/10:2-8   13.6µs ± 1%  13.6µs ± 1%     ~     (p=0.486 n=4+4)
IndexAnyCSV/10:3-8   16.8µs ± 2%  21.8µs ± 1%  +30.08%  (p=0.029 n=4+4)
IndexAnyCSV/10:4-8   18.2µs ± 1%  27.4µs ± 2%  +50.51%  (p=0.029 n=4+4)
IndexAnyCSV/100:2-8   146µs ± 5%   137µs ± 2%   -5.96%  (p=0.029 n=4+4)
IndexAnyCSV/100:3-8   165µs ± 1%   221µs ± 2%  +34.37%  (p=0.029 n=4+4)
IndexAnyCSV/100:4-8   175µs ± 0%   285µs ± 4%  +62.70%  (p=0.029 n=4+4)