Closed kpym closed 1 year ago
CC @ianlancetaylor via https://dev.golang.org/owners
Any algorithm is going to have cases where it is better or worse. The question is how to get good average performance for typical cases.
How does your implementation fare on the benchmarks that are already in the bytes package?
The current bytes.IndexAny algorithm is somewhat optimized for the case where the characters being searched for (chars) are all ASCII. It may be that we can do better for cases where that is not true.
@ianlancetaylor
Any algorithm is going to have cases where it is better or worse. The question is how to get good average performance for typical cases.
I agree. When I searched for bytes.IndexAny on GitHub and in the standard library, it looks like it is used almost exclusively with a small set of ASCII separators (e.g. "\r\n", " \t\r\n" or " ,;\r\n").

How does your implementation fare on the benchmarks that are already in the bytes package?
I didn't find any benchmarks for indexAny in the standard library. Am I missing something?
But in all benchmarks with a small number of separators, even if they are all ASCII, the naive version is (much) faster, except when the first occurrence is really close to the beginning (less than 4); in that case they're very similar. And as the distance to the first occurrence increases, so does the advantage of the naive version.
The current bytes.IndexAny algorithm is somewhat optimized for the case where the characters being searched for (chars) are all ASCII. It may be that we can do better for cases where that is not true.
I'm really surprised by what you say, because I discovered this naive version precisely because I wanted to handle only the case of ASCII separators, and I was surprised that in indexAny the separator set is a string and not a []byte. And indexAnyByte (or indexByteAny) is missing!
If we only want to handle ASCII delimiters, the following []byte version is even faster:
func IndexAnyByte(s, chars []byte) int {
	min := -1
	for _, c := range chars {
		// uint(-1) is the maximum uint, so the initial "not found"
		// value of min loses against any index actually returned.
		if i := bytes.IndexByte(s, c); uint(i) < uint(min) {
			min = i
		}
	}
	return min
}
I really don't understand the choice of the string parameter in bytes.IndexAny (vs. strings.IndexAny). Why bother with all the Unicode complexity in this case? Anyway, that's not the point of this issue, and we can't change it now (for backward compatibility reasons).
I didn't find any benchmarks for indexAny in the standard library. Am I missing something?
go test -test.bench=BenchmarkIndexAny bytes
But in all benchmarks with a small number of separators, even if they are all ASCII, the naive version is (much) faster.
This surprises me. The current algorithm only looks through the string once. Do you have an explanation for why it is faster to look through it multiple times? Is this a memory caching effect?
I got curious, so I'm running the benchmarks in the bytes package with both implementations. I'll post the results back here.
Here's what I got:
bytes $ benchstat baseline.txt naive.txt
name old time/op new time/op delta
IndexAnyASCII/1:1-8 6.04ns ± 1% 8.20ns ± 1% +35.81% (p=0.008 n=5+5)
IndexAnyASCII/1:2-8 6.03ns ± 1% 13.19ns ± 0% +118.85% (p=0.008 n=5+5)
IndexAnyASCII/1:4-8 6.02ns ± 1% 23.26ns ± 1% +286.17% (p=0.008 n=5+5)
IndexAnyASCII/1:8-8 6.02ns ± 0% 43.77ns ± 1% +627.49% (p=0.008 n=5+5)
IndexAnyASCII/1:16-8 6.29ns ± 1% 84.67ns ± 1% +1245.89% (p=0.008 n=5+5)
IndexAnyASCII/1:32-8 7.56ns ± 2% 177.14ns ± 0% +2243.25% (p=0.008 n=5+5)
IndexAnyASCII/1:64-8 6.00ns ± 1% 338.48ns ± 1% +5544.53% (p=0.008 n=5+5)
IndexAnyASCII/16:1-8 7.58ns ± 1% 8.47ns ± 1% +11.83% (p=0.008 n=5+5)
IndexAnyASCII/16:2-8 19.2ns ± 1% 13.8ns ± 0% -28.51% (p=0.008 n=5+5)
IndexAnyASCII/16:4-8 22.6ns ± 1% 24.4ns ± 0% +8.18% (p=0.008 n=5+5)
IndexAnyASCII/16:8-8 28.4ns ± 1% 45.7ns ± 1% +61.00% (p=0.008 n=5+5)
IndexAnyASCII/16:16-8 41.8ns ± 0% 88.6ns ± 0% +111.92% (p=0.008 n=5+5)
IndexAnyASCII/16:32-8 57.6ns ± 0% 175.7ns ± 2% +204.86% (p=0.008 n=5+5)
IndexAnyASCII/16:64-8 145ns ± 1% 353ns ± 0% +143.91% (p=0.008 n=5+5)
IndexAnyASCII/256:1-8 9.64ns ± 1% 10.44ns ± 1% +8.39% (p=0.008 n=5+5)
IndexAnyASCII/256:2-8 182ns ± 1% 18ns ± 1% -90.27% (p=0.008 n=5+5)
IndexAnyASCII/256:4-8 188ns ± 1% 32ns ± 1% -82.74% (p=0.008 n=5+5)
IndexAnyASCII/256:8-8 198ns ± 0% 62ns ± 2% -68.74% (p=0.008 n=5+5)
IndexAnyASCII/256:16-8 207ns ± 4% 130ns ± 1% -37.37% (p=0.008 n=5+5)
IndexAnyASCII/256:32-8 216ns ± 1% 249ns ± 1% +15.24% (p=0.008 n=5+5)
IndexAnyASCII/256:64-8 306ns ± 2% 483ns ± 2% +57.72% (p=0.008 n=5+5)
IndexAnyUTF8/1:1-8 5.67ns ± 1% 12.37ns ± 2% +118.07% (p=0.008 n=5+5)
IndexAnyUTF8/1:2-8 5.69ns ± 0% 21.32ns ± 0% +274.51% (p=0.008 n=5+5)
IndexAnyUTF8/1:4-8 5.67ns ± 0% 23.92ns ± 1% +321.79% (p=0.008 n=5+5)
IndexAnyUTF8/1:8-8 5.66ns ± 1% 44.28ns ± 1% +682.79% (p=0.008 n=5+5)
IndexAnyUTF8/1:16-8 5.94ns ± 0% 69.25ns ± 1% +1065.47% (p=0.008 n=5+5)
IndexAnyUTF8/1:32-8 7.19ns ± 0% 155.06ns ± 1% +2055.29% (p=0.008 n=5+5)
IndexAnyUTF8/1:64-8 5.68ns ± 0% 300.56ns ± 1% +5195.46% (p=0.008 n=5+5)
IndexAnyUTF8/16:1-8 57.1ns ± 1% 61.3ns ± 1% +7.43% (p=0.008 n=5+5)
IndexAnyUTF8/16:2-8 65.0ns ± 0% 119.9ns ± 1% +84.53% (p=0.008 n=5+5)
IndexAnyUTF8/16:4-8 65.1ns ± 1% 76.4ns ± 0% +17.40% (p=0.008 n=5+5)
IndexAnyUTF8/16:8-8 65.3ns ± 1% 148.9ns ± 1% +128.23% (p=0.008 n=5+5)
IndexAnyUTF8/16:16-8 64.0ns ± 0% 91.8ns ± 0% +43.45% (p=0.008 n=5+5)
IndexAnyUTF8/16:32-8 83.8ns ± 0% 277.9ns ± 1% +231.56% (p=0.008 n=5+5)
IndexAnyUTF8/16:64-8 66.2ns ± 0% 414.2ns ± 1% +525.28% (p=0.008 n=5+5)
IndexAnyUTF8/256:1-8 838ns ± 1% 838ns ± 2% ~ (p=1.000 n=5+5)
IndexAnyUTF8/256:2-8 973ns ± 1% 1704ns ± 2% +75.16% (p=0.008 n=5+5)
IndexAnyUTF8/256:4-8 973ns ± 1% 861ns ± 1% -11.53% (p=0.008 n=5+5)
IndexAnyUTF8/256:8-8 971ns ± 0% 1726ns ± 1% +77.80% (p=0.008 n=5+5)
IndexAnyUTF8/256:16-8 970ns ± 0% 109ns ± 1% -88.77% (p=0.008 n=5+5)
IndexAnyUTF8/256:32-8 1.29µs ± 0% 1.90µs ± 1% +47.48% (p=0.008 n=5+5)
IndexAnyUTF8/256:64-8 1.01µs ± 1% 1.29µs ± 1% +28.61% (p=0.008 n=5+5)
This is pretty interesting, assuming it's reproducible. Largely the implementation in the bytes package is faster (sometimes dramatically so). But there are a number of cases where the naive version does better. Picking them out:
IndexAnyASCII/16:2-8 19.2ns ± 1% 13.8ns ± 0% -28.51% (p=0.008 n=5+5)
IndexAnyASCII/256:2-8 182ns ± 1% 18ns ± 1% -90.27% (p=0.008 n=5+5)
IndexAnyASCII/256:4-8 188ns ± 1% 32ns ± 1% -82.74% (p=0.008 n=5+5)
IndexAnyASCII/256:8-8 198ns ± 0% 62ns ± 2% -68.74% (p=0.008 n=5+5)
IndexAnyASCII/256:16-8 207ns ± 4% 130ns ± 1% -37.37% (p=0.008 n=5+5)
IndexAnyUTF8/256:4-8 973ns ± 1% 861ns ± 1% -11.53% (p=0.008 n=5+5)
IndexAnyUTF8/256:16-8 970ns ± 0% 109ns ± 1% -88.77% (p=0.008 n=5+5)
I don't know what's particularly interesting about these cases.
Benchmark platform:
goos: linux
goarch: amd64
cpu: AMD EPYC 7B12
@mknyszek I have the impression that the cases where bytes.IndexAny is faster are the "exceptional" cases for which the naive version is not optimized: when the data is really short (shorter than, or comparable with, the number of separators). The cases where the naive version wins are those where the data size is much larger than the number of separators, it seems to me. And that feels closer to actual use.
@ianlancetaylor
This surprises me. The current algorithm only looks through the string once. Do you have an explanation for why it is faster to look through it multiple times? Is this a memory caching effect?
In any case we have a double loop, over the separators and over the data. Let n be the number of separators and m be the position of the first occurrence of a separator in the data. Theoretically, without taking optimizations into account, especially if n < m, it's in our interest to loop over the delimiters first: if the matching delimiter is on average at position n/2 in the separator list, we avoid roughly half of the n*m comparisons, while in the other direction we avoid only about n/2 of the n*m comparisons.
I don't think this explains the dominance of the naive version in my benchmarks, since I put the "found" separator in the last position.
A possible explanation for the advantage of the naive version, in situations where the number of separators is smaller than the first occurrence, is that bytes.IndexByte (and bytes.IndexRune) are highly optimized. The "naive" version keeps the "short" loop over the separators and replaces the "long" loop over the data with this optimized search.
Notes:
I just realized last night that I was wrong in my previous message and that the benchmarks, both mine and the two in the standard library, are no good. They all suffer from the same problem: it's not representative to consider only cases where the separator is absent from the data or is in the last position.
Let's consider that the separator at position i < n is at position j < m in the data, and that the other separators are not present in the data. If we loop on the separators first, we'll have to make about i*m comparisons, whereas if we loop on the data first, we can stop after about n*j comparisons.
The situation is actually much more complicated, as the positions of all the separators come into play. We can try to make theoretical predictions by approximating the behavior of j by a Poisson distribution and that of i by a uniform distribution, for example, but it seems much simpler to me to design realistic benchmarks than to theorize about the question.
I think a realistic benchmark should limit itself to 2, 3 or 4 separators (which seem to be the use cases on GitHub) and to data that is scanned to find all the separators one after the other. I'll try to design a more realistic benchmark and see how it goes.
Using the following benchmark
func BenchmarkIndexAnyCSV(b *testing.B) {
	// A CSV dataset with 7 columns (of different lengths) and 10 rows.
	// Some of the fields (gender) are empty.
	// We will check with this data repeated 1, 10 and 100 times.
	// This data is fake and was generated with https://www.convertcsv.com/generate-test-data.htm.
	var csvdata = []byte(`seq,name,adress,email,birthdate,gender,phone
1,"May Moody",adress,gofpuwvi@wolamme.gp,08/03/2014,F,(313) 321-8728
2,"Adele Martinez",adress,nohvahac@du.ai,02/09/1993,,(785) 834-8091
3,"Zachary Brown",adress,bo@covih.yt,07/17/1901,M,(308) 643-7802
4,"Lucy Lindsey",adress,sa@oh.fi,04/22/2023,F,(375) 395-3248
5,"Lucy Oliver",adress,gemjivefa@ha.bd,07/21/1995,M,(422) 714-5921
6,"Hilda Mullins",adress,kiujma@bok.kh,05/13/1922,,(716) 645-7631
7,"Mollie Parker",adress,kir@kahamrew.io,03/18/1960,M,(534) 818-5916
8,"Katie Adkins",adress,konocizah@jibfu.et,02/01/1965,F,(828) 544-2687
9,"Maurice Carroll",adress,fosmutlo@erpewi.dm,05/01/1958,M,(478) 249-3000
10,"Lettie Harvey",adress,lircuged@pe.bs,07/04/2008,,(571) 628-2216`)
	// We will check with the first 2, 3 and 4 delimiters of the following list.
	// The 4th is not present in the data.
	var delimiters = ",\n\";"
	for _, k := range []int{1, 10, 100} {
		for _, j := range []int{2, 3, 4} {
			b.Run(fmt.Sprintf("%d:%d", k, j), func(b *testing.B) {
				data := bytes.Repeat(csvdata, k)
				separators := delimiters[:j]
				for i := 0; i < b.N; i++ {
					// Scan the whole data, finding every separator in turn.
					p, q := 0, 0
					for {
						if q = IndexAny(data[p:], separators); q == -1 {
							break
						}
						p += q + 1
					}
				}
			})
		}
	}
}
I get the following results
$ benchstat standard.txt naive.txt
name old time/op new time/op delta
IndexAnyCSV/1:2-8 1.42µs ± 2% 1.30µs ± 1% -8.29% (p=0.029 n=4+4)
IndexAnyCSV/1:3-8 1.68µs ± 2% 2.34µs ± 1% +38.98% (p=0.029 n=4+4)
IndexAnyCSV/1:4-8 1.79µs ± 2% 4.13µs ± 2% +130.71% (p=0.029 n=4+4)
IndexAnyCSV/10:2-8 13.9µs ± 4% 14.0µs ± 4% ~ (p=0.886 n=4+4)
IndexAnyCSV/10:3-8 16.4µs ± 2% 24.1µs ± 1% +46.48% (p=0.029 n=4+4)
IndexAnyCSV/10:4-8 17.8µs ± 2% 119.0µs ± 1% +566.80% (p=0.029 n=4+4)
IndexAnyCSV/100:2-8 139µs ± 4% 138µs ± 5% ~ (p=0.686 n=4+4)
IndexAnyCSV/100:3-8 164µs ± 2% 243µs ± 2% +48.53% (p=0.029 n=4+4)
IndexAnyCSV/100:4-8 179µs ± 2% 11008µs ± 3% +6036.16% (p=0.029 n=4+4)
My conclusion is that the current version is better than the naive version in real-life situations, so there's no need to look any further. On the other hand, I think we should add a more realistic benchmark (like this one, for example) to the standard library (for both bytes and strings).
FTR, we try to aggregate more realistic benchmarks in golang.org/x/benchmarks. These run continuously in CI (https://perf.golang.org/dashboard). If you write a benchmark that is executable with go get, we can consider referencing it in cmd/bent (https://cs.opensource.google/go/x/benchmarks/+/master:cmd/bent/configs/suites.toml). If you have an even larger, more complex text processing benchmark, it might be worth adding to sweet.
It sounds like there might not be anything left to do in this issue? Closing for now, please let me know if you disagree!
@kpym Thanks for the investigation.
+1, thanks for the investigation!
Just for the sake of completeness, here is a better "naive" version, which is still slower than the standard bytes.IndexAny for the CSV benchmark in general, but seems to be faster for two delimiters. This version avoids looking beyond the current minimum:
func IndexAny(s []byte, chars string) int {
	min, cut := -1, len(s)
	for _, c := range chars {
		// Only search up to the best index found so far; each hit
		// shortens the slice scanned for the remaining separators.
		if i := bytes.IndexRune(s[:cut], c); uint(i) < uint(min) {
			min = i
			cut = i
		}
	}
	return min
}
name old time/op new time/op delta
IndexAnyCSV/1:2-8 1.46µs ± 9% 1.32µs ± 2% -9.37% (p=0.029 n=4+4)
IndexAnyCSV/1:3-8 1.66µs ± 6% 2.17µs ± 2% +30.53% (p=0.029 n=4+4)
IndexAnyCSV/1:4-8 1.74µs ± 2% 2.75µs ± 3% +58.16% (p=0.029 n=4+4)
IndexAnyCSV/10:2-8 13.6µs ± 1% 13.6µs ± 1% ~ (p=0.486 n=4+4)
IndexAnyCSV/10:3-8 16.8µs ± 2% 21.8µs ± 1% +30.08% (p=0.029 n=4+4)
IndexAnyCSV/10:4-8 18.2µs ± 1% 27.4µs ± 2% +50.51% (p=0.029 n=4+4)
IndexAnyCSV/100:2-8 146µs ± 5% 137µs ± 2% -5.96% (p=0.029 n=4+4)
IndexAnyCSV/100:3-8 165µs ± 1% 221µs ± 2% +34.37% (p=0.029 n=4+4)
IndexAnyCSV/100:4-8 175µs ± 0% 285µs ± 4% +62.70% (p=0.029 n=4+4)
What version of Go are you using (go version)?

What operating system and processor architecture are you using (go env)?

go env Output

What did you do?

I consider the following naive implementation of bytes.IndexAny. I benchmark it for 10 delimiters on data where the first delimiter appears at position 1000.

What did you expect to see?

That bytes.IndexAny is faster than this "naive" IndexAny.

What did you see instead?

This "naive" implementation is more than 10 times faster than the standard bytes.IndexAny. Am I doing something wrong? Is this relevant?

Remark: if we use a lot of delimiters (1000), the standard function is 30% better (on my computer).