shepmaster / jetscii

A tiny library to efficiently search strings for sets of ASCII characters and byte slices for sets of bytes.
Apache License 2.0
113 stars 20 forks source link

Improve substring search performance for corner cases #11

Open bluss opened 9 years ago

bluss commented 9 years ago

I've subjected jetscii 0.3.1 to a benchmark suite of just a few pathological examples for substring search.

_jetscii benches very well_, there's just evidence of pathlogical cases

Result (may be noisy, i.e. individual benchmarks may be flukes)

test bad_naive::find                                      ... bench:         274 ns/iter (+/- 19) = 270 MB/s
test bad_naive::jetscii_find                              ... bench:         504 ns/iter (+/- 118) = 146 MB/s
test bad_naive::pcmp_find                                 ... bench:          96 ns/iter (+/- 1) = 770 MB/s
test bad_naive::regex_find                                ... bench:       1,077 ns/iter (+/- 89) = 68 MB/s
test bad_naive::rfind                                     ... bench:         231 ns/iter (+/- 16) = 320 MB/s
test bad_naive_reversed::find                             ... bench:         175 ns/iter (+/- 9) = 422 MB/s
test bad_naive_reversed::jetscii_find                     ... bench:          54 ns/iter (+/- 4) = 1370 MB/s
test bad_naive_reversed::pcmp_find                        ... bench:          43 ns/iter (+/- 4) = 1720 MB/s
test bad_naive_reversed::regex_find                       ... bench:          36 ns/iter (+/- 3) = 2055 MB/s
test bad_naive_reversed::rfind                            ... bench:         367 ns/iter (+/- 54) = 201 MB/s
test pathological_aa_100k::find                           ... bench:       3,084 ns/iter (+/- 149) = 32425 MB/s
test pathological_aa_100k::jetscii_find                   ... bench:      24,116 ns/iter (+/- 1,041) = 4146 MB/s
test pathological_aa_100k::pcmp_find                      ... bench:      27,241 ns/iter (+/- 2,031) = 3670 MB/s
test pathological_aa_100k::regex_find                     ... bench:       4,453 ns/iter (+/- 262) = 22456 MB/s
test pathological_aa_100k::rfind                          ... bench:       2,824 ns/iter (+/- 207) = 35410 MB/s
test pathological_aab_100k::find                          ... bench:     889,518 ns/iter (+/- 55,787) = 337 MB/s
test pathological_aab_100k::jetscii_find                  ... bench:     265,300 ns/iter (+/- 15,265) = 1130 MB/s
test pathological_aab_100k::pcmp_find                     ... bench:     183,135 ns/iter (+/- 11,808) = 1638 MB/s
test pathological_aab_100k::regex_find                    ... bench:   3,750,744 ns/iter (+/- 392,444) = 79 MB/s
test pathological_aab_100k::rfind                         ... bench:     859,327 ns/iter (+/- 104,678) = 348 MB/s
test pathological_two_way_10k::find                       ... bench:     107,896 ns/iter (+/- 48,497) = 278 MB/s
test pathological_two_way_10k::jetscii_find               ... bench:       7,406 ns/iter (+/- 1,072) = 4050 MB/s
test pathological_two_way_10k::pcmp_find                  ... bench:      19,444 ns/iter (+/- 22,302) = 1542 MB/s
test pathological_two_way_10k::regex_find                 ... bench:       1,329 ns/iter (+/- 2,413) = 22573 MB/s
test pathological_two_way_10k::rfind                      ... bench:       2,463 ns/iter (+/- 4,146) = 12180 MB/s
test pathological_two_way_20k::find                       ... bench:     305,465 ns/iter (+/- 505,176) = 327 MB/s
test pathological_two_way_20k::jetscii_find               ... bench:      25,179 ns/iter (+/- 39,142) = 3971 MB/s
test pathological_two_way_20k::pcmp_find                  ... bench:      64,983 ns/iter (+/- 113,250) = 1538 MB/s
test pathological_two_way_20k::regex_find                 ... bench:       6,175 ns/iter (+/- 12,215) = 16194 MB/s
test pathological_two_way_20k::rfind                      ... bench:       3,673 ns/iter (+/- 6,390) = 27225 MB/s
test pathological_two_way_20k_reversed::find              ... bench:       2,492 ns/iter (+/- 4,523) = 24077 MB/s
test pathological_two_way_20k_reversed::jetscii_find      ... bench:      45,084 ns/iter (+/- 80,989) = 1330 MB/s
test pathological_two_way_20k_reversed::pcmp_find         ... bench:      39,623 ns/iter (+/- 65,178) = 1514 MB/s
test pathological_two_way_20k_reversed::regex_find        ... bench:     406,637 ns/iter (+/- 730,079) = 147 MB/s
test pathological_two_way_20k_reversed::rfind             ... bench:     155,639 ns/iter (+/- 284,944) = 385 MB/s
test short_long::find                                     ... bench:       3,924 ns/iter (+/- 4,175) = 650 MB/s
test short_long::jetscii_find                             ... bench:         676 ns/iter (+/- 1,291) = 3773 MB/s
test short_long::pcmp_find                                ... bench:         810 ns/iter (+/- 1,531) = 3149 MB/s
test short_long::regex_find                               ... bench:       2,821 ns/iter (+/- 748) = 904 MB/s
test short_long::rfind                                    ... bench:       2,253 ns/iter (+/- 3,445) = 1132 MB/s
test short_short::find                                    ... bench:          68 ns/iter (+/- 119) = 823 MB/s
test short_short::jetscii_find                            ... bench:         109 ns/iter (+/- 127) = 513 MB/s
test short_short::pcmp_find                               ... bench:          43 ns/iter (+/- 69) = 1302 MB/s
test short_short::regex_find                              ... bench:          54 ns/iter (+/- 122) = 1037 MB/s
test short_short::rfind                                   ... bench:         125 ns/iter (+/- 160) = 448 MB/s

Conclusions:

Benchmark source, very messy crate

bluss commented 9 years ago

I realized the bad_naive testcase was a bit short -- the needle below 16 bytes. A proper testcase is a worse pathological case for jetscii!

haystack: (0..100_000).map(|_| "a").collect::<String>() needle: (0..100).map(|_| "a").collect::<String>() + "b"

test bad_naive_long::find                                 ... bench:     299,292 ns/iter (+/- 37,567) = 334 MB/s
test bad_naive_long::jetscii_find                         ... bench:  10,303,063 ns/iter (+/- 194,054) = 9 MB/s
test bad_naive_long::pcmp_find                            ... bench:      23,807 ns/iter (+/- 225) = 4200 MB/s

as consolation, regex is even worse..

shepmaster commented 9 years ago

Awesome, thanks for this! I'm certainly not surprised that there are pathological cases, but having them pointed out in test cases is even nicer.

One thing I assume will help is to not increment by one byte on a false positive. Any highlights of normal things to avoid before I read your tests and the standard lib implementation?

shepmaster commented 9 years ago

9 MB/s

Ouch. Lots to fix!

bluss commented 9 years ago

To avoid all algorithmic trouble, employ a substring search algorithm, i.e. some system to make sure it advances enough per trial. Here's an nice overview http://www-igm.univ-mlv.fr/~lecroq/string/

I don't know enough to have specific advice. I'm obviously trying to impl the two way algorithm with pcmp_find, but I'm not sure it's the best algorithm to pick for SSE4.2 substring search.

bluss commented 9 years ago

The MB/s is bytes of haystack searched per second, of course. Most of the testcases don't match, so they effectively search it all.

shepmaster commented 9 years ago

One thought is that the pcmpestrm instruction should be able to tell which byte fails strcmp, which I believe can be used to stride further.

bluss commented 9 years ago

absolutely, that's what you need to to for the two way algorithm.

Without the factorization that two-way does of the needle, I don't think you can step to the failed match position in general.

Not sure I can recommend diving into string search algorithms. :smile: My revival PR of two-way for libstd took me a week of work, and I didn't even implement it from scratch

bluss commented 9 years ago

That said, I think the pcmp_find is part way there. But I'd consider a different algorithm entirely maybe. Maybe Boyer-Moore-Horspool?

shepmaster commented 9 years ago

a different algorithm

Yeah, I'll skim through and see if anything jumps out as a key match. One nice thing is that you and I can collaborate on ideas and compete on implementations, and everyone wins!

bluss commented 9 years ago

Note that the byteset optimization is why str::find is so fast in the substring benchmark in your Readme. If you add an a anywhere in the needle, that optimization won't work.

ArtemGr commented 9 years ago

Have you seen strstr_sse42 in http://www.strchr.com/strcmp_and_strlen_using_sse_4.2?

bluss commented 9 years ago

@ArtemGr that's a quadratic algorithm too (it's a naive search)

; continue searching from the next byte

ArtemGr commented 9 years ago

I wondered if the next byte meant the one after the 15 already checked. Duh, at least I tried. )

BTW, you ever run those benchmars with native optimizations on (target=native)?

bluss commented 9 years ago

I didn't think about it. The pcmpestri instruction is emitted anyway since it's in an explicit asm!() block, so I'm not sure what could improve.

ArtemGr commented 9 years ago

The pcmpestri instruction is emitted anyway since it's in an explicit asm!() block, so I'm not sure what could improve.

The SSE version won't improve, but the non-SSE versions might. We could see how LLVM fares against manual assembly. Without "target=native" it can't compete.

bluss commented 9 years ago

That's true. Most of libcore's substring search (str::find) is bounds checked, and that often makes it impossible for the optimizer to do autovectorization. Crucially the things in the tightest loop in str::find are not very amenable to vectorization.

I think it's harder to vectorize the more advanced string search algorithm (by hand or by compiler). I've been experimenting with that, (pcmp_find) but my question is which string search algorithm is the best for this. Since I posted this my pcmp_find experiment has actually progressed, it's an improvement over str::find in every case now, but of course jetscii can beat it depending on the case.

In the easy cases, the simpler algorithm wins, in the tricky cases the string search algorithm should win.

I'm sure I've tried target-cpu=native with that at some point, and I'll do it again now.

bluss commented 9 years ago

when I run the benchmarks in the "twsearch" repo now, I see no difference in the twoway_find algorithm with native or not, unfortuately. It could probably be improved to be nicer for optimization some way, but the parts of the algorithm that are in the tightest loop (those parts here) are not so easy. The byteset check is random access of a single byte, and the "right part of needle" loop needs to count the number of bytes that matched.

ArtemGr commented 9 years ago

I hadn't much luck with "cargo rustc" so I did "cargo bench --verbose" and repeated a few "rustc" commands with "-C target-cpu=native" added. Normal: http://pastebin.com/4PGJMkSv Native: http://pastebin.com/EbJKjaVT

What I've noticed (normal vs. native):

test aaab_in_aab::twoway_rfind ... bench: 1,503,030 ns/iter (+/- 5,837) = 199 MB/s test aaab_in_aab::twoway_rfind ... bench: 857,109 ns/iter (+/- 1,914) = 349 MB/s

test pathological_two_way_rev::twoway_rfind ... bench: 308,092 ns/iter (+/- 382) = 194 MB/s test pathological_two_way_rev::twoway_rfind ... bench: 150,717 ns/iter (+/- 255) = 398 MB/s

Might be just quirks, I haven't examined the assembly. Just my two cents.

The byteset check is random access of a single byte

Yeah, I can see how this would be hard for LLVM to optimize. I rather hoped that the scratchspace contained some naive searchers that we'd see turned by LLVM into SSE instructions.

bluss commented 9 years ago

cool ok, I have to look at tht!

bluss commented 9 years ago

I'm only using one particular platform (llvm calls it corei7-avx), so of course that is kind of a one eyed view on optimization.

ArtemGr commented 9 years ago

Well, give me a shout if you want to run some Opteron-3365 benchmarks.

bluss commented 9 years ago

Ok that's exciting, with native it seems to emit some avx instructions here, but yes, I think it's in a location that's outside of the innermost search loop. Getting it to emit a vectorized loop for the first counted loop (match first half of needle) would be great.

bluss commented 9 years ago

Alright, I guess I should think about multiple different platforms and microarches at some point

dralley commented 2 years ago

@bluss If you still happen to have this benchmarking code (repo is longer available), it would be great to adapt for https://github.com/shepmaster/jetscii/issues/54