Add benchmarks to source code

shepmaster / jetscii

A tiny library to efficiently search strings for sets of ASCII characters and byte slices for sets of bytes.

Apache License 2.0


Add benchmarks to source code #54

Open dralley opened 2 years ago

dralley commented 2 years ago

The documentation shares some benchmarks, which is great. But for transparency, and also to make it easier for users to run said benchmarks on their machine and determine what works best for their hardware, it would also be useful to have the benchmarks available in this repo.

Additionally it would be great to test against crates such as memchr.

Another user had posted some code previously, but it is no longer available: https://github.com/shepmaster/jetscii/issues/11

shepmaster commented 2 years ago

They exist:

https://github.com/shepmaster/jetscii/blob/868b04c3bdd3b096664ac43168976e126f38cb38/src/lib.rs#L349-L350

cargo +nightly bench --features benchmarks

> Additionally it would be great to test against crates such as memchr.

Certainly! Feel free to add it as a dev-dependency and add it to the benchmarks.

dralley commented 2 years ago

Ah, I was looking for a separate directory as is typically done and didn't see them. Sorry for the confusion.

Quick question though. I tried to use jetscii to accelerate an XML parsing library, in particular to do escaping of text, and the results were a little disappointing as it was only 50-75% faster in the ideal case and worse on short inputs. Is that typical?

I've read that pcmpestrm is slower than pcmpistrm and that hardware makers don't tend to prioritize either of them very much, which sounds kind of unfortunate if true.

https://github.com/tafia/quick-xml/pull/408
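For readers unfamiliar with the use case, the escaping loop described above can be sketched roughly as follows. This is a minimal, std-only illustration, not the code from quick-xml or jetscii: the names `find_special` and `escape_text` are hypothetical, and the set of escaped characters (the five predefined XML entities) is an assumption. The scalar `find_special` marks the spot where a SIMD-accelerated searcher would be substituted.

```rust
use std::borrow::Cow;

/// Hypothetical helper: index of the first byte that needs escaping, if any.
/// This scalar scan is a stand-in for a SIMD-accelerated searcher.
fn find_special(bytes: &[u8]) -> Option<usize> {
    bytes
        .iter()
        .position(|b| matches!(b, b'<' | b'>' | b'&' | b'"' | b'\''))
}

/// Hypothetical helper: escapes the five predefined XML entities.
/// Returns the input unchanged (borrowed) when nothing needs escaping.
fn escape_text(input: &str) -> Cow<'_, str> {
    let bytes = input.as_bytes();
    let first = match find_special(bytes) {
        Some(i) => i,
        None => return Cow::Borrowed(input),
    };

    let mut out = String::with_capacity(input.len() + 8);
    let mut start = 0;
    let mut next = Some(first);
    while let Some(offset) = next {
        let pos = start + offset;
        // Copy the untouched run before the special byte in one go.
        out.push_str(&input[start..pos]);
        out.push_str(match bytes[pos] {
            b'<' => "&lt;",
            b'>' => "&gt;",
            b'&' => "&amp;",
            b'"' => "&quot;",
            _ => "&apos;", // only b'\'' remains
        });
        // The special bytes are ASCII, so pos + 1 is a valid char boundary.
        start = pos + 1;
        next = find_special(&bytes[start..]);
    }
    out.push_str(&input[start..]);
    Cow::Owned(out)
}
```

With this shape, the search is restarted after every special character, so inputs that are short or dense in special characters spend little time inside the search itself. That is one plausible reason a faster searcher helps less on such inputs.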

shepmaster commented 2 years ago

> as is typically done

You'll note that this repo is old and predates a number of now-common patterns. 😉

> I tried to use jetscii to accelerate an XML parsing library

That would be the reason that I created it. :-)

> only 50-75% faster in the ideal case and worse on short inputs

I'm no hardware guru, but those numbers make sense to me. The SIMD parts of the processor are "big and heavy" and use a disproportionate amount of power. Some recent processors even stopped including some units like AVX-512 for related reasons.

(Side note: "X% faster" is not the clearest way of stating performance changes. Prefer "X% of previous speed" or, even better, show absolute before and after numbers. I parse "50% faster" as going from, e.g., 100 B/sec to 150 B/sec.)
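The arithmetic behind that side note can be made explicit with a small sketch. The function names here are hypothetical and chosen only for illustration. Both take absolute throughput figures (for example, bytes per second) measured before and after a change.

```rust
/// Percent change in throughput; positive means the new code is faster.
fn percent_faster(before: f64, after: f64) -> f64 {
    (after / before - 1.0) * 100.0
}

/// New throughput expressed as a percentage of the old throughput.
fn percent_of_previous(before: f64, after: f64) -> f64 {
    after / before * 100.0
}
```

For the example in the note, going from 100 B/sec to 150 B/sec is "50% faster" and "150% of previous speed". If "50% faster" were instead read as "takes 50% less time", the same 100 bytes would take 0.5 s rather than 1 s, which is 200 B/sec. Absolute numbers avoid this ambiguity.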

> I've read that pcmpestrm is slower than pcmpistrm

I had not heard that; do you have any links to share?

> hardware makers don't tend to prioritize either of them

That wouldn't surprise me with the whole power thing.

dralley commented 2 years ago

> I had not heard that; do you have any links to share?

Yeah. Unfortunately it seems to be true. The variants that are used with C strings got all the love : /

https://uops.info/table.html

https://stackoverflow.com/questions/20935769/sse42-sttni-pcmpestrm-is-twice-slower-than-pcmpistrm-is-it-true

https://stackoverflow.com/questions/46762813/how-much-faster-are-sse4-2-string-instructions-than-sse2-for-memcmp

See also the comments from burntsushi and the Intel guy here: https://news.ycombinator.com/item?id=14422098

Dr-Emann commented 1 year ago

This should probably be closed if #57 is merged, since it allows cargo bench to work directly, and moves the benchmarks to a separate folder