Open taralx opened 8 years ago
(This is basically like weight, but using u64 instead of T30 and count_ones instead of a complicated thing.)
Nice speed-up! Unfortunately, I can't quite reproduce such a large difference:
test weight_::benches::aligned::_0000001 ... bench: 1 ns/iter (+/- 1)
test weight_::benches::aligned::_0000010 ... bench: 5 ns/iter (+/- 0)
test weight_::benches::aligned::_0000100 ... bench: 12 ns/iter (+/- 0)
test weight_::benches::aligned::_0001000 ... bench: 102 ns/iter (+/- 1)
test weight_::benches::aligned::_0010000 ... bench: 1,063 ns/iter (+/- 51)
test weight_::benches::aligned::_0100000 ... bench: 10,611 ns/iter (+/- 170)
test weight_::benches::aligned::_1000000 ... bench: 106,142 ns/iter (+/- 2,095)
test weight_::benches::naive::_0000001 ... bench: 1 ns/iter (+/- 0)
test weight_::benches::naive::_0000010 ... bench: 7 ns/iter (+/- 0)
test weight_::benches::naive::_0000100 ... bench: 75 ns/iter (+/- 31)
test weight_::benches::naive::_0001000 ... bench: 679 ns/iter (+/- 25)
test weight_::benches::naive::_0010000 ... bench: 6,639 ns/iter (+/- 255)
test weight_::benches::naive::_0100000 ... bench: 66,877 ns/iter (+/- 2,478)
test weight_::benches::naive::_1000000 ... bench: 673,277 ns/iter (+/- 20,835)
test weight_::benches::weight::_0000001 ... bench: 4 ns/iter (+/- 1)
test weight_::benches::weight::_0000010 ... bench: 10 ns/iter (+/- 1)
test weight_::benches::weight::_0000100 ... bench: 69 ns/iter (+/- 2)
test weight_::benches::weight::_0001000 ... bench: 164 ns/iter (+/- 1)
test weight_::benches::weight::_0010000 ... bench: 1,423 ns/iter (+/- 52)
test weight_::benches::weight::_0100000 ... bench: 13,379 ns/iter (+/- 485)
test weight_::benches::weight::_1000000 ... bench: 133,132 ns/iter (+/- 6,321)
I'm... not sure the best way to handle this at the moment: we'd basically need either static detection of -C target-feature=+popcnt
(needs compiler support), or dynamic detection and use of popcnt (the first part is easy enough via the various cpuid libraries that exist, but the latter is not so easy: no way to have scoped features enabled for code generation).
I suppose there could be a manual feature that people could enable (like hamming/popcnt
) if they're also enabling the instruction, but this isn't as nice as doing it automatically.
Oh, I do get the bigger speed-up with -C target-cpu=native
instead of just -C target-feature=+popcnt
(on an intel haswell chip, i7-4900MQ):
test weight_::benches::aligned::_0000001 ... bench: 3 ns/iter (+/- 0)
test weight_::benches::aligned::_0000010 ... bench: 7 ns/iter (+/- 1)
test weight_::benches::aligned::_0000100 ... bench: 13 ns/iter (+/- 0)
test weight_::benches::aligned::_0001000 ... bench: 38 ns/iter (+/- 15)
test weight_::benches::aligned::_0010000 ... bench: 383 ns/iter (+/- 1,078)
test weight_::benches::aligned::_0100000 ... bench: 3,006 ns/iter (+/- 70)
test weight_::benches::aligned::_1000000 ... bench: 31,553 ns/iter (+/- 660)
test weight_::benches::naive::_0000001 ... bench: 2 ns/iter (+/- 0)
test weight_::benches::naive::_0000010 ... bench: 8 ns/iter (+/- 1)
test weight_::benches::naive::_0000100 ... bench: 58 ns/iter (+/- 2)
test weight_::benches::naive::_0001000 ... bench: 552 ns/iter (+/- 165)
test weight_::benches::naive::_0010000 ... bench: 5,470 ns/iter (+/- 86)
test weight_::benches::naive::_0100000 ... bench: 54,237 ns/iter (+/- 906)
test weight_::benches::naive::_1000000 ... bench: 544,888 ns/iter (+/- 19,701)
test weight_::benches::weight::_0000001 ... bench: 5 ns/iter (+/- 0)
test weight_::benches::weight::_0000010 ... bench: 9 ns/iter (+/- 0)
test weight_::benches::weight::_0000100 ... bench: 58 ns/iter (+/- 2)
test weight_::benches::weight::_0001000 ... bench: 147 ns/iter (+/- 2)
test weight_::benches::weight::_0010000 ... bench: 1,300 ns/iter (+/- 70)
test weight_::benches::weight::_0100000 ... bench: 12,199 ns/iter (+/- 49,290)
test weight_::benches::weight::_1000000 ... bench: 120,848 ns/iter (+/- 2,189)
Ah, sorry about that. I did run with native CPU, and neglected to check my assumptions.
Good to know that popcnt alone is enough to make aligned better than weight, though. Means we can just test for the feature being enabled (somehow?).
Interesting, on Intel Xeon E5520 (2.27GHz) running Ubuntu just +popcnt (no cpu directive) yields the full speedup. (I originally tested on my macbook as well.)
Even without +popcnt, we get wins doing things u64-at-a-time:
test weight_::benches::aligned::_0000001 ... bench: 7 ns/iter (+/- 0)
test weight_::benches::aligned::_0000010 ... bench: 20 ns/iter (+/- 1)
test weight_::benches::aligned::_0000100 ... bench: 60 ns/iter (+/- 2)
test weight_::benches::aligned::_0001000 ... bench: 402 ns/iter (+/- 7)
test weight_::benches::aligned::_0010000 ... bench: 3,920 ns/iter (+/- 101)
test weight_::benches::aligned::_0100000 ... bench: 39,347 ns/iter (+/- 1,200)
test weight_::benches::aligned::_1000000 ... bench: 398,171 ns/iter (+/- 7,491)
test weight_::benches::naive::_0000001 ... bench: 6 ns/iter (+/- 0)
test weight_::benches::naive::_0000010 ... bench: 32 ns/iter (+/- 1)
test weight_::benches::naive::_0000100 ... bench: 290 ns/iter (+/- 3)
test weight_::benches::naive::_0001000 ... bench: 2,913 ns/iter (+/- 39)
test weight_::benches::naive::_0010000 ... bench: 29,035 ns/iter (+/- 586)
test weight_::benches::naive::_0100000 ... bench: 290,994 ns/iter (+/- 3,047)
test weight_::benches::naive::_1000000 ... bench: 2,911,485 ns/iter (+/- 31,979)
test weight_::benches::weight::_0000001 ... bench: 10 ns/iter (+/- 1)
test weight_::benches::weight::_0000010 ... bench: 35 ns/iter (+/- 0)
test weight_::benches::weight::_0000100 ... bench: 294 ns/iter (+/- 9)
test weight_::benches::weight::_0001000 ... bench: 382 ns/iter (+/- 14)
test weight_::benches::weight::_0010000 ... bench: 3,017 ns/iter (+/- 140)
test weight_::benches::weight::_0100000 ... bench: 26,086 ns/iter (+/- 1,054)
test weight_::benches::weight::_1000000 ... bench: 258,427 ns/iter (+/- 10,218)
Means we can just test for the feature being enabled (somehow?).
Nope, the compiler support I mentioned above doesn't currently exist.
It'd definitely make sense to have a patch that makes naive
== aligned
(and have a byte
function for comparison/the base case).