huonw / hamming

Compute the Hamming weight of a vector and the Hamming distance between two efficiently
http://huonw.github.io/hamming/hamming
Apache License 2.0
7 stars 5 forks source link

Aligned approach is faster when using -C target-features=+popcnt #2

Open taralx opened 8 years ago

taralx commented 8 years ago
test weight_::benches::aligned::_0000001     ... bench:           2 ns/iter (+/- 0)
test weight_::benches::aligned::_0000010     ... bench:           6 ns/iter (+/- 0)
test weight_::benches::aligned::_0000100     ... bench:          13 ns/iter (+/- 1)
test weight_::benches::aligned::_0001000     ... bench:          41 ns/iter (+/- 1)
test weight_::benches::aligned::_0010000     ... bench:         335 ns/iter (+/- 41)
test weight_::benches::aligned::_0100000     ... bench:       3,640 ns/iter (+/- 1,346)
test weight_::benches::aligned::_1000000     ... bench:      34,018 ns/iter (+/- 5,687)
test weight_::benches::naive::_0000001       ... bench:           1 ns/iter (+/- 1)
test weight_::benches::naive::_0000010       ... bench:           7 ns/iter (+/- 1)
test weight_::benches::naive::_0000100       ... bench:          63 ns/iter (+/- 18)
test weight_::benches::naive::_0001000       ... bench:         607 ns/iter (+/- 222)
test weight_::benches::naive::_0010000       ... bench:       6,309 ns/iter (+/- 1,162)
test weight_::benches::naive::_0100000       ... bench:      64,058 ns/iter (+/- 19,585)
test weight_::benches::naive::_1000000       ... bench:     630,105 ns/iter (+/- 67,243)
test weight_::benches::weight::_0000001      ... bench:           5 ns/iter (+/- 2)
test weight_::benches::weight::_0000010      ... bench:          12 ns/iter (+/- 5)
test weight_::benches::weight::_0000100      ... bench:          65 ns/iter (+/- 2)
test weight_::benches::weight::_0001000      ... bench:         165 ns/iter (+/- 7)
test weight_::benches::weight::_0010000      ... bench:       1,478 ns/iter (+/- 78)
test weight_::benches::weight::_0100000      ... bench:      15,060 ns/iter (+/- 1,904)
test weight_::benches::weight::_1000000      ... bench:     152,544 ns/iter (+/- 62,516)
fn aligned(x: &[u8]) -> u64 {
    let size = mem::size_of::<u64>();
    let alignment = mem::align_of::<u64>();

    let ptr = x.as_ptr() as usize;
    let aligned = (ptr + alignment - 1) / alignment * alignment;
    let distance = aligned - ptr;
    if x.len() < size + distance {
        return naive(x)
    }
    let (head, middle) = x.split_at(distance);

    assert!(middle.as_ptr() as usize % alignment == 0);
    let midslice = unsafe {
        slice::from_raw_parts(middle.as_ptr() as *const u64,
                              middle.len() / size)
    };
    let tail = &middle[midslice.len() * size..];
    let count = naive(head);
    let count = midslice.iter().fold(count, |a, b| a + b.count_ones() as u64);
    count + naive(tail)
}
taralx commented 8 years ago

(This is basically like weight, but using u64 instead of T30 and count_ones instead of a complicated thing.)

huonw commented 8 years ago

Nice speed-up! Unfortunately, I can't quite reproduce such a large difference:

test weight_::benches::aligned::_0000001     ... bench:           1 ns/iter (+/- 1)
test weight_::benches::aligned::_0000010     ... bench:           5 ns/iter (+/- 0)
test weight_::benches::aligned::_0000100     ... bench:          12 ns/iter (+/- 0)
test weight_::benches::aligned::_0001000     ... bench:         102 ns/iter (+/- 1)
test weight_::benches::aligned::_0010000     ... bench:       1,063 ns/iter (+/- 51)
test weight_::benches::aligned::_0100000     ... bench:      10,611 ns/iter (+/- 170)
test weight_::benches::aligned::_1000000     ... bench:     106,142 ns/iter (+/- 2,095)
test weight_::benches::naive::_0000001       ... bench:           1 ns/iter (+/- 0)
test weight_::benches::naive::_0000010       ... bench:           7 ns/iter (+/- 0)
test weight_::benches::naive::_0000100       ... bench:          75 ns/iter (+/- 31)
test weight_::benches::naive::_0001000       ... bench:         679 ns/iter (+/- 25)
test weight_::benches::naive::_0010000       ... bench:       6,639 ns/iter (+/- 255)
test weight_::benches::naive::_0100000       ... bench:      66,877 ns/iter (+/- 2,478)
test weight_::benches::naive::_1000000       ... bench:     673,277 ns/iter (+/- 20,835)
test weight_::benches::weight::_0000001      ... bench:           4 ns/iter (+/- 1)
test weight_::benches::weight::_0000010      ... bench:          10 ns/iter (+/- 1)
test weight_::benches::weight::_0000100      ... bench:          69 ns/iter (+/- 2)
test weight_::benches::weight::_0001000      ... bench:         164 ns/iter (+/- 1)
test weight_::benches::weight::_0010000      ... bench:       1,423 ns/iter (+/- 52)
test weight_::benches::weight::_0100000      ... bench:      13,379 ns/iter (+/- 485)
test weight_::benches::weight::_1000000      ... bench:     133,132 ns/iter (+/- 6,321)

I'm... not sure the best way to handle this at the moment: we'd basically need either static detection of -C target-feature=+popcnt (needs compiler support), or dynamic detection and use of popcnt (the first part is easy enough via the various cpuid libraries that exist, but the latter is not so easy: no way to have scoped features enabled for code generation).

I suppose there could be a manual feature that people could enable (like hamming/popcnt) if they're also enabling the instruction, but this isn't as nice as doing it automatically.

huonw commented 8 years ago

Oh, I do get the bigger speed-up with -C target-cpu=native instead of just -C target-feature=+popcnt (on an intel haswell chip, i7-4900MQ):

test weight_::benches::aligned::_0000001     ... bench:           3 ns/iter (+/- 0)
test weight_::benches::aligned::_0000010     ... bench:           7 ns/iter (+/- 1)
test weight_::benches::aligned::_0000100     ... bench:          13 ns/iter (+/- 0)
test weight_::benches::aligned::_0001000     ... bench:          38 ns/iter (+/- 15)
test weight_::benches::aligned::_0010000     ... bench:         383 ns/iter (+/- 1,078)
test weight_::benches::aligned::_0100000     ... bench:       3,006 ns/iter (+/- 70)
test weight_::benches::aligned::_1000000     ... bench:      31,553 ns/iter (+/- 660)
test weight_::benches::naive::_0000001       ... bench:           2 ns/iter (+/- 0)
test weight_::benches::naive::_0000010       ... bench:           8 ns/iter (+/- 1)
test weight_::benches::naive::_0000100       ... bench:          58 ns/iter (+/- 2)
test weight_::benches::naive::_0001000       ... bench:         552 ns/iter (+/- 165)
test weight_::benches::naive::_0010000       ... bench:       5,470 ns/iter (+/- 86)
test weight_::benches::naive::_0100000       ... bench:      54,237 ns/iter (+/- 906)
test weight_::benches::naive::_1000000       ... bench:     544,888 ns/iter (+/- 19,701)
test weight_::benches::weight::_0000001      ... bench:           5 ns/iter (+/- 0)
test weight_::benches::weight::_0000010      ... bench:           9 ns/iter (+/- 0)
test weight_::benches::weight::_0000100      ... bench:          58 ns/iter (+/- 2)
test weight_::benches::weight::_0001000      ... bench:         147 ns/iter (+/- 2)
test weight_::benches::weight::_0010000      ... bench:       1,300 ns/iter (+/- 70)
test weight_::benches::weight::_0100000      ... bench:      12,199 ns/iter (+/- 49,290)
test weight_::benches::weight::_1000000      ... bench:     120,848 ns/iter (+/- 2,189)
taralx commented 8 years ago

Ah, sorry about that. I did run with native CPU, and neglected to check my assumptions.

taralx commented 8 years ago

Good to know that popcnt alone is enough to make aligned better than weight, though. Means we can just test for the feature being enabled (somehow?).

taralx commented 8 years ago

Interesting, on Intel Xeon E5520 (2.27GHz) running Ubuntu just +popcnt (no cpu directive) yields the full speedup. (I originally tested on my macbook as well.)

taralx commented 8 years ago

Even without +popcnt, we get wins doing things u64-at-a-time:

test weight_::benches::aligned::_0000001     ... bench:           7 ns/iter (+/- 0)
test weight_::benches::aligned::_0000010     ... bench:          20 ns/iter (+/- 1)
test weight_::benches::aligned::_0000100     ... bench:          60 ns/iter (+/- 2)
test weight_::benches::aligned::_0001000     ... bench:         402 ns/iter (+/- 7)
test weight_::benches::aligned::_0010000     ... bench:       3,920 ns/iter (+/- 101)
test weight_::benches::aligned::_0100000     ... bench:      39,347 ns/iter (+/- 1,200)
test weight_::benches::aligned::_1000000     ... bench:     398,171 ns/iter (+/- 7,491)
test weight_::benches::naive::_0000001       ... bench:           6 ns/iter (+/- 0)
test weight_::benches::naive::_0000010       ... bench:          32 ns/iter (+/- 1)
test weight_::benches::naive::_0000100       ... bench:         290 ns/iter (+/- 3)
test weight_::benches::naive::_0001000       ... bench:       2,913 ns/iter (+/- 39)
test weight_::benches::naive::_0010000       ... bench:      29,035 ns/iter (+/- 586)
test weight_::benches::naive::_0100000       ... bench:     290,994 ns/iter (+/- 3,047)
test weight_::benches::naive::_1000000       ... bench:   2,911,485 ns/iter (+/- 31,979)
test weight_::benches::weight::_0000001      ... bench:          10 ns/iter (+/- 1)
test weight_::benches::weight::_0000010      ... bench:          35 ns/iter (+/- 0)
test weight_::benches::weight::_0000100      ... bench:         294 ns/iter (+/- 9)
test weight_::benches::weight::_0001000      ... bench:         382 ns/iter (+/- 14)
test weight_::benches::weight::_0010000      ... bench:       3,017 ns/iter (+/- 140)
test weight_::benches::weight::_0100000      ... bench:      26,086 ns/iter (+/- 1,054)
test weight_::benches::weight::_1000000      ... bench:     258,427 ns/iter (+/- 10,218)
huonw commented 8 years ago

Means we can just test for the feature being enabled (somehow?).

Nope, the compiler support I mentioned above doesn't currently exist.

It'd definitely make sense to have a patch that makes naive == aligned (and have a byte function for comparison/the base case).