ridiculousfish / libdivide

Official git repository for libdivide: optimized integer division
http://libdivide.com

Interesting optimization and other observations #4

Closed · Safety0ff closed this 10 years ago

Safety0ff commented 10 years ago

Hello, I'm preparing to partially port libdivide to D and have some observations about the current implementation. For one, the per-divisor struct can be expressed with bitfields:

struct libdivide_s32_t {
    uint32_t magic;      /* magic multiplier */
    uint8_t shift : 5;   /* post-shift amount, 0-31 */
    bool shiftpath : 1;  /* power-of-two divisor: take the shift-only path */
    bool add : 1;        /* quotient needs the add-indicator fixup */
    int8_t sign : 1;     /* 0 or -1; used to negate the result */
};

Note that not only is this self-documenting, but using a signed type for the sign field removes the need for casting to guarantee an arithmetic shift followed by sign extension!
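To make that second point concrete, here's a minimal sketch (the mask variables are hypothetical, purely for illustration): reading the 1-bit signed field already yields an arithmetic value, whereas a sign bit packed into a plain byte needs a cast before the shift.

#include <stdint.h>
#include <stdio.h>

struct flags {
    int8_t sign : 1;   /* stores 0 or -1, reads back sign-extended */
};

int main(void) {
    struct flags f = { .sign = -1 };
    int32_t mask_a = f.sign;             /* already -1, no cast needed */

    uint8_t more = 0x80;                 /* sign kept in the top bit of a byte */
    int32_t mask_b = (int8_t)more >> 7;  /* cast first to get an arithmetic
                                            shift (gcc/clang behavior) */

    printf("%d %d\n", mask_a, mask_b);   /* both print -1 */
    return 0;
}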

I've implemented a proof of concept in my fork [1], though it is not complete (I'd rather spend the time on my port). It gives roughly a 2x speedup. Unswitching now gives only about a 17% improvement over the non-unswitched version, so it's a debatable optimization (91% versus 89% faster than system division). I hope I didn't introduce a bug in the process: I only checked the benchmark, because modifying the tester would have required too many changes. I also added branch hints using __builtin_expect; their shape is sketched after the link below.

[1] https://github.com/Safety0ff/libdivide
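The hints themselves are just the usual macros around __builtin_expect; roughly this shape (the function and its zero check are made up for illustration, not code from either repository):

#include <stdint.h>

#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

/* hypothetical fast/slow split, just to show where the hint goes */
static inline uint32_t checked_div(uint32_t n, uint32_t d) {
    if (UNLIKELY(d == 0))
        return 0;   /* rare path, laid out away from the hot code */
    return n / d;   /* common path the compiler arranges to fall through to */
}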

P.S.: Pass-by-value would probably have been better than pass-by-pointer, though GCC's optimizer seems good enough that the choice has no impact on performance.

ridiculousfish commented 10 years ago

Exciting news!

Here are my observations on your observations:

My memory is that I tested bitfields and found some compilers did not optimize them as effectively, e.g. emitting separate loads for separate fields. I don't know to what extent that remains true. Also, signed bitfields are frightening: in an int8_t sign : 1 field the single bit is the sign bit, so it can't even store 1. IMHO it's best to avoid signed bitfields entirely unless there's a perf win.
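A quick demonstration of that bizarreness (the out-of-range store is implementation-defined; gcc and clang typically wrap):

#include <stdint.h>
#include <stdio.h>

struct s {
    int8_t sign : 1;   /* the one bit is the sign bit, so only 0 and -1 fit */
};

int main(void) {
    struct s x;
    x.sign = 1;              /* out of range for the field */
    printf("%d\n", x.sign);  /* commonly prints -1, not 1 */
    return 0;
}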

Regarding extra padding bytes: I hoped there might be no padding if the struct was embedded in a larger struct with subsequent fields:

struct { libdivide_u32_t d; uint8_t other1; uint8_t other2; ...

but it appears I was mistaken: that requires adding the packed attribute. Anyway, if you can find a speedup using those extra three bytes, great; otherwise maybe we declare the structure packed.
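Concretely, here's a sketch of the sizes involved (assuming the current layout of a 4-byte magic plus a one-byte 'more' field, on a typical ABI where uint32_t has 4-byte alignment):

#include <stdint.h>
#include <stdio.h>

struct div_plain { uint32_t magic; uint8_t more; };  /* sizeof == 8: three bytes of tail padding */
struct __attribute__((packed)) div_packed { uint32_t magic; uint8_t more; };  /* sizeof == 5 */

/* embedding does NOT let later fields reuse the tail padding: */
struct embedded { struct div_plain d; uint8_t other1; uint8_t other2; };  /* sizeof == 12, not 8 */

int main(void) {
    printf("%zu %zu %zu\n", sizeof(struct div_plain),
           sizeof(struct div_packed), sizeof(struct embedded));
    return 0;
}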

I am skeptical of the branch-free solution, since the branch ought to be correctly predicted nearly all the time. I tried your branch, and the benchmark reports a slowdown. I'd love to be proven wrong on this, though.
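For context, the two shapes being compared look roughly like this (a hypothetical sketch of the technique, not either repository's exact code):

#include <stdint.h>

/* branchy: relies on the predictor learning the divisor's constant flag */
static inline uint32_t shift_branchy(uint32_t q, uint32_t n, int add, int shift) {
    if (add)
        q = ((n - q) >> 1) + q;
    return q >> shift;
}

/* branch-free: an all-ones/all-zeros mask replaces the branch */
static inline uint32_t shift_branchfree(uint32_t q, uint32_t n, int add, int shift) {
    uint32_t t = ((n - q) >> 1) & (uint32_t)-add;  /* mask is -1 iff add is set */
    return (q + t) >> shift;
}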

As for pass-by-value, my memory is that it made no difference, since the functions were invariably inlined.

Also note: gcc 4.9 is amazing and will vectorize the high multiplies, beating clang by 50%. However, that makes the benchmark somewhat less realistic. gcc 4.8 will also vectorize the high multiplies, but incorrectly. So be aware of that in your testing; perhaps disable the vectorizer for the benchmark.
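For anyone following along: the "high multiply" is the upper half of a widening product, and gcc can be told to keep a function scalar (the attribute below is gcc-specific and equivalent to compiling the file with -fno-tree-vectorize; the function names are illustrative):

#include <stdint.h>

/* upper 32 bits of a 32x32->64 product: the kernel the vectorizer targets */
static inline uint32_t mullhi_u32(uint32_t x, uint32_t y) {
    return (uint32_t)(((uint64_t)x * y) >> 32);
}

/* keep one benchmark loop scalar for an apples-to-apples comparison */
__attribute__((optimize("no-tree-vectorize")))
void bench_scalar(uint32_t *dst, const uint32_t *src, uint32_t magic, int n) {
    for (int i = 0; i < n; i++)
        dst[i] = mullhi_u32(src[i], magic);
}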

Overall I'm delighted someone has picked this up. Can't wait to see what else you find.

Safety0ff commented 10 years ago

> Exciting news!

I was excited when I wrote the title; I've since adjusted it to match this response.

> My memory is that I tested bitfields and found some compilers did not optimize them as effectively

Fair enough. I suppose bitfields are used seldom enough that some compilers neglect optimizing them.

> Regarding extra padding bytes: I hoped there might be no padding if the struct was embedded in a larger struct with subsequent fields: [snip] that requires adding the packed attribute.

Yeah, it's rare to see the tail-padding optimization without manually asking the compiler for it.

> Also note: gcc 4.9 is amazing and will vectorize the high multiplies, beating clang by 50%. However, that makes the benchmark somewhat less realistic.

Hmm, you're right: turning optimization down to -O2 makes it a 25% performance loss. I'll have to check the assembly output.

With clang, my code is a loss and your unswitching does not yield much benefit.

This was an interesting experiment nevertheless.