ridiculousfish / libdivide

Official git repository for libdivide: optimized integer division
http://libdivide.com

Interesting optimization and other observations #4

Closed · Safety0ff closed this 10 years ago

Safety0ff commented 10 years ago

Hello, I'm preparing to partially port libdivide to D and have some observations about the current implementation. For one, the per-divisor struct can be expressed with bitfields:

struct libdivide_s32_t {
    uint32_t magic;      /* magic multiplier */
    uint8_t shift : 5;   /* post-shift amount, 0-31 */
    bool shiftpath : 1;  /* power-of-two divisor: take the shift-only path */
    bool add : 1;        /* quotient needs the add-indicator fixup */
    int8_t sign : 1;     /* 0 or -1; used to negate the result */
};

Note that not only is this self-documenting, but using a signed type for the sign field removes the need for casting to guarantee an arithmetic shift followed by sign extension!
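To make that second point concrete, here's a minimal sketch (the mask variables are hypothetical, purely for illustration): reading the 1-bit signed field already yields an arithmetic value, whereas a sign bit packed into a plain byte needs a cast before the shift.

#include <stdint.h>
#include <stdio.h>

struct flags {
    int8_t sign : 1;   /* stores 0 or -1, reads back sign-extended */
};

int main(void) {
    struct flags f = { .sign = -1 };
    int32_t mask_a = f.sign;             /* already -1, no cast needed */

    uint8_t more = 0x80;                 /* sign kept in the top bit of a byte */
    int32_t mask_b = (int8_t)more >> 7;  /* cast first to get an arithmetic
                                            shift (gcc/clang behavior) */

    printf("%d %d\n", mask_a, mask_b);   /* both print -1 */
    return 0;
}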

I've implemented a proof of concept in my fork [1], though it is not complete (I'd rather spend the time on my port). It gives roughly a 2x speedup. Unswitching now gives only about a 17% improvement over the non-unswitched version, so it's a debatable optimization (91% versus 89% faster than system division). I hope I didn't introduce a bug in the process: I only checked the benchmark, because modifying the tester would have required too many changes. I also added branch hints using __builtin_expect; their shape is sketched after the link below.

[1] https://github.com/Safety0ff/libdivide
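The hints themselves are just the usual macros around __builtin_expect; roughly this shape (the function and its zero check are made up for illustration, not code from either repository):

#include <stdint.h>

#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

/* hypothetical fast/slow split, just to show where the hint goes */
static inline uint32_t checked_div(uint32_t n, uint32_t d) {
    if (UNLIKELY(d == 0))
        return 0;   /* rare path, laid out away from the hot code */
    return n / d;   /* common path the compiler arranges to fall through to */
}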

P.S.: Pass-by-value would probably have been better than pass-by-pointer, though GCC's optimizer seems good enough that the choice has no impact on performance.

ridiculousfish commented 10 years ago

Exciting news!

Here are my observations on your observations:

My memory is that I tested bitfields and found some compilers did not optimize them as effectively, e.g. emitting separate loads for separate fields. I don't know to what extent that remains true. Also, signed bitfields are frightening: in an int8_t sign : 1 field the single bit is the sign bit, so it can't even store 1. IMHO it's best to avoid signed bitfields entirely unless there's a perf win.
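A quick demonstration of that bizarreness (the out-of-range store is implementation-defined; gcc and clang typically wrap):

#include <stdint.h>
#include <stdio.h>

struct s {
    int8_t sign : 1;   /* the one bit is the sign bit, so only 0 and -1 fit */
};

int main(void) {
    struct s x;
    x.sign = 1;              /* out of range for the field */
    printf("%d\n", x.sign);  /* commonly prints -1, not 1 */
    return 0;
}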

Regarding extra padding bytes: I hoped there might be no padding if the struct was embedded in a larger struct with subsequent fields:

struct { libdivide_u32_t d; uint8_t other1; uint8_t other2; ...

but it appears I was mistaken: that requires adding the packed attribute. Anyway, if you can find a speedup using those extra three bytes, great; otherwise maybe we declare the structure packed.
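Concretely, here's a sketch of the sizes involved (assuming the current layout of a 4-byte magic plus a one-byte 'more' field, on a typical ABI where uint32_t has 4-byte alignment):

#include <stdint.h>
#include <stdio.h>

struct div_plain { uint32_t magic; uint8_t more; };  /* sizeof == 8: three bytes of tail padding */
struct __attribute__((packed)) div_packed { uint32_t magic; uint8_t more; };  /* sizeof == 5 */

/* embedding does NOT let later fields reuse the tail padding: */
struct embedded { struct div_plain d; uint8_t other1; uint8_t other2; };  /* sizeof == 12, not 8 */

int main(void) {
    printf("%zu %zu %zu\n", sizeof(struct div_plain),
           sizeof(struct div_packed), sizeof(struct embedded));
    return 0;
}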

I am skeptical of the branch-free solution, since the branch ought to be correctly predicted nearly all the time. I tried your branch, and the benchmark reports a slowdown. I'd love to be proven wrong on this, though.
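For context, the two shapes being compared look roughly like this (a hypothetical sketch of the technique, not either repository's exact code):

#include <stdint.h>

/* branchy: relies on the predictor learning the divisor's constant flag */
static inline uint32_t shift_branchy(uint32_t q, uint32_t n, int add, int shift) {
    if (add)
        q = ((n - q) >> 1) + q;
    return q >> shift;
}

/* branch-free: an all-ones/all-zeros mask replaces the branch */
static inline uint32_t shift_branchfree(uint32_t q, uint32_t n, int add, int shift) {
    uint32_t t = ((n - q) >> 1) & (uint32_t)-add;  /* mask is -1 iff add is set */
    return (q + t) >> shift;
}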

As for pass-by-value, my memory is that it made no difference, since the functions were invariably inlined.

Also note: gcc 4.9 is amazing and will vectorize the high multiplies, beating clang by 50%. However, that makes the benchmark somewhat less realistic. gcc 4.8 will also vectorize the high multiplies, but incorrectly. So be aware of that in your testing; perhaps disable the vectorizer for the benchmark.
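For anyone following along: the "high multiply" is the upper half of a widening product, and gcc can be told to keep a function scalar (the attribute below is gcc-specific and equivalent to compiling the file with -fno-tree-vectorize; the function names are illustrative):

#include <stdint.h>

/* upper 32 bits of a 32x32->64 product: the kernel the vectorizer targets */
static inline uint32_t mullhi_u32(uint32_t x, uint32_t y) {
    return (uint32_t)(((uint64_t)x * y) >> 32);
}

/* keep one benchmark loop scalar for an apples-to-apples comparison */
__attribute__((optimize("no-tree-vectorize")))
void bench_scalar(uint32_t *dst, const uint32_t *src, uint32_t magic, int n) {
    for (int i = 0; i < n; i++)
        dst[i] = mullhi_u32(src[i], magic);
}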

Overall I'm delighted someone has picked this up. Can't wait to see what else you find.

Safety0ff commented 10 years ago

> Exciting news!

I was excited when I wrote the title; I've since adjusted it to match this response.

> My memory is that I tested bitfields and found some compilers did not optimize them as effectively

Fair enough. I suppose bitfields are used seldom enough that some compilers neglect optimizing them.

> Regarding extra padding bytes: I hoped there might be no padding if the struct was embedded in a larger struct with subsequent fields: [snip] that requires adding the packed attribute.

Yeah, it's rare to see the tail-padding optimization without manually asking the compiler for it.

> Also note: gcc 4.9 is amazing and will vectorize the high multiplies, beating clang by 50%. However, that makes the benchmark somewhat less realistic.

Hmm, you're right: turning optimization down to -O2 makes it a 25% performance loss. I'll have to check the assembly output.

With clang, my code is a loss and your unswitching does not yield much benefit.

This was an interesting experiment nevertheless.