Poor performance when using composition of core numeric types

emiliopaolini commented 3 years ago

Hi everyone, I have a question: is it possible to speed-up the runtime when using composition of core numeric types. For example, I am trying to use this type: template <int Digits,int Exponent,class Narrowest=int> using saturated_elastic_scaled_integer = cnl::scaled_integer< cnl::overflow_integer< cnl::elastic_integer< Digits,cnl::wide_integer<cnl::digits_v<int>, Narrowest>>, cnl::saturated_overflow_tag>, cnl::power<Exponent>>; but it is very very slow. For example, the same code using float requires something like 5 minutes, while using this type it requires more than 3 hours. Is it possible to do something? Maybe i am doing something wrong with the construction of core types. Thank you very much!

johnmcfarlane commented 3 years ago

Hi. I can only guess without seeing the code but you're using most of the facilities of the library and some of them incur run-time costs of one kind or another.

Are you sure saturation is what you want? It often produces completely wrong results: it's not 'sticky' like floating-point saturation. Can you not simply avoid overflow instead? Try using the trapping tag instead and see if there is actually any overflow and then see why that is.
How many digits are actually being used? If it's greater than 63, you may be invoking multi-word wide_integer types. If it's over 127, you almost certainly are.
I have to ask: do you have optimisations enabled, in particular inlining?

emiliopaolini commented 3 years ago

Hi.

Yes, I would need saturation. Maybe i will try to use the overflow integer, and just let the program run without it to see if there are any improvements.
I am actually using this: using float_t = saturated_elastic_scaled_integer<5, -4>;
Yes, i have optimizations enabled!

Do you have any other suggestions? Maybe i can try to avoid using wide_integer too? Thank you very much!!

johnmcfarlane commented 3 years ago

With numbers that small, wide_integer shouldn't affect things although if your compiler is unable to inline code aggressively enough, it might get in the way of optimisation. You certainly shouldn't need it. You can test this by removing it and seeing if there are compiler errors.

But really, saturated_overflow_tag is probably your biggest problem. Try replacing with trapping_overflow_tag. If that runs without error, try undefined_overflow_tag to see if that improves performance. Currently, you're testing many operations for overflow -- many of which cannot overflow and all of which arguably shouldn't overflow. When overflow occurs, errors of arbitrary scale creep in, often accumulating over time.

johnmcfarlane commented 3 years ago

Can I ask if you're emulating saturating hardware? If so, does every last operation saturate?

emiliopaolini commented 3 years ago

Hi, I did some tests and removing overflow_integer made the simulations run faster. Now it requires more or less 30 minutes! So it is a good time.
Regarding the hardware, yes. I am trying to emulate a specific hardware and I would need every operation that i perform to stay on 6 bits let's say. That's why i was using overflow_integer with the saturated_overflow_tag. Are there any alternatives?

johnmcfarlane commented 3 years ago

When you use float, how is are the values being saturated?

emiliopaolini commented 3 years ago

Maybe it is better if i explain the scenario on which i am working. I am currently using the types provided by CNL inside Tiny-DNN, that is a library for deep learning in C++. I am training my models using float and then i have to perform the inference using fixed-point (to simulate the hardware, since for now it is only possible to do inference and not training). So i am not saturating when i use float, instead when i use fixed-point i would need the values to saturate.

johnmcfarlane commented 3 years ago

Try to use types with just enough range to always avoid overflow and reduce incidence of conversion between types with different scales. For example, if b has non-zero exponent, the following incurs a scaling operation:

a = a * b;

To that end, use auto to try and avoid unnecessary conversion:

auto c = a * b;  // guaranteed not to scale

Division can be more expensive for integers than for floats. So try and produce an inverse against which you can multiply many values. If that inverse has the right exponent, you will also avoid a scaling operation and get much more efficient code.

Think about other ways to avoid expensive calculations like replacing sqrt(distance) < radius with distance*distance < radius*radius. Remember that elastic_integer helps you avoid the overflow by widening which is generally cheaper than testing for overflow and far more accurate than saturating.

Perhaps if you can extract the inner loop and get it building on Compiler Explorer (example). If it's a simple piece of code, I'd be happy to take a quick look. Otherwise, it's still a lot of guesswork.

Remember you're using a fraction of the silicon. The downside is that much of the burden for handling range rests with you.

emiliopaolini commented 3 years ago

I think i will go with elastic_integer! Thank you for all the suggestions!!

johnmcfarlane commented 3 years ago

YW Bear in mind the width of results will exceed the width of operands for operations like multiply and add.

On Thu 15 Jul 2021, 07:28 emiliopaolini, @.***> wrote:

I think i will go with elastic_integer! Thank you for all the suggestions!!

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/johnmcfarlane/cnl/issues/913#issuecomment-880432820, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAFTGN7ZF65ZTAJIR2CIRATTXZ5Z3ANCNFSM5AG6XSOA .

emiliopaolini commented 3 years ago

Thank you! But for example, if i multiply two numbers and then assign the result to an elastic_scaled_integer of 6 bits, what happens? Will the result be converted to satisfy the constraint of 6 bits and it will lose resolution right?

Il giorno gio 15 lug 2021 alle ore 09:40 John McFarlane < @.***> ha scritto:

YW Bear in mind the width of results will exceed the width of operands for operations like multiply and add.

On Thu 15 Jul 2021, 07:28 emiliopaolini, @.***> wrote:

I think i will go with elastic_integer! Thank you for all the suggestions!!

— You are receiving this because you commented. Reply to this email directly, view it on GitHub <https://github.com/johnmcfarlane/cnl/issues/913#issuecomment-880432820 , or unsubscribe < https://github.com/notifications/unsubscribe-auth/AAFTGN7ZF65ZTAJIR2CIRATTXZ5Z3ANCNFSM5AG6XSOA

.

— You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHub https://github.com/johnmcfarlane/cnl/issues/913#issuecomment-880470568, or unsubscribe https://github.com/notifications/unsubscribe-auth/AIQEAHXIAC3LUV67LEO5B7TTX2GF7ANCNFSM5AG6XSOA .

johnmcfarlane commented 3 years ago

Yes, using the simplest conversion, it will scale up or down as necessary. The lower digits will be lost, i.e. precision loss. (That's kind-of inevitable.) If the number was too big to fit, you will get erroneous results.

One other thing that might help performance: use overflow_integer with its default tag (undefined) and in an optimised build, define NDEBUG. This tells the optimiser to assume there's no overflow.

Example: https://godbolt.org/z/sxYqvjanv

johnmcfarlane / cnl

Poor performance when using composition of core numeric types #913