shibatch / sleef

SIMD Library for Evaluating Elementary Functions, vectorized libm and DFT
https://sleef.org
Boost Software License 1.0

Large performance diff between sin u10 and u35 #107

Open xoofx opened 6 years ago

xoofx commented 6 years ago

Hey, I took a closer look at the performance of SLEEF and ran a simple benchmark with Sleef_sinf_u10 and Sleef_sinf_u35, also comparing against the stock sin with the MSVC compiler:

The bench is something along the lines of:

#include <math.h>

float BenchFloat_Sin(unsigned long count)
{
    float value = 0;
    float f = 5;
    for (unsigned long i = 0; i < count; i++)
    {
        /* sinf stands in for the function under test
           (Sleef_sinf_u10, Sleef_sinf_u35, MSVC sinf, ...) */
        value += sinf(f);
        value += sinf(f + 1);
        value += sinf(f + 2);
        value += sinf(f + 3);
        value += sinf(f + 4);
        f += i;
    }
    return value;
}

where count was dynamically calibrated by another bench; in this case, on my machine, it was set to 6574622.

The timing results are:

MSVC sinf: 418ms
Sleef_sinf_u10: 2058ms
Sleef_sinf_u35: 179ms

Between u10 and u35 there is more than a 10x difference (!)... I haven't checked the accuracy of MSVC sinf, so it is likely not 1 ULP. The cost of reaching 1 ULP looks disproportionate.

Do you think these results are expected?

Also, if it is not possible to achieve 1 ULP without this cost, would it be possible to introduce an intermediate precision (2 ULP?) that could narrow the gap and make it more viable?

More generally, I was also wondering whether adding more ULP precision choices (u10, u20, u35, u60) could be helpful in some cases...

shibatch commented 6 years ago

The scalar functions are not seriously optimized. They are provided for easier understanding of the algorithms.

As you know, I am planning to make the scalar functions use AVX2, or the best extension available on the machine, and that should resolve this problem. Once this feature is implemented, the computation speed of the scalar functions will be almost the same as that of the 128-bit SIMD versions.

xoofx commented 6 years ago

As you know, I am planning to make the scalar functions use AVX2, or the best extension available on the machine, and that should resolve this problem.

But judging from the assembler output, it looks less like a problem of instruction selection than of the complexity of the u10 algorithm vs u35. You can check in this gist.

xoofx commented 6 years ago

From my tests, the f4 version actually suffers from the same issue (in the f4 version the ss instructions are replaced by ps instructions, but they are basically the same).

shibatch commented 6 years ago

The main reason is that the availability of FMA affects performance significantly. Looking at the assembly output, I think the difference is understandable. Compare the number of add and sub instructions.

xoofx commented 6 years ago

Yeah, exactly, that's why it is quite shocking if you look at the number of ASM instructions:

u10: 385 instructions
u35: 75 instructions

Even with FMA, if the price of u10 is that steep, I would rather trade it for a u20 if it improves performance by more than 2x.

shibatch commented 6 years ago

Did you see the benchmark result? The difference is less than 1.5x with AVX2.

http://sleef.org/benchmark.xhtml

xoofx commented 6 years ago

The difference is less than 1.5x with AVX2.

Not sure which difference you are referring to? (I also was not able to find the AVX2 results, actually...)

But again, what do you think of providing a u20 that would avoid the steep cost of u10? I don't really like the idea of relying only on AVX2 for better performance when we could also exploit the complexity of the approximation used...

shibatch commented 6 years ago

Since the benchmarking tool is executed on a Core i7-6700, it automatically selects the AVX2 implementation. The graph shows that the execution time of Sleef_sind4_u10 is 0.015us (blue bar), and that of Sleef_sind4_u35 is 0.011us.

It does not make sense to implement a u20 version, since there is not much difference between u10 and u35 if FMA is available. Could you explain why you need fast execution on computers without FMA?

shibatch commented 6 years ago

It is also hard to say whether a u20 version would be faster than u10. I also need to check what accuracy the u35 versions actually achieve: I basically only check that their error is less than 3.5 ULP, and they are in fact more accurate than that.

xoofx commented 6 years ago

Could you explain why you need fast execution on computers without FMA?

Because AFAIK AVX2 only arrived with Haswell (2013), so it is very recent and many computers out there cannot run it. Check the Unity hardware statistics (CPU section at the bottom): unfortunately, we don't have the luxury of relying on AVX2. We don't have exact CPU details there, but I know the percentage of AVX2-capable machines is likely very low; the installed base doesn't turn over that fast, especially in China, for example...

It is also hard to say if u20 version will be faster than u10.

That's what I would like to know. Just based on the difference in instruction count between the u35 and u10 versions, that's a massive diff...

I will run the test with the AVX2 version (but only with float input/float output) just to check it against the plain u10 version.

shibatch commented 6 years ago

Okay, I will consider it, but this would take time. For usage in Unity, why do you need 1 ULP accuracy? I think 3.5 ULP is enough for gaming?

xoofx commented 6 years ago

Okay, I will consider it, but this would take time.

I'm not saying that we absolutely need this, but we could first verify it on sin, for example, and then decide what to do next.

For usage in unity, why do you need 1 ULP accuracy? I think 3.5 ULP is enough for gaming?

The usage varies across the ULP range. At one end we would like exact, deterministic results across platforms, especially for things like physics that can diverge quickly; if you need to keep processes synchronized across different machines, that can be important. But as you suggested, much computation could also accept lower precision in some cases. (I'm not saying that SLEEF's 1 ULP functions alone solve that particular determinism problem; this is just to illustrate the range of precision we are looking for.)

Basically, it would be great to have a choice of accuracy ranging from u10 through u20 and u35 to u60, assuming the performance gain between them is still noticeable.

shibatch commented 6 years ago

The usage varies across the ULP range. At one end we would like exact, deterministic results across platforms, especially for things like physics that can diverge quickly; if you need to keep processes synchronized across different machines, that can be important. But as you suggested, much computation could also accept lower precision in some cases. (I'm not saying that SLEEF's 1 ULP functions alone solve that particular determinism problem; this is just to illustrate the range of precision we are looking for.)

If you are talking about having consistent results between different platforms, then I think it is more related to https://github.com/shibatch/sleef/issues/47 than to accuracy. But it is basically almost impossible to get bit-identical results between different architectures, since the available vector extensions differ. Unity and the video game executables would also have to be compiled with FP contraction and other such optimizations disabled. I think this is not an easy problem, and we need to think about it carefully.

xoofx commented 6 years ago

If you are talking about having consistent results between different platforms, then I think it is more related to #47 than accuracy.

Yep, it was just to illustrate the range of problem we are looking into.

Most important for now is performance of functions with the choice of precision.

shibatch commented 6 years ago

The main difference between u10 and u35 is whether DD (double-double) operators are used. DD operators are required to achieve 1 ULP accuracy, and the u35 functions are made by removing them.

It is possible to make the output of the non-FMA vector functions consistent across all platforms, and by using the 3.5 ULP functions, the computation speed would not be a large problem. Does this work for you?

Another direction is to implement correctly rounded functions. This would guarantee consistent results across all platforms, but it would be slower than the 1 ULP functions. I like the idea of implementing correctly rounded functions, but the question is whether there is real demand for them.

carlkl commented 6 years ago

Scalar functions could be implemented in a different way:

On a much more limited argument range, it may be possible to use a fast implementation without DD operators and still get precision similar to the _u10 implementations. For arguments beyond this range, the existing implementation with DD operators could be used. This is not branch-free, but for scalar functions that does not matter, IMHO. Is this feasible with SLEEF?

shibatch commented 6 years ago

Is there any documentation on how to do it?

carlkl commented 6 years ago

No, it's just a naive idea and question from my side. Personally, I have no idea how much precision one can get when approximating trigonometric functions over smaller argument ranges.

shibatch commented 6 years ago

If it were possible to improve computation speed that way, someone would already have done it. You know, scalar computation of trigonometric functions is an older problem than the vector one.

carlkl commented 6 years ago

@shibatch, thank you for that explanation.

shibatch commented 6 years ago

I am now considering adding FMA support for AArch32. AArch32 processors with VFPv4 support have FMA; such processors include the Cortex-A5, A7, A15 and A17. So this would benefit almost all low-end smartphones.
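As a side note, compilers expose this capability at compile time. A tiny sketch using the standard feature-test macros (`__FMA__` is predefined by GCC/Clang on x86 when building with -mfma; `__ARM_FEATURE_FMA` is the ARM ACLE macro set for targets like vfpv4):

```c
/* Returns 1 when the build target has hardware FMA, 0 otherwise.
   On targets without it, libm's fma() falls back to a slow
   software emulation, so code may want to pick a different path. */
static int has_hw_fma(void)
{
#if defined(__FMA__) || defined(__ARM_FEATURE_FMA)
    return 1;
#else
    return 0;
#endif
}
```

Runtime dispatch (cpuid on x86, HWCAP on Linux/ARM) is the complementary approach when one binary must serve both kinds of machines.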