Improve the accuracy of minps, maxps, rcp, rsqrt in NEON

syoyo commented 4 years ago

ARM NEON's rcp estimate and rsqt estimate has less accuracy than corresponding SSE2 ops.

https://qiita.com/sanmanyannyan/items/62bb5ce6ada975a7106a

We need to increase the iteration of NewtonRaphson steps(2 or 3 times more iterations) to get the same level of the accuracy of rcp, rsqrt in SSE2 code path (estimate + one round of NewtonRaphson)

Currently we use use 2 iterations for NEON code path(vrcpsq_f32, vrsqrtsq_f32)

https://github.com/lighttransport/embree-aarch64/blob/3f75f8cb4e553d13dced941b5fefd4c826835a6b/common/math/math.h#L73

Relates hole issue in #20

syoyo commented 4 years ago

@maikschulze Could you please try adding more NewtonRaphson iterations and check if it will solve hole/artifact issues in curve and displacement test scene?

maikschulze commented 4 years ago

Hi @syoyo , thank you very for filing this bug. I haven't had the time yet to test the influence of different amounts of Newton-Raphson iterations yet. Generally, most holes in the curve have disappeared with 2 iterations in NEON and the formula by Intel -> 3 in total.

However, I could not get rid off one particular hole even with 8 iterations for this ray:

RTCIntersectContext context;
rtcInitIntersectContext(&context);

RTCRayHit rayhitdebug;
rayhitdebug.ray.org_x = -0.000000476837158;
rayhitdebug.ray.org_y = 4.99999952;
rayhitdebug.ray.org_z = -7.07106781;

rayhitdebug.ray.dir_x = 0.635353327;
rayhitdebug.ray.dir_y = -0.5503245;
rayhitdebug.ray.dir_z = 0.541727901;

rayhitdebug.ray.time = 0.0;
rayhitdebug.ray.tnear = 0.0;
rayhitdebug.ray.tfar = 100000.0;
rayhitdebug.ray.mask = -1;
rayhitdebug.ray.id = 0;
rayhitdebug.ray.flags = 0;

rtcIntersect1(g_scene, &context, &rayhitdebug);

This ray is a bit special because it hits the curve of the interpolation tutorial at (u=0.860379398, v =0). I've then started debugging this on my smartphone, but ran into weird seg faults again in the function intersect_bezier_recursive_jacobian. After carefully debugging the difference, I believe I've found one important difference that is independent of numerical precision.

The CylinderN::intersect function creates NaN values with sqrt(D) along with a mask that marks then as non-valid.

It seems that this block 'repairs' the NaN values:

/* clamp and correct u parameter */
      u_outer0 = clamp(u_outer0,vfloatx(0.0f),vfloatx(1.0f));
      u_outer1 = clamp(u_outer1,vfloatx(0.0f),vfloatx(1.0f));
      u_outer0 = lerp(u0,u1,(vfloatx(step)+u_outer0)*(1.0f/float(vfloatx::size)));
      u_outer1 = lerp(u0,u1,(vfloatx(step)+u_outer1)*(1.0f/float(vfloatx::size)));

on x64:

u_outer0_before = {-nan(ind), 1.61889660, 0.594353378, 0.295167416}
u_outer1_before = {-nan(ind), 1.50396919, 0.380308062, 1.70483279}

u_outer0_after  = {1.00000000, 1.00000000, 0.594353378, 0.295167416}
u_outer1_after  = {1.00000000, 1.00000000, 0.380308062, 1.00000000}

but not on aarch64:

u_outer0_before = {NaN, 1.61888874, 0.594353676, 0.295167357}
u_outer1_before = {NaN, 1.50397587, 0.380308151, 1.70483422}

u_outer0_after  ={NaN, 1, 0.594353676, 0.295167357}
u_outer1_after  = {NaN, 1, 0.380308151, 1}

I will try to solve this and continue from there. It's a bit tedious for my to debug this in Android Studio.

syoyo commented 4 years ago

@maikschulze Good catch!

As you may know, min/max in SSE2 correctly(?) handles NaN

https://stackoverflow.com/questions/40196817/what-is-the-instruction-that-gives-branchless-fp-min-and-max-on-x86

whereas ARM NEON's vmin/vmax is not.

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0802b/CIHDEEBE.html

it returns NaN when any input of vmin/vmax is NaN.

Fortunately, from ARMv8(or AARCH64), there is vminnm and vmaxnm intrinsics

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0802b/CIHFCJCF.html

which correctly handles NaN according to IEEE754-2008. So, replacing vmaxq_f32 and vminq_f32 with vmaxnmq_f32 and vminnmq_f32 respectively should solve the inconsistency of clamp() behavior.

It's a bit tedious for my to debug this in Android Studio.

If you can prepare minimal and reproducible code(e.g. like this https://github.com/lighttransport/embree-aarch64/tree/master/test/curve), I can run the code on AARCH64 linux(Jetson AGX).

maikschulze commented 4 years ago

@syoyo , you're amazing. I quickly addedvminnmq_f32and vmaxnmq_f32into SSE2NEON.h and there's no more hole. Nan's are now properly replaced in the curve algorithm. I will test more and create a submit.

Thank you very much!

I hope these debugging sessions will get easier once I've create a public test collection. I have various ideas on making it precise and convenient for everyone given my learning experience.

syoyo commented 4 years ago

@maikschulze :tada: Yeah, PR is very appreciated.

maikschulze commented 4 years ago

Hi, my PR will take a bit longer. While the curve intersections now seem fixed with vminnmq_f32 and vmaxnmq_f32, my hair intersection test crashes with an assert in bvh_traverser1.h assert(c0 != BVH::emptyNode); I will have to understand and fix this first.

syoyo commented 4 years ago

@maikschulze I have implemented correct(as far as I've tested) implementation of _mm_min_ps and _mm_max_ps in aarch64-v3.11.0 branch: https://github.com/lighttransport/embree-aarch64/commit/e4a2f68d1d849de4521c495bb6fd9a5f6017ad0d

It should give more consistent result with x86-64 result.

Could you try aarch64-v3.11.0 branch? It also have some improvements for curves(hairs) from original intel embree v3.11.0.

maikschulze commented 4 years ago

Hi @syoyo,

I have pulled the latest state https://github.com/lighttransport/embree-aarch64/commit/7eb0b58daf063f466fa52ea10227c90c8b8ef2ab and created test results for windows-x64 and android-arm64.

Unfortunately, the infamous hole is still present in the curve for arm64 (right), whereas x64 (left) has no such issue:

I have not debugged into the setup. The artifact looks similar to the discovered lack of 'NaN repair' in function CylinderN::intersect in my test scene (interpolation tutorial).

syoyo commented 4 years ago

@maikschulze Thanks!

There is still a bug we need to hunt.

I have added test/numric test in recent aarch64-v3.11.0 branch https://github.com/lighttransport/embree-aarch64/tree/aarch64-v3.11.0/test/numeric

and confirmed at least NaN repair now works fine on aarch64.

void issue_24()
{
  typedef embree::vfloat4 vfloatx;

  vfloatx a(-std::numeric_limits<float>::signaling_NaN());

  vfloatx b = clamp(a, vfloatx(0.0f), vfloatx(1.0f));

  printf("a.x = %f\n", a[0]);
  printf("b.x = %f\n", b[0]); // expects 1.0

  vfloatx c(-std::numeric_limits<float>::quiet_NaN());

  vfloatx d = clamp(c, vfloatx(0.0f), vfloatx(1.0f));

  printf("c.x = %f\n", c[0]);
  printf("d.x = %f\n", d[0]); // expects 1.0
}

a.x = -nan
b.x = 1.000000
c.x = -nan
d.x = 1.000000

syoyo commented 4 years ago

@maikschulze I think I have fixed the hole issue for curve(tube) geometry.

NEON implementation of _mm_div_ps and _mm_rsqrt_ps requires one(or two) more Newton-Raphson iteration than SSE2 to get the same level of the accuracy of SSE2.

Could you try aarch64-v3.11.0-rcp-improve branch? https://github.com/lighttransport/embree-aarch64/tree/aarch64-v3.11.0-rcp-improve

This branch uses 4 Newton-Raphson iteration for div and rsqrt. (4 may be overkill and we may be able to reduce it to 3)

At least I could now confirm apparently no holes onto the curve(tube) geometry anymore for interpolation tutorial. :tada:

maikschulze commented 4 years ago

Hi @syoyo , I've tested https://github.com/lighttransport/embree-aarch64/commit/a69200d0641f839e579d0fb045c45de7e1d3ebae on Android aarch64 and cannot observe any improvement, unfortunately. The hole in the curve remains. I've then tried to increase the amount of iterations further in various related functions, but could not observe any change.

For sanity: is there any special compile flag that should be used? Can you confirm that you do not get such a hole in release mode?

syoyo commented 4 years ago

@maikschulze Sorry, there was a missing commit. Please try this commit(commited to aarch64-v3.11.0-rcp-improvement branch): https://github.com/lighttransport/embree-aarch64/commit/a435e0e151047409c878962055fb3215dd9968ba

FYI, I am running interpolation on Jetson AGX.

Here is the cmake configuration and CXX flags. I am using clang-10.

$CMAKE_BIN \
  -DCMAKE_BUILD_TYPE=RelWithDebInfo \
  -DEMBREE_ARM=On \
  -DEMBREE_ADDRESS_SANITIZER=Off \
  -DCMAKE_INSTALL_PREFIX=$HOME/local/embree3 \
  -DCMAKE_C_COMPILER=clang \
  -DCMAKE_CXX_COMPILER=clang++ \
  -DEMBREE_ISPC_SUPPORT=Off \
  -DEMBREE_TASKING_SYSTEM=Internal \
  -DEMBREE_TUTORIALS=On \
  -DEMBREE_TUTORIALS_GLFW=On \
  -DEMBREE_MAX_ISA=SSE2 \
  -DEMBREE_RAY_PACKETS=Off \
  ..

...
/home/syoyo/local/clang+llvm-10.0.0-aarch64-linux-gnu/bin/clang++  -DEMBREE_TARGET_SSE2 -DTASKING_INTERNAL  -Wall -Wformat -Wformat-security -fPIE -fPIC -std=c++11 -fvisibility=hidden -fvisibility-inlines-hidden -fno-strict-aliasing -fno-tree-vectorize -D_FORTIFY_SOURCE=2  -g -DNDEBUG -O3     -o CMakeFiles/algorithms.dir/parallel_for.cpp.o -c /home/syoyo/work/embree-aarch64/common/algorithms/parallel_for.cpp
...

maikschulze commented 4 years ago

Hi @syoyo , unfortunately the hole is still here. I've tested your latest code on Android arm64 and iOS arm64. The results there are identical and show the hole ;(

syoyo commented 4 years ago

@maikschulze I see. Could you please open another issue with minimal reproducible scene data and code(or view configuration for interpolation)? The issue may be related to OS or compiler.

And I'll close this since I've already fixed(improved) min/max and rcp/rsqrt accuracy in aarch64-v3.11.0-rcp-improve branch.

maikschulze commented 4 years ago

@syoyo , I'm optimistic that I have identified the issue being caused by sqrt(0)-> +inf.

After a coffee break I will test further and create an appropriate issue with repro and pull request.

syoyo commented 4 years ago

@maikschulze Super awesome! BTW, I have merged aarch64-v3.11.0-rcp-improve to master, so please use master branch when you create a pull request.

lighttransport / embree-aarch64

Improve the accuracy of minps, maxps, rcp, rsqrt in NEON #24