Closed syoyo closed 4 years ago
@maikschulze Could you please try adding more NewtonRaphson iterations and check if it will solve hole/artifact issues in curve and displacement test scene?
Hi @syoyo , thank you very for filing this bug. I haven't had the time yet to test the influence of different amounts of Newton-Raphson iterations yet. Generally, most holes in the curve have disappeared with 2 iterations in NEON and the formula by Intel -> 3 in total.
However, I could not get rid off one particular hole even with 8 iterations for this ray:
RTCIntersectContext context;
rtcInitIntersectContext(&context);
RTCRayHit rayhitdebug;
rayhitdebug.ray.org_x = -0.000000476837158;
rayhitdebug.ray.org_y = 4.99999952;
rayhitdebug.ray.org_z = -7.07106781;
rayhitdebug.ray.dir_x = 0.635353327;
rayhitdebug.ray.dir_y = -0.5503245;
rayhitdebug.ray.dir_z = 0.541727901;
rayhitdebug.ray.time = 0.0;
rayhitdebug.ray.tnear = 0.0;
rayhitdebug.ray.tfar = 100000.0;
rayhitdebug.ray.mask = -1;
rayhitdebug.ray.id = 0;
rayhitdebug.ray.flags = 0;
rtcIntersect1(g_scene, &context, &rayhitdebug);
This ray is a bit special because it hits the curve of the interpolation tutorial at (u=0.860379398, v =0). I've then started debugging this on my smartphone, but ran into weird seg faults again in the function intersect_bezier_recursive_jacobian. After carefully debugging the difference, I believe I've found one important difference that is independent of numerical precision.
The CylinderN::intersect function creates NaN values with sqrt(D)
along with a mask that marks then as non-valid.
It seems that this block 'repairs' the NaN values:
/* clamp and correct u parameter */
u_outer0 = clamp(u_outer0,vfloatx(0.0f),vfloatx(1.0f));
u_outer1 = clamp(u_outer1,vfloatx(0.0f),vfloatx(1.0f));
u_outer0 = lerp(u0,u1,(vfloatx(step)+u_outer0)*(1.0f/float(vfloatx::size)));
u_outer1 = lerp(u0,u1,(vfloatx(step)+u_outer1)*(1.0f/float(vfloatx::size)));
on x64:
u_outer0_before = {-nan(ind), 1.61889660, 0.594353378, 0.295167416}
u_outer1_before = {-nan(ind), 1.50396919, 0.380308062, 1.70483279}
u_outer0_after = {1.00000000, 1.00000000, 0.594353378, 0.295167416}
u_outer1_after = {1.00000000, 1.00000000, 0.380308062, 1.00000000}
but not on aarch64:
u_outer0_before = {NaN, 1.61888874, 0.594353676, 0.295167357}
u_outer1_before = {NaN, 1.50397587, 0.380308151, 1.70483422}
u_outer0_after ={NaN, 1, 0.594353676, 0.295167357}
u_outer1_after = {NaN, 1, 0.380308151, 1}
I will try to solve this and continue from there. It's a bit tedious for my to debug this in Android Studio.
@maikschulze Good catch!
As you may know, min/max in SSE2 correctly(?) handles NaN
whereas ARM NEON's vmin/vmax is not.
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0802b/CIHDEEBE.html
it returns NaN when any input of vmin/vmax is NaN.
Fortunately, from ARMv8(or AARCH64), there is vminnm and vmaxnm intrinsics
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0802b/CIHFCJCF.html
which correctly handles NaN according to IEEE754-2008. So, replacing vmaxq_f32
and vminq_f32
with vmaxnmq_f32
and vminnmq_f32
respectively should solve the inconsistency of clamp()
behavior.
It's a bit tedious for my to debug this in Android Studio.
If you can prepare minimal and reproducible code(e.g. like this https://github.com/lighttransport/embree-aarch64/tree/master/test/curve), I can run the code on AARCH64 linux(Jetson AGX).
@syoyo , you're amazing. I quickly addedvminnmq_f32
and vmaxnmq_f32
into SSE2NEON.h and there's no more hole. Nan's are now properly replaced in the curve algorithm. I will test more and create a submit.
Thank you very much!
I hope these debugging sessions will get easier once I've create a public test collection. I have various ideas on making it precise and convenient for everyone given my learning experience.
@maikschulze :tada: Yeah, PR is very appreciated.
Hi, my PR will take a bit longer.
While the curve intersections now seem fixed with vminnmq_f32
and vmaxnmq_f32
, my hair intersection test crashes with an assert in bvh_traverser1.h
assert(c0 != BVH::emptyNode);
I will have to understand and fix this first.
@maikschulze I have implemented correct(as far as I've tested) implementation of _mm_min_ps and _mm_max_ps in aarch64-v3.11.0
branch: https://github.com/lighttransport/embree-aarch64/commit/e4a2f68d1d849de4521c495bb6fd9a5f6017ad0d
It should give more consistent result with x86-64 result.
Could you try aarch64-v3.11.0
branch? It also have some improvements for curves(hairs) from original intel embree v3.11.0.
Hi @syoyo,
I have pulled the latest state https://github.com/lighttransport/embree-aarch64/commit/7eb0b58daf063f466fa52ea10227c90c8b8ef2ab and created test results for windows-x64 and android-arm64.
Unfortunately, the infamous hole is still present in the curve for arm64 (right), whereas x64 (left) has no such issue:
I have not debugged into the setup. The artifact looks similar to the discovered lack of 'NaN repair' in function CylinderN::intersect in my test scene (interpolation tutorial).
@maikschulze Thanks!
There is still a bug we need to hunt.
I have added test/numric
test in recent aarch64-v3.11.0
branch https://github.com/lighttransport/embree-aarch64/tree/aarch64-v3.11.0/test/numeric
and confirmed at least NaN repair
now works fine on aarch64.
void issue_24()
{
typedef embree::vfloat4 vfloatx;
vfloatx a(-std::numeric_limits<float>::signaling_NaN());
vfloatx b = clamp(a, vfloatx(0.0f), vfloatx(1.0f));
printf("a.x = %f\n", a[0]);
printf("b.x = %f\n", b[0]); // expects 1.0
vfloatx c(-std::numeric_limits<float>::quiet_NaN());
vfloatx d = clamp(c, vfloatx(0.0f), vfloatx(1.0f));
printf("c.x = %f\n", c[0]);
printf("d.x = %f\n", d[0]); // expects 1.0
}
a.x = -nan
b.x = 1.000000
c.x = -nan
d.x = 1.000000
@maikschulze I think I have fixed the hole issue for curve(tube) geometry.
NEON implementation of _mm_div_ps and _mm_rsqrt_ps requires one(or two) more Newton-Raphson iteration than SSE2 to get the same level of the accuracy of SSE2.
Could you try aarch64-v3.11.0-rcp-improve
branch? https://github.com/lighttransport/embree-aarch64/tree/aarch64-v3.11.0-rcp-improve
This branch uses 4 Newton-Raphson iteration for div and rsqrt. (4 may be overkill and we may be able to reduce it to 3)
At least I could now confirm apparently no holes onto the curve(tube) geometry anymore for interpolation
tutorial. :tada:
Hi @syoyo , I've tested https://github.com/lighttransport/embree-aarch64/commit/a69200d0641f839e579d0fb045c45de7e1d3ebae on Android aarch64 and cannot observe any improvement, unfortunately. The hole in the curve remains. I've then tried to increase the amount of iterations further in various related functions, but could not observe any change.
For sanity: is there any special compile flag that should be used? Can you confirm that you do not get such a hole in release mode?
@maikschulze Sorry, there was a missing commit. Please try this commit(commited to aarch64-v3.11.0-rcp-improvement
branch): https://github.com/lighttransport/embree-aarch64/commit/a435e0e151047409c878962055fb3215dd9968ba
FYI, I am running interpolation
on Jetson AGX.
Here is the cmake configuration and CXX flags. I am using clang-10.
$CMAKE_BIN \
-DCMAKE_BUILD_TYPE=RelWithDebInfo \
-DEMBREE_ARM=On \
-DEMBREE_ADDRESS_SANITIZER=Off \
-DCMAKE_INSTALL_PREFIX=$HOME/local/embree3 \
-DCMAKE_C_COMPILER=clang \
-DCMAKE_CXX_COMPILER=clang++ \
-DEMBREE_ISPC_SUPPORT=Off \
-DEMBREE_TASKING_SYSTEM=Internal \
-DEMBREE_TUTORIALS=On \
-DEMBREE_TUTORIALS_GLFW=On \
-DEMBREE_MAX_ISA=SSE2 \
-DEMBREE_RAY_PACKETS=Off \
..
...
/home/syoyo/local/clang+llvm-10.0.0-aarch64-linux-gnu/bin/clang++ -DEMBREE_TARGET_SSE2 -DTASKING_INTERNAL -Wall -Wformat -Wformat-security -fPIE -fPIC -std=c++11 -fvisibility=hidden -fvisibility-inlines-hidden -fno-strict-aliasing -fno-tree-vectorize -D_FORTIFY_SOURCE=2 -g -DNDEBUG -O3 -o CMakeFiles/algorithms.dir/parallel_for.cpp.o -c /home/syoyo/work/embree-aarch64/common/algorithms/parallel_for.cpp
...
Hi @syoyo , unfortunately the hole is still here. I've tested your latest code on Android arm64 and iOS arm64. The results there are identical and show the hole ;(
@maikschulze I see. Could you please open another issue with minimal reproducible scene data and code(or view configuration for interpolation
)? The issue may be related to OS or compiler.
And I'll close this since I've already fixed(improved) min/max and rcp/rsqrt accuracy in aarch64-v3.11.0-rcp-improve
branch.
@syoyo , I'm optimistic that I have identified the issue being caused by sqrt(0)-> +inf.
After a coffee break I will test further and create an appropriate issue with repro and pull request.
@maikschulze Super awesome! BTW, I have merged aarch64-v3.11.0-rcp-improve
to master
, so please use master
branch when you create a pull request.
ARM NEON's rcp estimate and rsqt estimate has less accuracy than corresponding SSE2 ops.
https://qiita.com/sanmanyannyan/items/62bb5ce6ada975a7106a
We need to increase the iteration of NewtonRaphson steps(2 or 3 times more iterations) to get the same level of the accuracy of rcp, rsqrt in SSE2 code path (estimate + one round of NewtonRaphson)
Currently we use use 2 iterations for NEON code path(vrcpsq_f32, vrsqrtsq_f32)
https://github.com/lighttransport/embree-aarch64/blob/3f75f8cb4e553d13dced941b5fefd4c826835a6b/common/math/math.h#L73
Relates hole issue in #20