openworm / OpenWorm

Repository for the main Dockerfile with the OpenWorm software stack and project-wide issues
http://openworm.org
MIT License

errorcalculateProjectionOfPointToPlane() #161

Closed vellamike closed 10 years ago

vellamike commented 10 years ago

calculateProjectionOfPointToPlane()

When I run on Linux I often get the error message:

calculateProjectionOfPointToPlane() returned errorcalculateProjectionOfPointToPlane() returned errorcalculateProjectionOfPointToPlane() returned errorcalculateProjectionOfPointToPlane() returned errorcalculateProjectionOfPointToPlane()...

And so on, hundreds of times. Occasionally it doesn't happen, which suggests some kind of race condition?

Any idea why this might be happening @a-palyanov ?

Neurophile commented 10 years ago

In

__kernel void computeInteractionWithMembranes

calls

pos_p = calculateProjectionOfPointToPlane(...

calls

denominator = calcDeterminant3x3(a_1,a_2,a_3);

which runs the math

    return  c1[1]*c2[2]*c3[3] + c1[2]*c2[3]*c3[1] + c1[3]*c2[1]*c3[2]  
          - c1[3]*c2[2]*c3[1] - c1[1]*c2[3]*c3[2] - c1[2]*c2[1]*c3[3];

and if that result equals 0 you get the error message. Taking a stab in the dark: computeInteractionWithMembranes both reads from and writes to various parts of __global float4 *position, which is the ultimate source of the data in the above calculation. I'm still trying to understand how memory synchronization works, but I think we need a barrier. Try adding one in computeInteractionWithMembranes here:

    barrier(CLK_GLOBAL_MEM_FENCE);
    if(membrane_jd_counter>0)
    {

That isolates the parts that read position from the parts that write it, and should pause every thread in a work-group at the barrier until all of them have reached it (barrier() does not synchronize across work-groups).
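The hazard suspected here can be illustrated on the host with a single-threaded C sketch. It is only an analogy: the loop direction stands in for an unpredictable work-item schedule, and the array, the update rule and the function names are all made up for illustration, not taken from sphFluid.cl. Updating the buffer in place gives a result that depends on the order of the updates, while finishing all reads before any write (which is what a barrier between the two phases is meant to enforce within a work-group) gives the same result either way.

```c
#include <string.h>

#define N 4

/* Hypothetical update rule: each element adds its left neighbour.
   In place, an element may read a neighbour that was already updated,
   so the result depends on the order of the updates. */
static void update_in_place(float *pos, int reverse)
{
    if (!reverse) {
        for (int i = 1; i < N; ++i)
            pos[i] += pos[i - 1];
    } else {
        for (int i = N - 1; i >= 1; --i)
            pos[i] += pos[i - 1];
    }
}

/* Read phase first (snapshot), write phase second.
   Every write sees only old values, so the order no longer matters. */
static void update_two_phase(float *pos, int reverse)
{
    float snapshot[N];
    memcpy(snapshot, pos, sizeof snapshot);

    if (!reverse) {
        for (int i = 1; i < N; ++i)
            pos[i] = snapshot[i] + snapshot[i - 1];
    } else {
        for (int i = N - 1; i >= 1; --i)
            pos[i] = snapshot[i] + snapshot[i - 1];
    }
}
```

Starting from {1, 1, 1, 1}, the in-place version yields {1, 2, 3, 4} going forward and {1, 2, 2, 2} going backward, while the two-phase version yields {1, 2, 2, 2} in both directions.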

vellamike commented 10 years ago

Great idea, but unfortunately it didn't do the trick. Still getting the same error.

vellamike commented 10 years ago

CORRECTION: with the barrier I actually get a segfault at the end with AMD (and the original error with Intel). Here's the full output:

Pyramidal simulation class loaded!
 CL_PLATFORM_VERSION [0]:   OpenCL 1.2 AMD-APP (1214.3)
CL_CONTEXT_PLATFORM [0]: CL_DEVICE_NAME [0]:    AMD Opteron(TM) Processor 6272                 
CL_CONTEXT_PLATFORM [0]: CL_DEVICE_MAX_WORK_GROUP_SIZE [0]:     1024
CL_CONTEXT_PLATFORM [0]: CL_DEVICE_MAX_COMPUTE_UNITS [0]:   64
CL_CONTEXT_PLATFORM [0]: CL_DEVICE_GLOBAL_MEM_SIZE [0]:     2032017408
CL_CONTEXT_PLATFORM [0]: CL_DEVICE_GLOBAL_MEM_CACHE_SIZE [0]:   16384
CL_CONTEXT_PLATFORM [0]: CL_DEVICE_LOCAL_MEM_SIZE [0]:  32768

[[ Step 0 ]]
_runClearBuffers:       8.165 ms
_runHashParticles:      4.037 ms
_runSort:          16.651 ms
_runSortPostPass:       3.325 ms
_runIndexx:            11.989 ms
_runIndexPostPass:      1.676 ms
_runFindNeighbors:     32.445 ms
_runPCISPH:            69.302 ms    3 iteration(s)
calculateProjectionOfPointToPlane() returned errorcalculateProjectionOfPointToPlane() returned errorcalculateProjectionOfPointToPlane() returned errorSegmentation fault (core dumped)

vellamike commented 10 years ago

I added the following print statements to help understand what is going on:

        float4 b = (float4)(0, b_1, b_2, b_3);                                                                                                         

        denominator = calcDeterminant3x3(a_1,a_2,a_3);                                                                                                 

        printf("\na_1 = %2.2v4hlf", a_1);                                                                                                              
        printf("\na_2 = %2.2v4hlf", a_2);                                                                                                              
        printf("\na_3 = %2.2v4hlf", a_3);                                                                                                              
        printf("\ndenominator = %2.2v4hlf", denominator);        

And this is the output (not always the same, this is just an example):

a_1 = 0.00,-0.25,-1.71,-1.71
a_1 = 0.00,1.23,-1.75,-0.03
a_1 = 0.00,1.64,1.71,1.22
a_1 = 0.00,-0.27,-1.80,-1.80
a_2 = 0.00,-3.00,0.16,0.18
a_3 = 0.00,-0.02,0.00,-1.67
denominator = 0.00,-0.02,0.00,-1.67calculateProjectionOfPointToPlane() returned error
a_1 = 0.00,0.27,-1.78,-1.78
a_2 = 0.00,2.98,0.16,0.18
a_3 = 0.00,-0.03,0.00,1.67
denominator = 0.00,-0.03,0.00,1.67calculateProjectionOfPointToPlane() returned error
a_1 = 0.00,2.76,-0.20,-0.10
a_2 = 0.00,0.33,1.66,0.00
a_3 = 0.00,0.17,-0.00,1.67
denominator = 0.00,0.17,-0.00,1.67calculateProjectionOfPointToPlane() returned error
a_1 = 0.00,-0.28,-1.82,-1.82
a_2 = 0.00,-3.05,0.17,0.17
a_3 = 0.00,-0.01,0.00,-1.67
denominator = 0.00,-0.01,0.00,-1.67calculateProjectionOfPointToPlane() returned error
a_1 = 0.00,1.88,1.45,1.39
a_2 = 0.00,-2.42,1.13,1.16
a_3 = 0.00,0.13,0.00,1.67
denominator = 0.00,0.13,0.00,1.67calculateProjectionOfPointToPlane() returned errorSegmentation fault (core dumped)

I'm a bit perplexed now since none of those denominators are equal to zero.

vellamike commented 10 years ago

But, if I add the following:

    else {
        printf("\ndenominator equal to zero\n");
        pm.w = -1; // indicates error
    }

This is the (inconsistent) output:

a_1 = 0.00,-2.51,0.72,0.04
a_2 = 0.00,-1.21,-1.51,0.00
a_1 = 0.00,1.70,1.53,1.49
a_1 = 0.00,2.76,-0.20,-0.10
a_1 = 0.00,-0.28,-1.83,-0.00
a_2 = 0.00,3.06,-0.17,0.00
a_3 = 0.00,-0.00,0.00,1.67
denominator = 0.00,-0.00,0.00,1.67
denominator equal to zero
calculateProjectionOfPointToPlane() returned error
a_1 = 0.00,2.68,-0.46,-0.41
a_2 = 0.00,0.77,1.60,1.60
a_3 = 0.00,-0.08,0.00,1.67
a_1 = 0.00,-0.27,-1.68,-1.68
a_1 = 0.00,0.53,-1.69,-0.02
a_2 = 0.00,2.81,0.32,-0.04
a_3 = 0.00,0.08,0.00,1.67
denominator = 0.00,0.08,0.00,1.67
a_1 = 0.00,1.88,1.45,1.39
a_2 = 0.00,-2.42,1.13,1.16
a_3 = 0.00,0.13,0.00,1.67
denominator = 0.00,0.13,0.00,1.67
denominator equal to zero
calculateProjectionOfPointToPlane() returned error
a_1 = 0.00,0.31,-1.81,-1.81
a_2 = 0.00,3.02,0.18,0.24
a_3 = 0.00,-0.09,-0.00,1.67
denominator = 0.00,-0.09,-0.00,1.67
denominator equal to zero
calculateProjectionOfPointToPlane() returned error
a_1 = 0.00,1.53,-1.45,-1.42
a_2 = 0.00,2.43,0.91,0.94
a_3 = 0.00,-0.08,0.00,1.67
denominator = 0.00,-0.08,0.00,1.67
denominator equal to zero
calculateProjectionOfPointToPlane() returned errorSegmentation fault (core dumped)

Now I'm extremely confused, because all of those denominator values which were printed should satisfy the !=0 condition in the if statement.
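For context, the if/else under discussion looks like the usual guard around a Cramer's rule solve: divide by the determinant when it is nonzero, otherwise flag an error. The thread only shows the calcDeterminant3x3 call and the `pm.w = -1` branch, so the rest of this host-side C sketch is an assumption: the `Vec4` struct, the `solve3` and `det3` names, 0-based indexing, and treating the three vectors as matrix columns are all hypothetical stand-ins, not the kernel's actual code.

```c
/* Hypothetical result type: w is used only as an error flag (0 = ok, -1 = error),
   mirroring the pm.w = -1 convention quoted in the thread. */
typedef struct { double x, y, z, w; } Vec4;

/* 3x3 determinant of the matrix whose columns are c1, c2, c3 (0-based). */
static double det3(const double c1[3], const double c2[3], const double c3[3])
{
    return c1[0]*c2[1]*c3[2] + c1[1]*c2[2]*c3[0] + c1[2]*c2[0]*c3[1]
         - c1[2]*c2[1]*c3[0] - c1[0]*c2[2]*c3[1] - c1[1]*c2[0]*c3[2];
}

/* Solve [c1 c2 c3] * s = b by Cramer's rule.
   Assumption: the vectors are columns; the kernel may arrange them differently. */
static Vec4 solve3(const double c1[3], const double c2[3],
                   const double c3[3], const double b[3])
{
    Vec4 r = {0.0, 0.0, 0.0, 0.0};
    double denominator = det3(c1, c2, c3);

    if (denominator != 0.0) {
        r.x = det3(b, c2, c3) / denominator;
        r.y = det3(c1, b, c3) / denominator;
        r.z = det3(c1, c2, b) / denominator;
    } else {
        r.w = -1.0; /* indicates error: the system has no unique solution */
    }
    return r;
}
```

With this structure the error branch can only be reached when the determinant is exactly zero, which is why nonzero printed values alongside the error message are so surprising.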

vellamike commented 10 years ago

There was a confusing error in the print statement above; the actual output should be:

 CL_PLATFORM_VERSION [0]:   OpenCL 1.2 AMD-APP (1214.3)
CL_CONTEXT_PLATFORM [0]: CL_DEVICE_NAME [0]:    AMD Opteron(TM) Processor 6272                 
CL_CONTEXT_PLATFORM [0]: CL_DEVICE_MAX_WORK_GROUP_SIZE [0]:     1024
CL_CONTEXT_PLATFORM [0]: CL_DEVICE_MAX_COMPUTE_UNITS [0]:   64
CL_CONTEXT_PLATFORM [0]: CL_DEVICE_GLOBAL_MEM_SIZE [0]:     2032017408
CL_CONTEXT_PLATFORM [0]: CL_DEVICE_GLOBAL_MEM_CACHE_SIZE [0]:   16384
CL_CONTEXT_PLATFORM [0]: CL_DEVICE_LOCAL_MEM_SIZE [0]:  32768

[[ Step 0 ]]
_runClearBuffers:       5.495 ms
_runHashParticles:      3.001 ms
_runSort:          29.882 ms
_runSortPostPass:       1.959 ms
_runIndexx:             6.443 ms
_runIndexPostPass:      1.783 ms
_runFindNeighbors:     43.711 ms
_runPCISPH:            81.370 ms    3 iteration(s)

a_1 = 0.00,-0.28,-1.82,-1.82
a_2 = 0.00,-3.04,0.17,0.18
a_3 = 0.00,-0.02,0.00,-1.67
denominator = 0.000000
denominator equal to zero
calculateProjectionOfPointToPlane() returned error
a_1 = 0.00,-0.28,-1.81,-1.81
a_1 = 0.00,-0.28,-1.82,-1.82
a_2 = 0.00,-3.05,0.17,0.17
a_2 = 0.00,-3.02,0.17,0.18
a_3 = 0.00,-0.01,0.00,-1.67
a_3 = 0.00,-0.02,0.00,-1.67
denominator = 0.000000
denominator = 0.000000
denominator equal to zero
calculateProjectionOfPointToPlane() returned error
denominator equal to zero
calculateProjectionOfPointToPlane() returned error
a_1 = 0.00,-0.25,-1.71,-1.71
a_1 = 0.00,-0.27,-1.77,-0.01
a_1 = 0.00,2.76,-0.20,-0.10
a_1 = 0.00,1.69,-1.82,-1.16
a_1 = 0.00,-2.51,0.72,0.04
a_1 = 0.00,1.70,1.53,1.49
a_2 = 0.00,-2.56,1.02,1.04
a_1 = 0.00,1.88,1.45,1.39
a_1 = 0.00,1.59,-1.47,-0.11
a_1 = 0.00,-2.55,-0.68,-0.67
a_1 = 0.00,-1.51,1.41,0.19
a_2 = 0.00,-1.21,-1.51,0.00
a_3 = 0.00,0.07,0.00,1.67
denominator = 0.000000
a_2 = 0.00,0.33,1.66,0.00
a_1 = 0.00,0.26,-1.74,-1.74
denominator equal to zero
calculateProjectionOfPointToPlane() returned error
a_1 = 0.00,-2.78,-0.14,-0.03Segmentation fault (core dumped)

Which of course makes sense: for some reason the determinant calculation code is returning 0.0.
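As a sanity check, the same formula can be evaluated on the host using the first a_1/a_2/a_3 triple from the log above. This is only a sketch under a few assumptions: the three lines belong to the same work-item (output from different work-items is interleaved elsewhere in the log), the values are the two-decimal roundings that printf produced, and the function names are hypothetical. The data sits in components 1..3, with component 0 unused, as in the quoted formula.

```c
/* Same term-by-term formula as quoted earlier in the thread:
   data in components 1..3, component 0 unused (it prints as 0.00). */
static double det3_components_1_to_3(const double c1[4],
                                     const double c2[4],
                                     const double c3[4])
{
    return c1[1]*c2[2]*c3[3] + c1[2]*c2[3]*c3[1] + c1[3]*c2[1]*c3[2]
         - c1[3]*c2[2]*c3[1] - c1[1]*c2[3]*c3[2] - c1[2]*c2[1]*c3[3];
}

/* Values copied from the first triple in the corrected log (rounded by printf). */
static double logged_determinant(void)
{
    const double a_1[4] = {0.00, -0.28, -1.82, -1.82};
    const double a_2[4] = {0.00, -3.04,  0.17,  0.18};
    const double a_3[4] = {0.00, -0.02,  0.00, -1.67};

    return det3_components_1_to_3(a_1, a_2, a_3); /* about 9.32 */
}
```

For these inputs the formula gives roughly 9.32, nowhere near zero, so a result of 0.000000 on the device points at how the kernel evaluates the expression rather than at the input data.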

vellamike commented 10 years ago

And I think it's working. I will need to check it tomorrow since I don't have a working OpenGL install on this machine, I'll report on the fix if I have indeed fixed it.

vellamike commented 10 years ago

OK - so it's working now with both AMD OpenCL 1.2 and Intel OpenCL 1.1 for Linux! Thanks to @Neurophile, @slarson and @skhayrulin in particular for helping me get this to work.

It required quite a few changes to sphFluid.cl to remove things which were not legal OpenCL syntax.

Some observations:

> Error!: r_ij: 0.000000 
> sortedPosition[85112] : 0.000000 , 66.799995 , 835.000000 
> sortedPosition[85103] : 0.000000 , 66.799995 , 835.000000 
> Error!: r_ij: 0.000000 
> sortedPosition[85112] : 0.000000 , 66.799995 , 835.000000 
> sortedPosition[85104] : 0.000000 , 66.799995 , 835.000000 
> Error!: r_ij: 0.000000 
> sortedPosition[85112] : 0.000000 , 66.799995 , 835.000000