mkazhdan / PoissonRecon

Poisson Surface Reconstruction
MIT License

Race condition when running on ppc64le platform #190

Open piotrmaslanka opened 3 years ago

piotrmaslanka commented 3 years ago

We (Dronehub) have been using PoissonRecon as part of OpenDroneMap. Recently, we've managed to make it run under ppc64le, but have run into a problem related to PoissonRecon. Namely, when I force PoissonRecon to run two or more threads, it fails to close loops. Running it on one thread works fine. Could you point me to any place where you suspect there may be race conditions? I'll attach the example dataset later, as I need to extract it from the OpenDroneMap pipeline.

mkazhdan commented 3 years ago

Cool. I'm not familiar with ppc64le, but I have heard about problems with multi-threading on non-desktop/laptop hardware. I don't really know how to fix it. (I certainly made an effort to eliminate race conditions.)

One possibility could be that the implementation of OpenMP is not what it should be on the hardware. You could try using a different parallelization framework.

In particular, the reconstruction algorithm supports a "--parallel" flag. The default value is "0", corresponding to OpenMP. But you could also try using "1" or "2" which use two different parallelization methods, both built off of C++ supported parallelization constructs. (You could also try switching the schedule type using the "--schedule" parameter from the default dynamic, "1", to static "0".)
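To make the difference between the two schedule types concrete, here is a minimal sketch using standard C++ threads. It is not PoissonRecon's implementation: `Body`, `parallelForStatic`, and `parallelForDynamic` are hypothetical names, and the sketch only assumes the usual meaning of the terms, where a static schedule gives each thread a fixed slice up front and a dynamic schedule lets threads pull the next index from a shared counter.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical loop body type: body(thread, index) is called once per index.
using Body = std::function<void(unsigned int, std::size_t)>;

// Static schedule: every thread receives one fixed, contiguous slice up front.
void parallelForStatic(std::size_t begin, std::size_t end, unsigned int threads, const Body &body)
{
    assert(threads > 0 && begin <= end);
    const std::size_t count = end - begin;
    const std::size_t chunk = (count + threads - 1) / threads; // ceil(count / threads)
    std::vector<std::thread> pool;
    for (unsigned int t = 0; t < threads; ++t)
        pool.emplace_back([=, &body] {
            const std::size_t lo = std::min(end, begin + t * chunk);
            const std::size_t hi = std::min(end, lo + chunk);
            for (std::size_t i = lo; i < hi; ++i) body(t, i);
        });
    for (std::thread &th : pool) th.join();
}

// Dynamic schedule: threads repeatedly claim the next unprocessed index
// from a shared atomic counter until the range is exhausted.
void parallelForDynamic(std::size_t begin, std::size_t end, unsigned int threads, const Body &body)
{
    assert(threads > 0 && begin <= end);
    std::atomic<std::size_t> next(begin);
    std::vector<std::thread> pool;
    for (unsigned int t = 0; t < threads; ++t)
        pool.emplace_back([&, t] {
            for (std::size_t i = next.fetch_add(1); i < end; i = next.fetch_add(1)) body(t, i);
        });
    for (std::thread &th : pool) th.join();
}
```

Both variants call the body exactly once per index; only the assignment of indices to threads differs. A race inside the loop body can therefore show up under one schedule and stay hidden under the other without either schedule being the cause.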

piotrmaslanka commented 3 years ago

Thank you. I'll try playing around with --parallel and report back the results. By the way, this looks like issue #139.

piotrmaslanka commented 3 years ago

--parallel 1 yields the following:

[INFO]    running /code/SuperBuild/src/PoissonRecon/Bin/Linux/PoissonRecon --in /datasets/aukerman/odm_filterpoints/point_cloud.ply --out /datasets/aukerman/odm_meshing/odm_mesh.dirty.ply --depth 11 --pointWeight 4.0 --samplesPerNode 1.0 --parallel 1 --threads 159 --linearFit --verbose
*************************************************************
*************************************************************
** Running Screened Poisson Reconstruction (Version 13.72) **
*************************************************************
*************************************************************
        --in /datasets/aukerman/odm_filterpoints/point_cloud.ply
        --depth 11
        --out /datasets/aukerman/odm_meshing/odm_mesh.dirty.ply
        --verbose
        --samplesPerNode 1.000000
        --pointWeight 4.000000
        --threads 159
        --linearFit
        --parallel 1
Input Points / Samples: 8249692 / 2388397
# Read input into tree:      34.7 (s),     486.8 (MB) /     486.8 (MB) / 548 (MB)
#   Got kernel density:       1.2 (s),     552.2 (MB) /     552.2 (MB) / 552 (MB)
#     Got normal field:      10.5 (s),    1132.1 (MB) /    1132.1 (MB) / 1132 (MB)
Point weight / Estimated Measure: 5.76306e-08 / 0.475435
#Initialized point interpolation constraints:       1.1 (s),    1170.2 (MB) /    1170.2 (MB) / 1170 (MB)
#       Finalized tree:      16.8 (s),    1933.2 (MB) /    1933.2 (MB) / 1933 (MB)
#  Set FEM constraints:       2.5 (s),    1622.8 (MB) /    1933.2 (MB) / 1933 (MB)
#Set point constraints:       0.2 (s),    1622.8 (MB) /    1933.2 (MB) / 1933 (MB)
Leaf Nodes / Active Nodes / Ghost Nodes / Dirichlet Supported Nodes: 16712781 / 19088648 / 11673 / 0
Memory Usage: 1622.812 MB
Cycle[0] Depth[ 5/11]:  Updated constraints / Got system / Solved in:  0.002 /  0.040 /  0.189  (1625.000 MB)   Nodes: 35937
Cycle[0] Depth[ 6/11]:  Updated constraints / Got system / Solved in:  0.005 /  0.032 /  0.321  (1632.375 MB)   Nodes: 39384
Cycle[0] Depth[ 7/11]:  Updated constraints / Got system / Solved in:  0.009 /  0.048 /  0.550  (1641.500 MB)   Nodes: 139264
Cycle[0] Depth[ 8/11]:  Updated constraints / Got system / Solved in:  0.023 /  0.092 /  0.936  (1666.312 MB)   Nodes: 520600
Cycle[0] Depth[ 9/11]:  Updated constraints / Got system / Solved in:  0.063 /  0.210 /  1.476  (1757.000 MB)   Nodes: 1840832
Cycle[0] Depth[10/11]:  Updated constraints / Got system / Solved in:  0.155 /  0.727 /  2.916  (2131.812 MB)   Nodes: 5510320
Cycle[0] Depth[11/11]:  Updated constraints / Got system / Solved in:  0.005 /  1.457 /  5.325  (2454.875 MB)   Nodes: 10980384
# Linear system solved:      16.2 (s),    2454.9 (MB) /    2454.9 (MB) / 2454 (MB)
Got average:       0.2 (s),    1648.9 (MB) /    2454.9 (MB) / 2454 (MB)
Iso-Value: 4.992470e-01 = 4.11863e+06 / 8.24969e+06
[ERROR] Src/FEMTree.IsoSurface.specialized.inl (Line 1470)
        operator()
        Failed to close loop [8: 164 384 216] | (759336): (9508,11264,9928)[ERROR] Src/FEMTree.IsoSurface.specialized.inl (Line 1470)
        operator()
        Failed to close loop [8: 165 383 216] | (758827): (9512,11264,9924)

free(): corrupted unsorted chunks
Segmentation fault (core dumped)

I'll try now with --parallel 2

piotrmaslanka commented 3 years ago

--parallel 2 ends with the following:

[INFO]    running /code/SuperBuild/src/PoissonRecon/Bin/Linux/PoissonRecon --in /datasets/aukerman/odm_filterpoints/point_cloud.ply --out /datasets/aukerman/odm_meshing/odm_mesh.dirty.ply --depth 11 --pointWeight 4.0 --samplesPerNode 1.0 --parallel 2 --threads 159 --linearFit --verbose
*************************************************************
*************************************************************
** Running Screened Poisson Reconstruction (Version 13.72) **
*************************************************************
*************************************************************
        --in /datasets/aukerman/odm_filterpoints/point_cloud.ply
        --depth 11
        --out /datasets/aukerman/odm_meshing/odm_mesh.dirty.ply
        --verbose
        --samplesPerNode 1.000000
        --pointWeight 4.000000
        --threads 159
        --linearFit
        --parallel 2
Input Points / Samples: 8372584 / 2413718
# Read input into tree:      34.7 (s),     511.6 (MB) /     511.6 (MB) / 532 (MB)
#   Got kernel density:       1.2 (s),     559.8 (MB) /     559.8 (MB) / 559 (MB)
#     Got normal field:      13.4 (s),    1130.9 (MB) /    1130.9 (MB) / 1140 (MB)
Point weight / Estimated Measure: 5.71955e-08 / 0.478874
#Initialized point interpolation constraints:       1.0 (s),    1173.0 (MB) /    1173.0 (MB) / 1182 (MB)
#       Finalized tree:      16.6 (s),    2048.7 (MB) /    2048.7 (MB) / 2048 (MB)
#  Set FEM constraints:       3.2 (s),    1720.5 (MB) /    2048.7 (MB) / 2048 (MB)
#Set point constraints:       0.2 (s),    1720.6 (MB) /    2048.7 (MB) / 2048 (MB)
Leaf Nodes / Active Nodes / Ghost Nodes / Dirichlet Supported Nodes: 16825460 / 19217488 / 11609 / 0
Memory Usage: 1720.562 MB
Cycle[0] Depth[ 5/11]:  Updated constraints / Got system / Solved in:  0.014 /  0.246 /  1.667  (1721.375 MB)   Nodes: 35937
Cycle[0] Depth[ 6/11]:  Updated constraints / Got system / Solved in:  0.023 /  0.230 /  2.925  (1721.438 MB)   Nodes: 38992
Cycle[0] Depth[ 7/11]:  Updated constraints / Got system / Solved in:  0.023 /  0.345 /  4.115  (1729.312 MB)   Nodes: 140400
Cycle[0] Depth[ 8/11]:  Updated constraints / Got system / Solved in:  0.028 /  0.579 /  8.355  (1772.875 MB)   Nodes: 521344
Cycle[0] Depth[ 9/11]:  Updated constraints / Got system / Solved in:  0.056 /  1.076 / 14.142  (1886.188 MB)   Nodes: 1851888
Cycle[0] Depth[10/11]:  Updated constraints / Got system / Solved in:  0.144 /  2.045 / 24.197  (2167.062 MB)   Nodes: 5558344
Cycle[0] Depth[11/11]:  Updated constraints / Got system / Solved in:  0.005 /  3.965 / 40.207  (2475.750 MB)   Nodes: 11048624
# Linear system solved:     106.0 (s),    2475.8 (MB) /    2475.8 (MB) / 2485 (MB)
Got average:       0.2 (s),    1531.7 (MB) /    2475.8 (MB) / 2485 (MB)
Iso-Value: 4.992406e-01 = 4.17993e+06 / 8.37258e+06
[ERROR] Src/FEMTree.IsoSurface.specialized.inl (Line 1470)
        operator()
        Failed to close loop [8: 161 375 222] | (764899): (9484,11200,9976)
terminate called after throwing an instance of 'std::system_error'
  what():  Resource deadlock avoided
Aborted (core dumped)

I'm trying out the --schedule 0 option now

piotrmaslanka commented 3 years ago

Trying out with --schedule 0 --parallel 1 yields the following:

[INFO]    running /code/SuperBuild/src/PoissonRecon/Bin/Linux/PoissonRecon --in /datasets/aukerman/odm_filterpoints/point_cloud.ply --out /datasets/aukerman/odm_meshing/odm_mesh.dirty.ply --depth 11 --pointWeight 4.0 --samplesPerNode 1.0 --parallel 1 --schedule 0 --threads 159 --linearFit --verbose
Segmentation fault (core dumped)

piotrmaslanka commented 3 years ago

Running with --schedule 0 --parallel 2 yields the following:

*************************************************************
** Running Screened Poisson Reconstruction (Version 13.72) **
*************************************************************
*************************************************************
        --in /datasets/aukerman/odm_filterpoints/point_cloud.ply
        --depth 11
        --out /datasets/aukerman/odm_meshing/odm_mesh.dirty.ply
        --verbose
        --samplesPerNode 1.000000
        --pointWeight 4.000000
        --threads 159
        --linearFit
        --parallel 2
        --schedule 0
Input Points / Samples: 8292908 / 2401818
# Read input into tree:      34.3 (s),     477.8 (MB) /     477.8 (MB) / 536 (MB)
#   Got kernel density:       1.2 (s),     550.6 (MB) /     550.6 (MB) / 550 (MB)
#     Got normal field:      10.8 (s),    1130.4 (MB) /    1130.4 (MB) / 1139 (MB)
Point weight / Estimated Measure: 5.77709e-08 / 0.479089
#Initialized point interpolation constraints:       1.0 (s),    1171.9 (MB) /    1171.9 (MB) / 1181 (MB)
#       Finalized tree:      16.1 (s),    2038.0 (MB) /    2038.0 (MB) / 2038 (MB)
#  Set FEM constraints:       3.2 (s),    1715.9 (MB) /    2038.0 (MB) / 2038 (MB)
#Set point constraints:       0.2 (s),    1716.0 (MB) /    2038.0 (MB) / 2038 (MB)
Leaf Nodes / Active Nodes / Ghost Nodes / Dirichlet Supported Nodes: 16845606 / 19239784 / 12337 / 0
Memory Usage: 1716.000 MB
Cycle[0] Depth[ 5/11]:  Updated constraints / Got system / Solved in:  0.013 /  0.247 /  1.683  (1716.812 MB)   Nodes: 35937
Cycle[0] Depth[ 6/11]:  Updated constraints / Got system / Solved in:  0.020 /  0.244 /  2.992  (1716.875 MB)   Nodes: 40072
Cycle[0] Depth[ 7/11]:  Updated constraints / Got system / Solved in:  0.023 /  0.352 /  5.069  (1724.688 MB)   Nodes: 142584
Cycle[0] Depth[ 8/11]:  Updated constraints / Got system / Solved in:  0.029 /  0.596 /  8.794  (1766.062 MB)   Nodes: 526256
Cycle[0] Depth[ 9/11]:  Updated constraints / Got system / Solved in:  0.060 /  1.144 / 14.571  (1880.062 MB)   Nodes: 1859920
Cycle[0] Depth[10/11]:  Updated constraints / Got system / Solved in:  0.124 /  2.116 / 25.683  (2199.750 MB)   Nodes: 5561600
Cycle[0] Depth[11/11]:  Updated constraints / Got system / Solved in:  0.005 /  3.800 / 45.407  (2466.250 MB)   Nodes: 11051448
# Linear system solved:     114.6 (s),    2466.2 (MB) /    2466.2 (MB) / 2475 (MB)
Got average:       0.2 (s),    1526.6 (MB) /    2466.2 (MB) / 2475 (MB)
Iso-Value: 5.001062e-01 = 4.14733e+06 / 8.29291e+06
[ERROR] Src/FEMTree.IsoSurface.specialized.inl (Line 1470)
        operator()
        Failed to close loop [7: 149 191 105] | (241423): (10584,11264,9888)
terminate called after throwing an instance of 'std::system_error'
  what():  Resource deadlock avoided
Aborted (core dumped)

Now I'm making a last try with --schedule 0 and --parallel 0.

piotrmaslanka commented 3 years ago

I'll try recompiling everything with -O0 and without -ffast-math or -funroll-loops.

piotrmaslanka commented 3 years ago

Running with --schedule 0 and --parallel 0 yields:

*************************************************************
*************************************************************
** Running Screened Poisson Reconstruction (Version 13.72) **
*************************************************************
*************************************************************
        --in /datasets/aukerman/odm_filterpoints/point_cloud.ply
        --depth 11
        --out /datasets/aukerman/odm_meshing/odm_mesh.dirty.ply
        --verbose
        --samplesPerNode 1.000000
        --pointWeight 4.000000
        --threads 159
        --linearFit
        --schedule 0
Input Points / Samples: 8401047 / 2376568
# Read input into tree:      34.8 (s),     484.5 (MB) /     484.5 (MB) / 541 (MB)
#   Got kernel density:       1.2 (s),     554.5 (MB) /     554.5 (MB) / 554 (MB)
#     Got normal field:      12.7 (s),    1090.5 (MB) /    1090.5 (MB) / 1090 (MB)
Point weight / Estimated Measure: 5.70303e-08 / 0.479114
#Initialized point interpolation constraints:       1.0 (s),    1128.6 (MB) /    1128.6 (MB) / 1128 (MB)
#       Finalized tree:      16.2 (s),    1947.3 (MB) /    1947.3 (MB) / 1947 (MB)
#  Set FEM constraints:       2.5 (s),    1638.7 (MB) /    1947.3 (MB) / 1947 (MB)
#Set point constraints:       0.1 (s),    1645.7 (MB) /    1947.3 (MB) / 1947 (MB)
Leaf Nodes / Active Nodes / Ghost Nodes / Dirichlet Supported Nodes: 16820469 / 19210784 / 12609 / 0
Memory Usage: 1645.688 MB
Cycle[0] Depth[ 5/11]:  Updated constraints / Got system / Solved in:  0.001 /  0.043 /  0.041  (1647.312 MB)   Nodes: 35937
Cycle[0] Depth[ 6/11]:  Updated constraints / Got system / Solved in:  0.003 /  0.027 /  0.071  (1647.312 MB)   Nodes: 40520
Cycle[0] Depth[ 7/11]:  Updated constraints / Got system / Solved in:  0.005 /  0.041 /  0.114  (1651.875 MB)   Nodes: 142912
Cycle[0] Depth[ 8/11]:  Updated constraints / Got system / Solved in:  0.009 /  0.090 /  0.224  (1671.125 MB)   Nodes: 525304
Cycle[0] Depth[ 9/11]:  Updated constraints / Got system / Solved in:  0.025 /  0.203 /  0.417  (1716.750 MB)   Nodes: 1859904
Cycle[0] Depth[10/11]:  Updated constraints / Got system / Solved in:  0.065 /  0.673 /  0.915  (2000.562 MB)   Nodes: 5562240
Cycle[0] Depth[11/11]:  Updated constraints / Got system / Solved in:  0.005 /  1.221 /  1.950  (2357.438 MB)   Nodes: 11022048
# Linear system solved:       7.8 (s),    2357.4 (MB) /    2357.4 (MB) / 2357 (MB)
Got average:       0.1 (s),    1383.2 (MB) /    2357.4 (MB) / 2357 (MB)
Iso-Value: 4.997036e-01 = 4.19803e+06 / 8.40105e+06
[ERROR] Src/FEMTree.IsoSurface.specialized.inl (Line 1470)
        operator()
        Failed to close loop [8: 130 180 247] | (944800): (9236,9634,10176)

piotrmaslanka commented 3 years ago

It worked when compiled with -O0 and run with --schedule 0!

piotrmaslanka commented 3 years ago

It also worked when compiled with -O0 and run with the default schedule of 1. It printed the following:

*************************************************************
** Running Screened Poisson Reconstruction (Version 13.72) **
*************************************************************
*************************************************************
        --in /datasets/aukerman/odm_filterpoints/point_cloud.ply
        --depth 11
        --out /datasets/aukerman/odm_meshing/odm_mesh.dirty.ply
        --verbose
        --samplesPerNode 1.000000
        --pointWeight 4.000000
        --threads 159
        --linearFit
Input Points / Samples: 8310713 / 2352698
# Read input into tree:     153.9 (s),     488.4 (MB) /     488.4 (MB) / 541 (MB)
#   Got kernel density:       7.4 (s),     554.3 (MB) /     554.3 (MB) / 554 (MB)
#     Got normal field:      16.9 (s),    1125.8 (MB) /    1125.8 (MB) / 1125 (MB)
Point weight / Estimated Measure: 5.69867e-08 / 0.4736
#Initialized point interpolation constraints:       3.7 (s),    1166.4 (MB) /    1166.4 (MB) / 1166 (MB)
#       Finalized tree:      35.9 (s),    1927.6 (MB) /    1927.6 (MB) / 1927 (MB)
#  Set FEM constraints:      21.0 (s),    1662.2 (MB) /    1927.6 (MB) / 1927 (MB)
#Set point constraints:       0.9 (s),    1662.3 (MB) /    1927.6 (MB) / 1927 (MB)
Leaf Nodes / Active Nodes / Ghost Nodes / Dirichlet Supported Nodes: 16607165 / 18968168 / 11449 / 0
Memory Usage: 1662.312 MB
Cycle[0] Depth[ 5/11]:  Updated constraints / Got system / Solved in:  0.015 /  0.314 /  0.064  (1692.812 MB)   Nodes: 35937
Cycle[0] Depth[ 6/11]:  Updated constraints / Got system / Solved in:  0.044 /  0.154 /  0.108  (1692.812 MB)   Nodes: 39996
Cycle[0] Depth[ 7/11]:  Updated constraints / Got system / Solved in:  0.080 /  0.241 /  0.200  (1692.812 MB)   Nodes: 140784
Cycle[0] Depth[ 8/11]:  Updated constraints / Got system / Solved in:  0.169 /  0.511 /  0.415  (1694.000 MB)   Nodes: 520424
Cycle[0] Depth[ 9/11]:  Updated constraints / Got system / Solved in:  0.367 /  1.291 /  0.966  (1715.562 MB)   Nodes: 1840240
Cycle[0] Depth[10/11]:  Updated constraints / Got system / Solved in:  0.901 /  3.378 /  2.297  (2013.688 MB)   Nodes: 5495056
Cycle[0] Depth[11/11]:  Updated constraints / Got system / Solved in:  0.473 /  7.351 /  4.517  (2217.375 MB)   Nodes: 10873824
# Linear system solved:      32.8 (s),    2217.4 (MB) /    2217.4 (MB) / 2217 (MB)
Got average:       0.9 (s),    1379.4 (MB) /    2217.4 (MB) / 2217 (MB)
Iso-Value: 5.009327e-01 = 4.16311e+06 / 8.31071e+06
[WARNING] Src/FEMTree.IsoSurface.specialized.inl (Line 1886)
          Extract
          bad average roots: 6
Vertices / Polygons: 3097409 / 6193723
Corners / Vertices / Edges / Surface / Set Table / Copy Finer: 4.0 / 29.8 / 11.2 / 3.7 / 15.2 / 0.6 (s)
#        Got triangles:     157.4 (s),    2588.9 (MB) /    2588.9 (MB) / 2589 (MB)
#          Total Solve:     455.9 (s),    2588.9 (MB)

I will experiment a bit to see how much optimization I can enable before it starts breaking.

For the compilations I also replace -DRELEASE with -DDEBUG, using a tool called interceptor. My current interceptor configuration for g++ is as follows:

{
    "args_to_append": [
        "-DVL_DISABLE_SSE2",
        "-maltivec",
        "-mvsx",
        "-DNO_WARN_X86_INTRINSICS",
        "-lboost_system",
        "-lboost_filesystem",
        "-O1"
    ],
    "args_to_disable": [
        "-O0", "-O2", "-O3", "-funroll-loops", "-ffast-math",
        "-mfpmath=sse",
        "-msse",
        "-msse4.1",
        "-mssse3",
        "-msse2"
    ],
    "args_to_prepend": [],
    "args_to_replace": [
        [
            "-march=native",
            "-mcpu=native"
        ],
        ["-DRELEASE", "-DDEBUG"]
    ],
    "display_before_start": false,
    "notify_about_actions": false
}

One hint: On -O0 PoissonRecon spent a lot of time single-threaded, then it escalated to all 160 threads that I have, and then went again single-threaded, and so on.

piotrmaslanka commented 3 years ago

A build compiled with -O1 fails with the following:

*************************************************************
** Running Screened Poisson Reconstruction (Version 13.72) **
*************************************************************
*************************************************************
        --in /datasets/aukerman/odm_filterpoints/point_cloud.ply
        --depth 11
        --out /datasets/aukerman/odm_meshing/odm_mesh.dirty.ply
        --verbose
        --samplesPerNode 1.000000
        --pointWeight 4.000000
        --threads 159
        --linearFit
Input Points / Samples: 8110991 / 2386928
# Read input into tree:      38.5 (s),     464.9 (MB) /     464.9 (MB) / 546 (MB)
#   Got kernel density:       1.4 (s),     534.4 (MB) /     534.4 (MB) / 546 (MB)
#     Got normal field:      11.3 (s),    1112.8 (MB) /    1112.8 (MB) / 1112 (MB)
Point weight / Estimated Measure: 5.85973e-08 / 0.475282
#Initialized point interpolation constraints:       1.1 (s),    1152.6 (MB) /    1152.6 (MB) / 1152 (MB)
#       Finalized tree:      17.1 (s),    1909.2 (MB) /    1909.2 (MB) / 1909 (MB)
#  Set FEM constraints:       2.9 (s),    1585.3 (MB) /    1909.2 (MB) / 1909 (MB)
#Set point constraints:       0.1 (s),    1595.0 (MB) /    1909.2 (MB) / 1909 (MB)
Leaf Nodes / Active Nodes / Ghost Nodes / Dirichlet Supported Nodes: 16722056 / 19099448 / 11473 / 0
Memory Usage: 1595.000 MB
Cycle[0] Depth[ 5/11]:  Updated constraints / Got system / Solved in:  0.002 /  0.047 /  0.045  (1623.750 MB)   Nodes: 35937
Cycle[0] Depth[ 6/11]:  Updated constraints / Got system / Solved in:  0.003 /  0.035 /  0.072  (1627.812 MB)   Nodes: 39880
Cycle[0] Depth[ 7/11]:  Updated constraints / Got system / Solved in:  0.006 /  0.056 /  0.117  (1627.812 MB)   Nodes: 140504
Cycle[0] Depth[ 8/11]:  Updated constraints / Got system / Solved in:  0.015 /  0.116 /  0.234  (1637.750 MB)   Nodes: 523048
Cycle[0] Depth[ 9/11]:  Updated constraints / Got system / Solved in:  0.034 /  0.306 /  0.475  (1719.250 MB)   Nodes: 1849472
Cycle[0] Depth[10/11]:  Updated constraints / Got system / Solved in:  0.089 /  0.988 /  1.084  (1985.938 MB)   Nodes: 5536592
Cycle[0] Depth[11/11]:  Updated constraints / Got system / Solved in:  0.017 /  1.935 /  2.064  (2255.688 MB)   Nodes: 10952008
# Linear system solved:       9.4 (s),    2255.7 (MB) /    2255.7 (MB) / 2255 (MB)
Got average:       0.1 (s),    1404.3 (MB) /    2255.7 (MB) / 2255 (MB)
Iso-Value: 4.988781e-01 = 4.0464e+06 / 8.11099e+06
[ERROR] Src/FEMTree.IsoSurface.specialized.inl (Line 1470)
        operator()
        Failed to close loop [9: 279 186 488] | (2725493): (9311,8938,10148)

piotrmaslanka commented 3 years ago

I will do some research into which flags are safe to compile with, but for now the fastest results come from the default compilation flags and --threads 1.

piotrmaslanka commented 3 years ago

Note that this post will be edited to provide you with the latest of my findings. -ffast-math seems to be safe to compile with, but PoissonRecon outputted the following warning:

*************************************************************
** Running Screened Poisson Reconstruction (Version 13.72) **
*************************************************************
*************************************************************
        --in /datasets/aukerman/odm_filterpoints/point_cloud.ply
        --depth 11
        --out /datasets/aukerman/odm_meshing/odm_mesh.dirty.ply
        --verbose
        --samplesPerNode 1.000000
        --pointWeight 4.000000
        --threads 159
        --linearFit
Input Points / Samples: 8199413 / 2395705
# Read input into tree:     144.2 (s),     486.0 (MB) /     486.0 (MB) / 531 (MB)
#   Got kernel density:       7.5 (s),     555.5 (MB) /     555.5 (MB) / 555 (MB)
#     Got normal field:      16.9 (s),    1137.0 (MB) /    1137.0 (MB) / 1137 (MB)
Point weight / Estimated Measure: 5.7875e-08 / 0.474541
#Initialized point interpolation constraints:       3.8 (s),    1178.4 (MB) /    1178.4 (MB) / 1178 (MB)
#       Finalized tree:      38.9 (s),    1933.4 (MB) /    1933.4 (MB) / 1933 (MB)
#  Set FEM constraints:      21.7 (s),    1637.9 (MB) /    1933.4 (MB) / 1933 (MB)
#Set point constraints:       0.9 (s),    1669.2 (MB) /    1933.4 (MB) / 1933 (MB)
Leaf Nodes / Active Nodes / Ghost Nodes / Dirichlet Supported Nodes: 16761361 / 19142768 / 13073 / 0
Memory Usage: 1669.188 MB
Cycle[0] Depth[ 5/11]:  Updated constraints / Got system / Solved in:  0.015 /  0.336 /  0.064  (1698.625 MB)   Nodes: 35937
Cycle[0] Depth[ 6/11]:  Updated constraints / Got system / Solved in:  0.043 /  0.162 /  0.116  (1701.688 MB)   Nodes: 40316
Cycle[0] Depth[ 7/11]:  Updated constraints / Got system / Solved in:  0.081 /  0.253 /  0.270  (1704.438 MB)   Nodes: 141328
Cycle[0] Depth[ 8/11]:  Updated constraints / Got system / Solved in:  0.162 /  0.511 /  0.415  (1708.562 MB)   Nodes: 526208
Cycle[0] Depth[ 9/11]:  Updated constraints / Got system / Solved in:  0.368 /  1.300 /  0.934  (1764.125 MB)   Nodes: 1860200
Cycle[0] Depth[10/11]:  Updated constraints / Got system / Solved in:  0.905 /  3.429 /  2.288  (1990.688 MB)   Nodes: 5534280
Cycle[0] Depth[11/11]:  Updated constraints / Got system / Solved in:  0.471 /  7.396 /  4.362  (2265.938 MB)   Nodes: 10982576
# Linear system solved:      33.2 (s),    2265.9 (MB) /    2265.9 (MB) / 2265 (MB)
Got average:       0.9 (s),    1369.1 (MB) /    2265.9 (MB) / 2265 (MB)
Iso-Value: 4.999796e-01 = 4.09954e+06 / 8.19941e+06
[WARNING] Src/FEMTree.IsoSurface.specialized.inl (Line 1886)
          Extract
          bad average roots: 4
Vertices / Polygons: 3118225 / 6235431
Corners / Vertices / Edges / Surface / Set Table / Copy Finer: 4.5 / 30.4 / 11.5 / 3.8 / 14.7 / 0.5 (s)
#        Got triangles:     161.0 (s),    2728.4 (MB) /    2728.4 (MB) / 2728 (MB)
#          Total Solve:     454.2 (s),    2728.4 (MB)

-ftree-vectorize is also safe to use. Note that neither flag brings any actual benefit in terms of computation time.

piotrmaslanka commented 3 years ago

Possibly related to FFTW/fftw3#145. Possibly related to MONGO-27906.

piotrmaslanka commented 3 years ago

Compiling this with -O1 is possible if I then explicitly tell GCC not to perform any of the optimisations it would normally perform at that level. I cannot point to a single flag that causes the problem: if I leave any of them enabled, the software starts failing to close loops or, worse yet, segfaults.

piotrmaslanka commented 3 years ago

It is still possible, though, to compile with -O0 and then manually enable specific optimizations. I'm working to figure out the maximum set of optimizations that will work.
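The search described here can be sketched as a simple greedy loop. This is only an illustration under stated assumptions: `BuildAndRun` and `findWorkingFlags` are hypothetical names, and the predicate stands in for "rebuild with these flags, run the test dataset, and check that no loop fails to close". One caveat from GCC's own documentation is worth keeping in mind: most optimizations are disabled at -O0 even when individual optimization flags are given, so a search that starts from -O0 may report flags as safe simply because they had no effect.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical predicate: build with the given extra flags, run the test
// dataset, and return true when the reconstruction finishes without errors.
using BuildAndRun = std::function<bool(const std::vector<std::string> &)>;

// Greedy search: try each candidate once and keep it only if the build
// still works. The result depends on the order of the candidates and is
// not guaranteed to be the largest working set.
std::vector<std::string> findWorkingFlags(const std::vector<std::string> &candidates, const BuildAndRun &works)
{
    std::vector<std::string> kept;
    const bool baselineOk = works(kept); // the build without extra flags must work
    assert(baselineOk);
    (void)baselineOk;
    for (const std::string &flag : candidates)
    {
        kept.push_back(flag);
        if (!works(kept)) kept.pop_back(); // this flag breaks the build: drop it
    }
    return kept;
}
```

Because the failure discussed in this thread is nondeterministic, a predicate used for this purpose would probably need to run the reconstruction several times before declaring a flag set safe.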

piotrmaslanka commented 3 years ago

So far, compiling with -O0 and the following flags is feasible:

Note that when -DRELEASE is not replaced with -DDEBUG this fails to close loops.

piotrmaslanka commented 3 years ago

Here's the final interceptor configuration with which I managed to compile PoissonRecon under ppc64le so that it actually worked:

You have to intercept g++ and then put the following in /etc/interceptor.d/g++:

{
    "args_to_append": [
        "-DVL_DISABLE_SSE2",
        "-maltivec",
        "-mabi=altivec",
        "-mvsx",
        "-DNO_WARN_X86_INTRINSICS",
        "-lboost_system",
        "-lboost_filesystem",
        "-ftree-dce",
        "-ffast-math",
        "-ftree-dse",
        "-ftree-vectorize",
        "-funroll-loops",
        "-fdce",
        "-fmerge-constants",
        "-fmove-loop-invariants",
        "-fomit-frame-pointer",
        "-freorder-blocks",
        "-fshrink-wrap",
        "-fshrink-wrap-separate",
        "-fsplit-wide-types",
        "-fforward-propagate",
        "-fguess-branch-probability",
        "-fif-conversion",
        "-fif-conversion2",
        "-finline-functions-called-once",
        "-fipa-icf",
        "-fipa-profile",
        "-fipa-pure-const",
        "-fipa-reference",
        "-fipa-reference-addressable",
        "-fauto-inc-dec",
        "-fbranch-count-reg",
        "-fcombine-stack-adjustments",
        "-fcompare-elim",
        "-fcprop-registers",
        "-fssa-backprop",
        "-fssa-phiopt",
        "-ftree-bit-ccp",
        "-ftree-ccp",
        "-ftree-ch",
        "-ftree-coalesce-vars",
        "-ftree-copy-prop",
        "-ftree-forwprop",
        "-ftree-fre",
        "-ftree-phiprop",
        "-ftree-pta",
        "-ftree-scev-cprop",
        "-ftree-sink",
        "-ftree-slsr",
        "-ftree-sra",
        "-ftree-ter",
        "-funit-at-a-time"
    ],
    "args_to_disable": [
        "-O0",
        "-O2",
        "-O1",
        "-O3",
        "-mfpmath=sse",
        "-msse",
        "-msse4.1",
        "-mssse3",
        "-msse2"
    ],
    "args_to_prepend": [],
    "args_to_replace": [
        [
            "-g",
            "-g0"
        ],
        [
            "-march=native",
            "-mcpu=native"
        ],
        [
            "-DRELEASE",
            "-DDEBUG"
        ]
    ],
    "deduplication": false,
    "display_before_start": true,
    "log": false,
    "notify_about_actions": false
}

It's still slower than a single threaded PoissonRecon on -O3 though.

pierotofy commented 3 years ago

As a heads up, this seems to happen on arm64 (e.g. Apple M1) too. Same error. I will try to investigate the issue further and open a PR if I find a fix.

pierotofy commented 3 years ago

I think I narrowed this down to the following lines: https://github.com/OpenDroneMap/PoissonRecon/blob/3c9227d4f66f5b07d5db1343d9d6806250dff23b/Src/FEMTree.IsoSurface.specialized.inl#L850-L866

Adding a "#pragma omp critical" or removing the parent Parallel_for loop prevents the errors from happening on arm64, but also slows down computation on x64 (for obvious reasons).

It looks to me that vIndex could be identical between multiple threads (?), thus causing _getCornerValues to compute different values from different threads.

More debugging is needed.

What's really strange is that this doesn't happen on x64. If this was just a race condition it would happen on all machines. There might be something else going on.
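One possible explanation, offered only as a hypothesis and not as a diagnosis of the linked lines: a data race can stay hidden on x86-64 because that hardware orders memory operations more strongly, while arm64 and ppc64le allow more reordering and so expose it. The sketch below shows the safe form of the pattern, where one thread computes a value and another reads it. `Slot`, `publish`, `consume`, and `roundTrip` are hypothetical names and do not come from PoissonRecon.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical shared slot: a value computed by one thread and read by another.
struct Slot
{
    double value = 0.0;             // plain data, written before publication
    std::atomic<bool> ready{false}; // publication flag
};

// Writer: store the data, then publish with release ordering so the write
// to value cannot become visible after the flag does.
void publish(Slot &slot, double v)
{
    slot.value = v;
    slot.ready.store(true, std::memory_order_release);
}

// Reader: wait for the flag; the acquire load guarantees that the write to
// value made before the matching release store is visible afterwards.
double consume(const Slot &slot)
{
    while (!slot.ready.load(std::memory_order_acquire)) std::this_thread::yield();
    return slot.value;
}

// Runs one reader and one writer thread and returns what the reader saw.
double roundTrip(double v)
{
    Slot slot;
    double seen = 0.0;
    std::thread reader([&] { seen = consume(slot); });
    std::thread writer([&] { publish(slot, v); });
    writer.join();
    reader.join();
    assert(slot.ready.load()); // the flag must have been published by now
    return seen;
}
```

If the flag were a plain `bool`, or both sides used `std::memory_order_relaxed`, the reader could observe the flag as set while still seeing a stale `value` on weakly ordered hardware. On x86-64 the same mistake often goes unnoticed, although the compiler is still free to reorder the accesses.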

piotrmaslanka commented 3 years ago

@pierotofy I could arrange access to a POWER9 machine for you if that would help.

mkazhdan commented 3 years ago

Thanks. Things are a bit busy right now, but I would definitely be interested in taking the opportunity a little later down the road.

piotrmaslanka commented 3 years ago

I'll whip you up access as soon as I figure out how to install a normal Debian on an AC922 that IBM sold us. It ships without even a C compiler, can you believe it? Give me a holler at piotr.maslanka@dronehub.ai for details.

cpheinrich commented 3 years ago

> I think I narrowed this down to the following lines: https://github.com/OpenDroneMap/PoissonRecon/blob/3c9227d4f66f5b07d5db1343d9d6806250dff23b/Src/FEMTree.IsoSurface.specialized.inl#L850-L866
>
> Adding a "#pragma omp critical" or removing the parent Parallel_for loop prevents the errors from happening on arm64, but also slows down computation on x64 (for obvious reasons).
>
> It looks to me that vIndex could be identical between multiple threads (?), thus causing _getCornerValues to compute different values from different threads.
>
> More debugging is needed.
>
> What's really strange is that this doesn't happen on x64. If this was just a race condition it would happen on all machines. There might be something else going on.

I have also encountered the same issue on arm64. Adding the #pragma omp critical did not fix it for me, but changing the parallel_for loop on line 829 to a serial for loop did fix the issue and I am now able to run poisson reconstruction with multiple threads on arm w/o crashing. Given this is just one of many parallel_for loops, the overall slowdown by making this one serial is negligible. Have been wanting a fix for this for a while -- thanks for sharing!
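For readers unfamiliar with the `#pragma omp critical` workaround mentioned above, the following sketch shows the general pattern of guarding a shared slot that more than one loop iteration may touch. It is not the code at the linked lines: `computeCornerValue` and `fillCache` are hypothetical stand-ins, and the per-vertex computation is a placeholder.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an expensive per-vertex computation.
double computeCornerValue(std::size_t vIndex) { return 0.5 * static_cast<double>(vIndex); }

// Fills a cache shared by all iterations: requests[i] is the vertex index
// that iteration i needs, and different iterations may name the same index.
std::vector<double> fillCache(const std::vector<std::size_t> &requests, std::size_t vertexCount)
{
    std::vector<double> cache(vertexCount, 0.0);
    std::vector<char> filled(vertexCount, 0);
    const long long n = static_cast<long long>(requests.size());
    // Without -fopenmp the pragmas are ignored and the loop runs serially.
#pragma omp parallel for
    for (long long i = 0; i < n; ++i)
    {
        const std::size_t v = requests[static_cast<std::size_t>(i)];
        assert(v < vertexCount);
        // Only one thread at a time may test and update the shared slots.
#pragma omp critical
        {
            if (!filled[v])
            {
                cache[v] = computeCornerValue(v);
                filled[v] = 1;
            }
        }
    }
    return cache;
}
```

The critical section serializes every access to the cache, which is why it costs performance. Replacing the parallel loop with a plain `for` loop, the change that worked in the comment above, amounts to building the same code without OpenMP for that one loop.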

cpheinrich commented 3 years ago

It's also worth mentioning that the run time on arm was a few percent faster (4.69 s vs 4.829 s on one sample) when using TBB for multithreading. I added the following to Parallel_for in MyMiscellany.h just before the #ifdef _OPENMP:

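      // Requires <tbb/parallel_for.h>, <tbb/blocked_range.h> and <tbb/task_arena.h>.
      bool use_tbb = true;
      if (use_tbb) {
          tbb::parallel_for(
              tbb::blocked_range<std::size_t>(begin, static_cast<std::size_t>(end)),
              [&](const tbb::blocked_range<std::size_t> &r) {
              for (std::size_t idx = r.begin(); idx != r.end(); ++idx) {
                  // Index of the calling thread within the current task arena.
                  // (Older TBB releases also accepted tbb::task_arena::current_thread_index().)
                  int thread = tbb::this_task_arena::current_thread_index();
                  iterationFunction(thread, idx);
              }
          });
      }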

So if you already have TBB linked in your project, it's an easy change, but if you don't, it's probably not worth it.

piotrmaslanka commented 2 years ago

Sorry guys, but I no longer work at the place where I had access to an AC922. I can put you in touch with some people from that place (Dronehub) who might show marginal interest, but since nobody was able to take the topic over and set up a new machine, the project basically died. Sorry for the commotion.

piotrmaslanka commented 2 years ago

That's right, @cpheinrich and @ababo. I will consider this closed until I know otherwise. Thank you in advance 🙇

piotrmaslanka commented 2 years ago

@ababo what about the PowerPC versions? Will that work as well?